Re: [gdal-dev] filter out wanted messages in a grib2 file
Thanks, Even! Can gdalinfo also report all the band info that GDALGetRasterCount() + GDALGetRasterBand() + GDALGetMetadata() + GDALGetDescription() provide when used together? > On Mar 19, 2019, at 12:44 PM, Even Rouault wrote: > >> On dimanche 10 mars 2019 21:19:30 CET Zhan Zhang - NOAA Affiliate wrote: >> I have a grib2 file which contains many messages, and those messages define >> different products on different surfaces (like a z axis). For >> instance, some messages define "soil temperature" (product name) on a >> surface called "depth below land surface" (surface name); and other >> messages define "geopotential height" (product name) on a "pressure >> surface" (surface name); etc. May I ask how I can filter out all those >> messages that define "soil temperature" (product name) on a surface called >> "depth below land surface" (surface name)? Are there some grib2 tools >> provided in the gdal APIs that I can use? Thanks! > > Zhan, > > Multiple GRIB messages in a single file are seen by GDAL as a multi-band > dataset. > So you need to iterate over the bands ( GDALGetRasterCount() + > GDALGetRasterBand() for the C API ) and fetch their metadata ( > GDALGetMetadata() > ) and description ( GDALGetDescription() ) to see if the band is of interest > to you. > > The output of gdalinfo should help you understand this. > > If using the Python API, you may also analyze the result of > gdal.Info('my.grb2', format = 'json'), > which is a Python dictionary, to retrieve the band you are interested in. > > Best regards, > > Even > > -- > Spatialys - Geospatial professional services > http://www.spatialys.com ___ gdal-dev mailing list gdal-dev@lists.osgeo.org https://lists.osgeo.org/mailman/listinfo/gdal-dev
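A minimal Python sketch of the band iteration Even describes, assuming the GDAL Python bindings are available; the metadata keys (GRIB_COMMENT) and the strings being matched are assumptions and should be checked against the actual gdalinfo output of the file:

  from osgeo import gdal

  ds = gdal.Open("my.grb2")
  wanted = []
  for i in range(1, ds.RasterCount + 1):
      band = ds.GetRasterBand(i)
      md = band.GetMetadata()        # per-band GRIB metadata dictionary
      desc = band.GetDescription()   # usually carries the level/surface text
      if ("soil temperature" in md.get("GRIB_COMMENT", "").lower()
              and "depth below land surface" in desc.lower()):
          wanted.append(i)           # band numbers of the matching messages
  print(wanted)

Each matching band can then be read with band.ReadAsArray() or extracted with gdal.Translate using its bandList option.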
Re: [gdal-dev] filter out wanted messages in a grib2 file
When I say "filter out", I mean "retrieve". On Sun, Mar 10, 2019 at 9:19 PM Zhan Zhang - NOAA Affiliate < zhan.j.zh...@noaa.gov> wrote: > I have a grib2 file which contains many messages, and those messages > define different products on different surfaces (like a z axis). For > instance, some messages define "soil temperature" (product name) on a > surface called "depth below land surface" (surface name); and other > messages define "geopotential height" (product name) on a "pressure > surface" (surface name); etc. May I ask how I can filter out all those > messages that define "soil temperature" (product name) on a surface called > "depth below land surface" (surface name)? Are there some grib2 tools > provided in the gdal APIs that I can use? Thanks! > ___ gdal-dev mailing list gdal-dev@lists.osgeo.org https://lists.osgeo.org/mailman/listinfo/gdal-dev
[gdal-dev] filter out wanted messages in a grib2 file
I have a grib2 file which contains many messages, and those messages define different products on different surfaces (like a z axis). For instance, some messages define "soil temperature" (product name) on a surface called "depth below land surface" (surface name); and other messages define "geopotential height" (product name) on a "pressure surface" (surface name); etc. May I ask how I can filter out all those messages that define "soil temperature" (product name) on a surface called "depth below land surface" (surface name)? Are there some grib2 tools provided in the gdal APIs that I can use? Thanks! ___ gdal-dev mailing list gdal-dev@lists.osgeo.org https://lists.osgeo.org/mailman/listinfo/gdal-dev
Re: [gdal-dev] grib driver
May I ask whether the gdal grib driver supports changing the grid in a GRIB2 message? For instance, given a GRIB2 message, I want to change from a polar stereographic grid projection to a lambert conformal grid projection, with the data section also changed accordingly, and then output this to a new GRIB2 message. I asked the MDL degrib group and they told me degrib cannot do this. It sounds likely that the gdal grib driver cannot do this either. However, I am not 100% sure and want to seek some comments here. Thanks! On Thu, Feb 28, 2019 at 10:27 AM Even Rouault wrote: > On jeudi 28 février 2019 09:29:41 CET Zhan Zhang - NOAA Affiliate wrote: > > I am new to the gdal apis and am interested in getting some knowledge of > > the grib driver. May I ask whether it basically provides the same > > functionality as degrib from MDL/NWS/NOAA? Thanks! --Zhan > > The GDAL GRIB driver strongly relies on degrib+g2clib. It exposes their > functionality through the general GDAL raster API, so as to be able to > manipulate a GRIB dataset as any other GDAL raster. > > Even > > -- > Spatialys - Geospatial professional services > http://www.spatialys.com > ___ gdal-dev mailing list gdal-dev@lists.osgeo.org https://lists.osgeo.org/mailman/listinfo/gdal-dev
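The thread does not confirm whether GDAL can write a reprojected GRIB2 message. As a rough sketch only, one could try warping through the Python bindings, assuming a GDAL build whose GRIB driver has write (CreateCopy) support and substituting a real Lambert conformal definition for the illustrative PROJ string below:

  from osgeo import gdal

  lcc = "+proj=lcc +lat_1=33 +lat_2=45 +lat_0=39 +lon_0=-96 +datum=WGS84"  # illustrative only
  # Warp to an in-memory dataset first, then let the GRIB driver's CreateCopy()
  # write the result as a new GRIB2 file.
  warped = gdal.Warp("", "in.grb2", dstSRS=lcc, format="MEM")
  gdal.Translate("out.grb2", warped, format="GRIB")

Whether the data section of the resulting message is encoded the way you need would still have to be verified.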
[gdal-dev] grib driver
I am new to the gdal apis and am interested in getting some knowledge of the grib driver. May I ask whether it basically provides the same functionality as degrib from MDL/NWS/NOAA? Thanks! --Zhan ___ gdal-dev mailing list gdal-dev@lists.osgeo.org https://lists.osgeo.org/mailman/listinfo/gdal-dev
[jira] [Created] (SPARK-23306) Race condition in TaskMemoryManager
Zhan Zhang created SPARK-23306: -- Summary: Race condition in TaskMemoryManager Key: SPARK-23306 URL: https://issues.apache.org/jira/browse/SPARK-23306 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.1 Reporter: Zhan Zhang There is a race condition in TaskMemoryManager, which may cause OOM. The memory released may be taken by another task because there is a gap between releaseMemory and acquireMemory (e.g., in UnifiedMemoryManager), causing an OOM if the current task is the only one that can perform the spill. It can happen to BytesToBytesMap, as it only spills the required bytes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21505) A dynamic join operator to improve the join reliability
[ https://issues.apache.org/jira/browse/SPARK-21505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236365#comment-16236365 ] Zhan Zhang commented on SPARK-21505: Any comments on this feature? Do you think the design is OK? If so, we are going to submit a PR. > A dynamic join operator to improve the join reliability > --- > > Key: SPARK-21505 > URL: https://issues.apache.org/jira/browse/SPARK-21505 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0, 2.3.0, 3.0.0 >Reporter: Lin >Priority: Major > Labels: features > > As we know, hash join is more efficient than sort merge join. But today hash > join is not so widely used because it may fail with an OutOfMemory (OOM) error > due to limited memory resources, data skew, statistics mis-estimation and so > on. For example, if we apply shuffle hash join on an unevenly distributed > dataset, some partitions might be so large that we cannot build a hash table > for a particular partition, causing an OOM error. When OOM happens, current > Spark technology will throw an exception, resulting in job failure. On the > other hand, if sort-merge join is used, there will be shuffle, sorting and > extra spill, causing degradation of the join. Considering the efficiency > of hash join, we want to propose a fallback mechanism to dynamically use hash > join or sort-merge join at runtime at the task level to provide a more reliable > join operation. > This new dynamic join operator internally implements the logic of HashJoin, > Iterator Reconstruct, Sort, and MergeJoin. We show the process of this > dynamic join method as follows: > HashJoin: We start by building a hash table on one side of the join partitions. > If the hash table is built successfully, it behaves the same as the current > ShuffledHashJoin operator. > Sort: If we fail to build the hash table due to the large partition size, we do > SortMergeJoin only on this partition, but we first need to rebuild the partition's > iterator. When OOM happens, a hash table corresponding to a partial part of this partition has > been built successfully (e.g. the first 4000 rows of the RDD), and the iterator of > this partition is now pointing to the 4001st row of the partition. We reuse this > hash table to reconstruct the iterator for the first 4000 rows and > concatenate it with the remaining rows of this partition so that we can rebuild this > partition completely. On this rebuilt partition, we apply sorting based on > key values. > MergeJoin: After getting two sorted iterators, we perform a regular merge join > against them and emit the records to downstream operators. > Iterator Reconstruct: BytesToBytesMap has to be spilled to disk to release > the memory for other operators, such as Sort, Join, etc. In addition, it has > to be converted to an Iterator, so that it can be concatenated with the remaining > items in the original iterator that is used to build the hash table. > Meta Data Population: Necessary metadata, such as sorting keys, jointype, > etc., has to be populated, so that it can be used by the potential Sort and > MergeJoin operators. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
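An illustrative sketch of the fallback flow described in the issue, written in plain Python rather than Spark code; the (key, value) row format and the row-count cutoff standing in for the OOM condition are assumptions made for the example:

{code}
def dynamic_join(build_rows, stream_rows, max_hash_rows=100000):
    build_iter = iter(build_rows)          # rows are (key, value) tuples
    consumed, table = [], {}
    ok = True
    for row in build_iter:                 # HashJoin: build phase
        consumed.append(row)
        table.setdefault(row[0], []).append(row[1])
        if len(consumed) > max_hash_rows:  # stand-in for the OOM condition
            ok = False
            break
    if ok:                                 # plain shuffled hash join (probe phase)
        return [(k, b, v) for k, v in stream_rows for b in table.get(k, [])]
    # Fallback: iterator reconstruction + sort + merge join on this partition.
    rebuilt = sorted(consumed + list(build_iter))   # rebuild the full build side
    probed = sorted(stream_rows)
    out, i = [], 0
    for k, v in probed:                    # lock-step walk over the sorted rows
        while i < len(rebuilt) and rebuilt[i][0] < k:
            i += 1
        j = i
        while j < len(rebuilt) and rebuilt[j][0] == k:
            out.append((k, rebuilt[j][1], v))
            j += 1
    return out
{code}

For example, dynamic_join([(1, 'a'), (2, 'b')], [(1, 'x'), (3, 'y')]) returns [(1, 'a', 'x')] regardless of which path is taken.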
[jira] [Commented] (SPARK-21492) Memory leak in SortMergeJoin
[ https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095425#comment-16095425 ] Zhan Zhang commented on SPARK-21492: root cause: In SortMergeJoin (inner/leftOuter/rightOuter), one side of the SortedIter may not be exhausted; that chunk of memory thus cannot be released, causing a memory leak and performance degradation. > Memory leak in SortMergeJoin > > > Key: SPARK-21492 > URL: https://issues.apache.org/jira/browse/SPARK-21492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Reporter: Zhan Zhang > > In SortMergeJoin, if the iterator is not exhausted, there will be a memory leak > caused by the Sort. The memory is not released until the task ends, and cannot > be used by other operators, causing a performance drop or OOM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21492) Memory leak in SortMergeJoin
Zhan Zhang created SPARK-21492: -- Summary: Memory leak in SortMergeJoin Key: SPARK-21492 URL: https://issues.apache.org/jira/browse/SPARK-21492 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Zhan Zhang In SortMergeJoin, if the iterator is not exhausted, there will be a memory leak caused by the Sort. The memory is not released until the task ends, and cannot be used by other operators, causing a performance drop or OOM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-20215) ReuseExchange is broken in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-20215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-20215: --- Comment: was deleted (was: Seems to be fixed in SPARK-20229) > ReuseExchange is broken in SparkSQL > -- > > Key: SPARK-20215 > URL: https://issues.apache.org/jira/browse/SPARK-20215 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Reporter: Zhan Zhang >Priority: Minor > > Currently if we have a query like: A join B union A join C... with the same > join key, table A will be scanned multiple times in SQL. It is because the > MetastoreRelation is not shared by the two joins, and the ExprId is different. > canonicalized in Expression will not be able to unify them, so the two Exchanges > will not be compatible and cannot be reused. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20215) ReuseExchange is broken in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-20215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968734#comment-15968734 ] Zhan Zhang commented on SPARK-20215: Seems to be fixed in SPARK-20229 > ReuseExchange is broken in SparkSQL > -- > > Key: SPARK-20215 > URL: https://issues.apache.org/jira/browse/SPARK-20215 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Reporter: Zhan Zhang >Priority: Minor > > Currently if we have a query like: A join B union A join C... with the same > join key, table A will be scanned multiple times in SQL. It is because the > MetastoreRelation is not shared by the two joins, and the ExprId is different. > canonicalized in Expression will not be able to unify them, so the two Exchanges > will not be compatible and cannot be reused. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20215) ReuseExchange is broken in SparkSQL
Zhan Zhang created SPARK-20215: -- Summary: ReuseExchange is broken in SparkSQL Key: SPARK-20215 URL: https://issues.apache.org/jira/browse/SPARK-20215 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Zhan Zhang Priority: Minor Currently if we have a query like: A join B union A join C... with the same join key, table A will be scanned multiple times in SQL. It is because the MetastoreRelation is not shared by the two joins, and the ExprId is different. canonicalized in Expression will not be able to unify them, so the two Exchanges will not be compatible and cannot be reused. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20006) Separate threshold for broadcast and shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-20006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931054#comment-15931054 ] Zhan Zhang edited comment on SPARK-20006 at 3/18/17 4:42 AM: - The default ShuffledHashJoin threshold can fallback to the broadcast one. A separate configuration does provide us opportunities to optimize the join dramatically. It would be great if CBO can automatically find the best strategy. But probably I miss something. Currently the CBO does not collect right statistics, especially for partitioned table. I have opened a JIRA for that issue as well. https://issues.apache.org/jira/browse/SPARK-19890 was (Author: zhzhan): The default ShuffledHashJoin threshold can fallback to the broadcast one. A separate configuration does provide us opportunities to optimize the join dramatically. It would be great if CBO can automatically find the best strategy. But probably I miss something. Currently the CBO does not collect right statistics, especially for partitioned table. https://issues.apache.org/jira/browse/SPARK-19890 > Separate threshold for broadcast and shuffled hash join > --- > > Key: SPARK-20006 > URL: https://issues.apache.org/jira/browse/SPARK-20006 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Reporter: Zhan Zhang >Priority: Minor > > Currently both canBroadcast and canBuildLocalHashMap use the same > configuration: AUTO_BROADCASTJOIN_THRESHOLD. > But the memory model may be different. For broadcast, currently the hash map > is always build on heap. For shuffledHashJoin, the hash map may be build on > heap(longHash), or off heap(other map if off heap is enabled). The same > configuration makes the configuration hard to tune (how to allocate memory > onheap/offheap). Propose to use different configuration. Please comments > whether it is reasonable. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20006) Separate threshold for broadcast and shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-20006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931054#comment-15931054 ] Zhan Zhang commented on SPARK-20006: The default ShuffledHashJoin threshold can fall back to the broadcast one. A separate configuration does provide us opportunities to optimize the join dramatically. It would be great if CBO could automatically find the best strategy. But probably I am missing something. Currently the CBO does not collect the right statistics, especially for partitioned tables. https://issues.apache.org/jira/browse/SPARK-19890 > Separate threshold for broadcast and shuffled hash join > --- > > Key: SPARK-20006 > URL: https://issues.apache.org/jira/browse/SPARK-20006 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Reporter: Zhan Zhang >Priority: Minor > > Currently both canBroadcast and canBuildLocalHashMap use the same > configuration: AUTO_BROADCASTJOIN_THRESHOLD. > But the memory model may be different. For broadcast, currently the hash map > is always built on heap. For shuffledHashJoin, the hash map may be built on > heap (longHash) or off heap (another map, if off heap is enabled). The same > configuration makes this hard to tune (how to allocate memory > on heap/off heap). Propose to use different configurations. Please comment on > whether this is reasonable. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20006) Separate threshold for broadcast and shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-20006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-20006: --- Description: Currently both canBroadcast and canBuildLocalHashMap use the same configuration: AUTO_BROADCASTJOIN_THRESHOLD. But the memory model may be different. For broadcast, currently the hash map is always build on heap. For shuffledHashJoin, the hash map may be build on heap(longHash), or off heap(other map if off heap is enabled). The same configuration makes the configuration hard to tune (how to allocate memory onheap/offheap). Propose to use different configuration. Please comments whether it is reasonable. was: Currently both canBroadcast and canBuildLocalHashMap use the same configuration: AUTO_BROADCASTJOIN_THRESHOLD. But the memory model may be different. For broadcast, currently the hash map is always build on heap. For shuffledHashJoin, the hash map may be build on heap(longHash), or off heap(other map if off heap is enabled). The same configuration makes the configuration hard to tune (how to allocate memory onheap/offheap). Propose to use different configuration. > Separate threshold for broadcast and shuffled hash join > --- > > Key: SPARK-20006 > URL: https://issues.apache.org/jira/browse/SPARK-20006 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Reporter: Zhan Zhang >Priority: Minor > > Currently both canBroadcast and canBuildLocalHashMap use the same > configuration: AUTO_BROADCASTJOIN_THRESHOLD. > But the memory model may be different. For broadcast, currently the hash map > is always build on heap. For shuffledHashJoin, the hash map may be build on > heap(longHash), or off heap(other map if off heap is enabled). The same > configuration makes the configuration hard to tune (how to allocate memory > onheap/offheap). Propose to use different configuration. Please comments > whether it is reasonable. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20006) Separate threshold for broadcast and shuffled hash join
Zhan Zhang created SPARK-20006: -- Summary: Separate threshold for broadcast and shuffled hash join Key: SPARK-20006 URL: https://issues.apache.org/jira/browse/SPARK-20006 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Zhan Zhang Priority: Minor Currently both canBroadcast and canBuildLocalHashMap use the same configuration: AUTO_BROADCASTJOIN_THRESHOLD. But the memory model may be different. For broadcast, currently the hash map is always built on heap. For shuffledHashJoin, the hash map may be built on heap (longHash) or off heap (another map, if off heap is enabled). The same configuration makes this hard to tune (how to allocate memory on heap/off heap). Propose to use different configurations. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
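For reference, the single threshold discussed in this issue is exposed to users as spark.sql.autoBroadcastJoinThreshold. A short PySpark illustration of tuning what exists today; the separate shuffled-hash-join threshold is only proposed here, so it is not shown:

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# One setting (in bytes) currently drives both canBroadcast and
# canBuildLocalHashMap; -1 disables broadcast joins entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)
# Shuffled hash join is only considered when sort merge join is not preferred.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
{code}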
[jira] [Created] (SPARK-19908) Direct buffer memory OOM should not cause stage retries.
Zhan Zhang created SPARK-19908: -- Summary: Direct buffer memory OOM should not cause stage retries. Key: SPARK-19908 URL: https://issues.apache.org/jira/browse/SPARK-19908 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 2.1.0 Reporter: Zhan Zhang Priority: Minor Currently if there is java.lang.OutOfMemoryError: Direct buffer memory, the exception will be changed to FetchFailedException, causing stage retries. org.apache.spark.shuffle.FetchFailedException: Direct buffer memory at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:40) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:731) at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:692) at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:854) at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:887) at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.OutOfMemoryError: Direct buffer memory at java.nio.Bits.reserveMemory(Bits.java:658) at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at 
io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511
[jira] [Updated] (SPARK-19890) Make MetastoreRelation statistics estimation more accurately
[ https://issues.apache.org/jira/browse/SPARK-19890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-19890: --- Description: Currently the MetastoreRelation statistics is retrieved on the analyze phase, and the size is based on the table scope. But for partitioned table, this statistics is not useful as table size may 100x+ larger than the input partition size. As a result, the join optimization techniques is not applicable. It would be great if we can postpone the statistics to the optimization phase to get partition information but before physical plan generation phase so that JoinSelection can choose better join methd (broadcast, shuffledjoin, or sortmerjoin). Although the metastorerelation does not associated with partitions, but through PhysicalOperation we can get the partition info for the table. Multiple plan can use the same meatstorerelation, but the estimation is still much better than table size. This way, retrieving statistics is straightforward. Another possible way is to have a another data structure associating the metastore relation and partitions with the plan to get most accurate estimation. was: Currently the MetastoreRelation statistics is retrieved on the analyze phase, and the size is based on the table scope. But for partitioned table, this statistics is not useful as table size may 100x+ larger than the input partition size. As a result, the join optimization techniques is not applicable. It would be great if we can postpone the statistics to the optimization phase to get partition information but before physical plan generation phase so that JoinSelection can choose better join methd (broadcast, shuffledjoin, or sortmerjoin). Although the metastorerelation does not associated with partitions, but through PhysicalOperation we can get the partition info for the table. Although multiple plan can use the same meatstorerelation, but the estimation still much better than table size. This way, retrieving statistics is straightforward. Another possible way is to have a another data structure associating the metastore relation and partitions with the plan to get most accurate estimation. > Make MetastoreRelation statistics estimation more accurately > > > Key: SPARK-19890 > URL: https://issues.apache.org/jira/browse/SPARK-19890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Reporter: Zhan Zhang >Priority: Minor > > Currently the MetastoreRelation statistics is retrieved on the analyze phase, > and the size is based on the table scope. But for partitioned table, this > statistics is not useful as table size may 100x+ larger than the input > partition size. As a result, the join optimization techniques is not > applicable. > It would be great if we can postpone the statistics to the optimization phase > to get partition information but before physical plan generation phase so > that JoinSelection can choose better join methd (broadcast, shuffledjoin, or > sortmerjoin). > Although the metastorerelation does not associated with partitions, but > through PhysicalOperation we can get the partition info for the table. > Multiple plan can use the same meatstorerelation, but the estimation is still > much better than table size. This way, retrieving statistics is > straightforward. > Another possible way is to have a another data structure associating the > metastore relation and partitions with the plan to get most accurate > estimation. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19890) Make MetastoreRelation statistics estimation more accurately
Zhan Zhang created SPARK-19890: -- Summary: Make MetastoreRelation statistics estimation more accurately Key: SPARK-19890 URL: https://issues.apache.org/jira/browse/SPARK-19890 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Zhan Zhang Priority: Minor Currently the MetastoreRelation statistics are retrieved in the analyze phase, and the size is based on the table scope. But for a partitioned table, these statistics are not useful, as the table size may be 100x+ larger than the input partition size. As a result, the join optimization techniques are not applicable. It would be great if we could postpone the statistics to the optimization phase (but before the physical plan generation phase) to get partition information, so that JoinSelection can choose a better join method (broadcast, shuffled hash join, or sort merge join). Although the MetastoreRelation is not associated with partitions, through PhysicalOperation we can get the partition info for the table. Although multiple plans can use the same MetastoreRelation, the estimation is still much better than the table size. This way, retrieving statistics is straightforward. Another possible way is to have another data structure associating the MetastoreRelation and partitions with the plan to get the most accurate estimation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19839) Fix memory leak in BytesToBytesMap
[ https://issues.apache.org/jira/browse/SPARK-19839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897975#comment-15897975 ] Zhan Zhang commented on SPARK-19839: When BytesToBytesMap spills, its longArray should be released. Otherwise, it may not be released until the task completes. This array may take a significant amount of memory, which cannot be used by later operators, such as UnsafeShuffleExternalSorter, resulting in more frequent spills in the sorter. This patch releases the array, as the destructive iterator will not use this array anymore. > Fix memory leak in BytesToBytesMap > -- > > Key: SPARK-19839 > URL: https://issues.apache.org/jira/browse/SPARK-19839 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 > Reporter: Zhan Zhang > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19839) Fix memory leak in BytesToBytesMap
Zhan Zhang created SPARK-19839: -- Summary: Fix memory leak in BytesToBytesMap Key: SPARK-19839 URL: https://issues.apache.org/jira/browse/SPARK-19839 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Zhan Zhang -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19815) Not orderable should be applied to right key instead of left key
[ https://issues.apache.org/jira/browse/SPARK-19815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895522#comment-15895522 ] Zhan Zhang commented on SPARK-19815: I thought about the logic again. On the surface, the logic may be correct, since in the join the left and right keys should have the same type. Please close this JIRA. > Not orderable should be applied to right key instead of left key > > > Key: SPARK-19815 > URL: https://issues.apache.org/jira/browse/SPARK-19815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Reporter: Zhan Zhang >Priority: Minor > > When generating ShuffledHashJoinExec, the orderable condition should be > applied to the right key instead of the left key. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19815) Not orderable should be applied to right key instead of left key
[ https://issues.apache.org/jira/browse/SPARK-19815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-19815: --- Summary: Not orderable should be applied to right key instead of left key (was: Not order able should be applied to right key instead of left key) > Not orderable should be applied to right key instead of left key > > > Key: SPARK-19815 > URL: https://issues.apache.org/jira/browse/SPARK-19815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Reporter: Zhan Zhang >Priority: Minor > > When generating ShuffledHashJoinExec, the orderable condition should be > applied to right key instead of left key. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19815) Not order able should be applied to right key instead of left key
Zhan Zhang created SPARK-19815: -- Summary: Not order able should be applied to right key instead of left key Key: SPARK-19815 URL: https://issues.apache.org/jira/browse/SPARK-19815 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Zhan Zhang Priority: Minor When generating ShuffledHashJoinExec, the orderable condition should be applied to right key instead of left key. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19354) Killed tasks are getting marked as FAILED
[ https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862588#comment-15862588 ] Zhan Zhang commented on SPARK-19354: This fix is actually critical. In production, we found that this behavior can cause job retry and failure especially speculation is enabled. /cc [~rxin] Specifically we observe that: When sorter spill to disk, the task is killed. Then a interruptedExecption is thrown. Then OOM will be thrown, which cause the unhandledexception in executor, and eventually shutdown the executor. It happens a lot in speculative tasks. With healthy tasks in the same executor marked as failed as well. Retries will happen. Even worse, such retries may fail again due to same reason, eventually causing job failure. 17/02/11 15:39:38 ERROR TaskMemoryManager: error while calling spill() on org.apache.spark.shuffle.sort.ShuffleExternalSorter@714b17b0 java.nio.channels.ClosedByInterruptException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202) at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269) at org.apache.spark.storage.DiskBlockObjectWriter.commitAndGet(DiskBlockObjectWriter.scala:178) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:186) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:254) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:171) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:245) at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:359) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:382) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:241) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:162) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.shuffle.sort.ShuffleExternalSorter@714b17b0 : null at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:180) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:245) at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:359) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:382) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:241) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:162) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at 
org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) > Killed tasks are getting marked as FAILED > - > > Key: SPARK-19354 > URL: https://issues.apache.org/jira/browse/SPARK-19354 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Reporter: Devaraj K >Priority: Minor > > When we enable speculation, we can see there are multiple attempts running > for the same task when the first task progress is slow. If any of the task > attempt succeeds then the other attempts will be killed, during killing the > attempts those attempts are getting marked as failed due to the below error. > We need to handle this error and mark the at
[jira] [Commented] (SPARK-13450) SortMergeJoin will OOM when join rows have lot of same keys
[ https://issues.apache.org/jira/browse/SPARK-13450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15812550#comment-15812550 ] Zhan Zhang commented on SPARK-13450: ExternalAppendOnlyMap estimate the size of the data saved. In SortMergeJoin, I think we can leverage UnsafeExternalSorter to get more accurate and controllable behavior. > SortMergeJoin will OOM when join rows have lot of same keys > --- > > Key: SPARK-13450 > URL: https://issues.apache.org/jira/browse/SPARK-13450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.2, 2.1.0 >Reporter: Hong Shen > Attachments: heap-dump-analysis.png > > > When I run a sql with join, task throw java.lang.OutOfMemoryError and sql > failed. I have set spark.executor.memory 4096m. > SortMergeJoin use a ArrayBuffer[InternalRow] to store bufferedMatches, if > the join rows have a lot of same key, it will throw OutOfMemoryError. > {code} > /** Buffered rows from the buffered side of the join. This is empty if > there are no matches. */ > private[this] val bufferedMatches: ArrayBuffer[InternalRow] = new > ArrayBuffer[InternalRow] > {code} > Here is the stackTrace: > {code} > org.xerial.snappy.SnappyNative.arrayCopy(Native Method) > org.xerial.snappy.Snappy.arrayCopy(Snappy.java:84) > org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190) > org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163) > java.io.DataInputStream.readFully(DataInputStream.java:195) > java.io.DataInputStream.readLong(DataInputStream.java:416) > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71) > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79) > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136) > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123) > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:84) > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoin.scala:300) > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.bufferMatchingRows(SortMergeJoin.scala:329) > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoin.scala:229) > org.apache.spark.sql.execution.joins.SortMergeJoin$$anonfun$doExecute$1$$anon$1.advanceNext(SortMergeJoin.scala:105) > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88) > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > org.apache.spark.scheduler.Task.run(Task.scala:89) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15722901#comment-15722901 ] Zhan Zhang commented on HBASE-15335: [~tedyu] It seems that I do not have permission to reassign. Could you help on this? Thanks. > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, > HBASE-15335-6.patch, HBASE-15335-7.patch, HBASE-15335-8.patch, > HBASE-15335-9.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-18637) Stateful UDF should be considered as nondeterministic
[ https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706961#comment-15706961 ] Zhan Zhang commented on SPARK-18637: [~hvanhovell] It is an annotation. /** * UDFType annotations are used to describe properties of a UDF. This gives * important information to the optimizer. * If the UDF is not deterministic, or if it is stateful, it is necessary to * annotate it as such for correctness. * */ @Public @Evolving @Target(ElementType.TYPE) @Retention(RetentionPolicy.RUNTIME) @Inherited public @interface UDFType { /** * Certain optimizations should not be applied if UDF is not deterministic. * Deterministic UDF returns same result each time it is invoked with a * particular input. This determinism just needs to hold within the context of * a query. * * @return true if the UDF is deterministic */ boolean deterministic() default true; /** * If a UDF stores state based on the sequence of records it has processed, it * is stateful. A stateful UDF cannot be used in certain expressions such as * case statement and certain optimizations such as AND/OR short circuiting * don't apply for such UDFs, as they need to be invoked for each record. * row_sequence is an example of stateful UDF. A stateful UDF is considered to * be non-deterministic, irrespective of what deterministic() returns. * * @return true */ boolean stateful() default false; /** * A UDF is considered distinctLike if the UDF can be evaluated on just the * distinct values of a column. Examples include min and max UDFs. This * information is used by metadata-only optimizer. * * @return true if UDF is distinctLike */ boolean distinctLike() default false; /** * Using in analytical functions to specify that UDF implies an ordering * * @return true if the function implies order */ boolean impliesOrder() default false; } > Stateful UDF should be considered as nondeterministic > - > > Key: SPARK-18637 > URL: https://issues.apache.org/jira/browse/SPARK-18637 > Project: Spark > Issue Type: Bug > Components: SQL > Reporter: Zhan Zhang > > If the annotation UDFType of a udf is stateful, it shoudl be considered as > non-deterministic. Otherwise, the catalyst may optimize the plan and return > the wrong result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18637) Stateful UDF should be considered as nondeterministic
[ https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706961#comment-15706961 ] Zhan Zhang edited comment on SPARK-18637 at 11/29/16 11:52 PM: --- [~hvanhovell] It is an annotation. /** * UDFType annotations are used to describe properties of a UDF. This gives * important information to the optimizer. * If the UDF is not deterministic, or if it is stateful, it is necessary to * annotate it as such for correctness. * */ was (Author: zhzhan): [~hvanhovell] It is an annotation. /** * UDFType annotations are used to describe properties of a UDF. This gives * important information to the optimizer. * If the UDF is not deterministic, or if it is stateful, it is necessary to * annotate it as such for correctness. * */ @Public @Evolving @Target(ElementType.TYPE) @Retention(RetentionPolicy.RUNTIME) @Inherited public @interface UDFType { /** * Certain optimizations should not be applied if UDF is not deterministic. * Deterministic UDF returns same result each time it is invoked with a * particular input. This determinism just needs to hold within the context of * a query. * * @return true if the UDF is deterministic */ boolean deterministic() default true; /** * If a UDF stores state based on the sequence of records it has processed, it * is stateful. A stateful UDF cannot be used in certain expressions such as * case statement and certain optimizations such as AND/OR short circuiting * don't apply for such UDFs, as they need to be invoked for each record. * row_sequence is an example of stateful UDF. A stateful UDF is considered to * be non-deterministic, irrespective of what deterministic() returns. * * @return true */ boolean stateful() default false; /** * A UDF is considered distinctLike if the UDF can be evaluated on just the * distinct values of a column. Examples include min and max UDFs. This * information is used by metadata-only optimizer. * * @return true if UDF is distinctLike */ boolean distinctLike() default false; /** * Using in analytical functions to specify that UDF implies an ordering * * @return true if the function implies order */ boolean impliesOrder() default false; } > Stateful UDF should be considered as nondeterministic > - > > Key: SPARK-18637 > URL: https://issues.apache.org/jira/browse/SPARK-18637 > Project: Spark > Issue Type: Bug > Components: SQL > Reporter: Zhan Zhang > > If the annotation UDFType of a udf is stateful, it shoudl be considered as > non-deterministic. Otherwise, the catalyst may optimize the plan and return > the wrong result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18637) Stateful UDF should be considered as nondeterministic
[ https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-18637: --- Component/s: SQL > Stateful UDF should be considered as nondeterministic > - > > Key: SPARK-18637 > URL: https://issues.apache.org/jira/browse/SPARK-18637 > Project: Spark > Issue Type: Bug > Components: SQL > Reporter: Zhan Zhang > > If the annotation UDFType of a udf is stateful, it should be considered as > non-deterministic. Otherwise, the catalyst may optimize the plan and return > the wrong result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18637) Stateful UDF should be considered as nondeterministic
[ https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706905#comment-15706905 ] Zhan Zhang commented on SPARK-18637: Here is the comments from UDFType /** * If a UDF stores state based on the sequence of records it has processed, it * is stateful. A stateful UDF cannot be used in certain expressions such as * case statement and certain optimizations such as AND/OR short circuiting * don't apply for such UDFs, as they need to be invoked for each record. * row_sequence is an example of stateful UDF. A stateful UDF is considered to * be non-deterministic, irrespective of what deterministic() returns. * * @return true */ boolean stateful() default false; > Stateful UDF should be considered as nondeterministic > - > > Key: SPARK-18637 > URL: https://issues.apache.org/jira/browse/SPARK-18637 > Project: Spark > Issue Type: Bug > Reporter: Zhan Zhang > > If the annotation UDFType of a udf is stateful, it shoudl be considered as > non-deterministic. Otherwise, the catalyst may optimize the plan and return > the wrong result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18637) Stateful UDF should be considered as nondeterministic
Zhan Zhang created SPARK-18637: -- Summary: Stateful UDF should be considered as nondeterministic Key: SPARK-18637 URL: https://issues.apache.org/jira/browse/SPARK-18637 Project: Spark Issue Type: Bug Reporter: Zhan Zhang If the annotation UDFType of a udf is stateful, it should be considered as non-deterministic. Otherwise, the catalyst may optimize the plan and return the wrong result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18550) Make the queue capacity of LiveListenerBus configurable.
[ https://issues.apache.org/jira/browse/SPARK-18550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15688361#comment-15688361 ] Zhan Zhang commented on SPARK-18550: I was not aware it has been fixed already. Please help to close it. > Make the queue capacity of LiveListenerBus configurable. > > > Key: SPARK-18550 > URL: https://issues.apache.org/jira/browse/SPARK-18550 > Project: Spark > Issue Type: Bug > Components: Spark Core > Reporter: Zhan Zhang >Priority: Minor > > We meet issues that driver listener bus cannot catch up the speed of incoming > event. Current value is fixed as 1000. This value should be configurable per > job. Otherwise, when event is dropped, the UI is totally useless. > Bus: Dropping SparkListenerEvent because no remaining room in event queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18550) Make the queue capacity of LiveListenerBus configurable.
Zhan Zhang created SPARK-18550: -- Summary: Make the queue capacity of LiveListenerBus configurable. Key: SPARK-18550 URL: https://issues.apache.org/jira/browse/SPARK-18550 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Zhan Zhang Priority: Minor We have met issues where the driver listener bus cannot catch up with the speed of incoming events. The current value is fixed at 1000. This value should be configurable per job. Otherwise, when events are dropped, the UI is totally useless. Bus: Dropping SparkListenerEvent because no remaining room in event queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
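As the comment above notes, a configuration for this already exists. A hedged PySpark sketch of setting it; the property name differs across Spark versions (spark.scheduler.listenerbus.eventqueue.size around 2.0/2.1, spark.scheduler.listenerbus.eventqueue.capacity in later releases), so check the version in use:

{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("listener-bus-capacity-demo")
        # Property name is version dependent; ".capacity" applies to newer Spark.
        .set("spark.scheduler.listenerbus.eventqueue.capacity", "20000"))
sc = SparkContext(conf=conf)
{code}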
[jira] [Commented] (SPARK-17637) Packed scheduling for Spark tasks across executors
[ https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687474#comment-15687474 ] Zhan Zhang commented on SPARK-17637: [~hvanhovell] Thanks. PR is updated with conflicts resolved. > Packed scheduling for Spark tasks across executors > -- > > Key: SPARK-17637 > URL: https://issues.apache.org/jira/browse/SPARK-17637 > Project: Spark > Issue Type: Improvement > Components: Scheduler > Reporter: Zhan Zhang > Assignee: Zhan Zhang >Priority: Minor > > Currently Spark scheduler implements round robin scheduling for tasks to > executors. Which is great as it distributes the load evenly across the > cluster, but this leads to significant resource waste in some cases, > especially when dynamic allocation is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
SparkPlan/Shuffle stage reuse with Dataset/DataFrame
Hi Folks, We have some Dataset/DataFrame use cases that would benefit from reusing the SparkPlan and shuffle stage, for example the following case. Because the query optimization and SparkPlan are generated by Catalyst when the query is executed, the underlying RDD lineage is regenerated for dataset1, and thus the shuffle stage will be executed multiple times. val dataset1 = dataset.groupby.agg dataset1.registerTempTable("tmpTable") spark.sql("select * from tmpTable where condition").collect spark.sql("select * from tmpTable where condition1").collect On the one hand, we get an optimized query plan, but on the other hand, we cannot reuse the data generated by the shuffle stage. Currently, to reuse dataset1, we have to use persist to cache the data. It is helpful but sometimes not what we want, as it has some side effects. For example, we cannot release an executor that has an active cache in it, even if it is idle and dynamic allocation is enabled. In other words, we only want to reuse the shuffle data as much as possible, without caching, in a long pipeline with multiple shuffle stages. I am wondering whether it makes sense to add a new feature to Dataset/DataFrame to act as a barrier and prevent query optimization from happening across the barrier. For example, in the above case, we want Catalyst to take tmpTable as a barrier and stop optimization across it, so that we can reuse the underlying RDD lineage of dataset1. The prototype code to make it work is quite small, and we tried it in house with a new API, Dataset.cacheShuffle, to make this happen. But I want some feedback from the community before opening a JIRA, as in some sense it does stop the optimization earlier. Any comments? -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/SparkPlan-Shuffle-stage-reuse-with-Dataset-DataFrame-tp19502.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
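A small PySpark sketch of the persist-based workaround mentioned in the post (the proposed Dataset.cacheShuffle API is in-house and not shown); the column names, predicates and storage level are just placeholders:

  from pyspark.sql import SparkSession, functions as F
  from pyspark import StorageLevel

  spark = SparkSession.builder.getOrCreate()
  df = spark.range(0, 1000000).withColumn("k", F.col("id") % 100)

  agg = df.groupBy("k").agg(F.count("*").alias("cnt"))
  agg.persist(StorageLevel.MEMORY_AND_DISK)   # materialize once, reuse below
  agg.createOrReplaceTempView("tmpTable")

  spark.sql("select * from tmpTable where cnt > 10").collect()
  spark.sql("select * from tmpTable where k < 50").collect()

Without the persist call, each spark.sql query above re-runs the groupBy shuffle, which is exactly the re-computation the post is describing.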
[jira] [Commented] (SPARK-17637) Packed scheduling for Spark tasks across executors
[ https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15516851#comment-15516851 ] Zhan Zhang commented on SPARK-17637: [~jerryshao] The idea is straightforward. Instead of doing round robin over the executors with available cores, the new scheduling tries to allocate tasks to the executors with the fewest available cores. As a result, the executors with more free resources may not get new tasks allocated; with dynamic allocation enabled, those executors may then be released so that other jobs can obtain the resources they need from the underlying resource manager. The approach is not specifically bound to dynamic allocation, but dynamic allocation is an easy way to understand the gains of the new scheduler. In addition, the patch (soon to be sent out) also contains another scheduler that does exactly the opposite, allocating tasks to the executors with the most available cores in order to balance the workload across all executors. > Packed scheduling for Spark tasks across executors > -- > > Key: SPARK-17637 > URL: https://issues.apache.org/jira/browse/SPARK-17637 > Project: Spark > Issue Type: Improvement > Components: Scheduler > Reporter: Zhan Zhang >Priority: Minor > > Currently Spark scheduler implements round robin scheduling for tasks to > executors. Which is great as it distributes the load evenly across the > cluster, but this leads to significant resource waste in some cases, > especially when dynamic allocation is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
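To make the two policies concrete, here is a hedged, self-contained sketch; the names and data structures below are illustrative assumptions, not actual Spark scheduler internals:

// Packed: prefer the executor with the fewest free cores that can still fit the task,
// so lightly loaded executors stay empty and can be released under dynamic allocation.
// Balanced: the opposite policy, spreading tasks onto the executor with the most free cores.
case class ExecutorSlot(id: String, freeCores: Int)

def pickPacked(execs: Seq[ExecutorSlot], coresPerTask: Int): Option[ExecutorSlot] =
  execs.filter(_.freeCores >= coresPerTask).sortBy(_.freeCores).headOption

def pickBalanced(execs: Seq[ExecutorSlot], coresPerTask: Int): Option[ExecutorSlot] =
  execs.filter(_.freeCores >= coresPerTask).sortBy(-_.freeCores).headOption

val execs = Seq(ExecutorSlot("e1", 4), ExecutorSlot("e2", 1), ExecutorSlot("e3", 2))
println(pickPacked(execs, 1))    // Some(ExecutorSlot(e2,1)) -- pack onto the busiest executor that still fits
println(pickBalanced(execs, 1))  // Some(ExecutorSlot(e1,4)) -- spread onto the least busy executor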
[jira] [Commented] (SPARK-17637) Packed scheduling for Spark tasks across executors
[ https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15515247#comment-15515247 ] Zhan Zhang commented on SPARK-17637: cc [~rxin] A quick prototype shows that for a tested pipeline, the job can save around 45% regarding the reserved cpu and memory when the dynamic allocation is enabled. > Packed scheduling for Spark tasks across executors > -- > > Key: SPARK-17637 > URL: https://issues.apache.org/jira/browse/SPARK-17637 > Project: Spark > Issue Type: Improvement > Components: Scheduler > Reporter: Zhan Zhang >Priority: Minor > > Currently Spark scheduler implements round robin scheduling for tasks to > executors. Which is great as it distributes the load evenly across the > cluster, but this leads to significant resource waste in some cases, > especially when dynamic allocation is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17637) Packed scheduling for Spark tasks across executors
[ https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514105#comment-15514105 ] Zhan Zhang commented on SPARK-17637: The plan is to introduce a new configuration so that different scheduling algorithms can be used for the task scheduling. > Packed scheduling for Spark tasks across executors > -- > > Key: SPARK-17637 > URL: https://issues.apache.org/jira/browse/SPARK-17637 > Project: Spark > Issue Type: Improvement > Components: Scheduler > Reporter: Zhan Zhang >Priority: Minor > > Currently Spark scheduler implements round robin scheduling for tasks to > executors. Which is great as it distributes the load evenly across the > cluster, but this leads to significant resource waste in some cases, > especially when dynamic allocation is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
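Purely as an illustration of what such "a new configuration" could look like, a hypothetical sketch; the property name and values below are not actual Spark settings:

import org.apache.spark.SparkConf

// Hypothetical property and values, shown only to make the proposal concrete.
val conf = new SparkConf().set("spark.task.assigner", "packed") // or "balanced" / "roundrobin"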
[jira] [Created] (SPARK-17637) Packed scheduling for Spark tasks across executors
Zhan Zhang created SPARK-17637: -- Summary: Packed scheduling for Spark tasks across executors Key: SPARK-17637 URL: https://issues.apache.org/jira/browse/SPARK-17637 Project: Spark Issue Type: Improvement Components: Scheduler Reporter: Zhan Zhang Priority: Minor Currently the Spark scheduler implements round-robin scheduling of tasks to executors, which is great as it distributes the load evenly across the cluster, but it leads to significant resource waste in some cases, especially when dynamic allocation is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17526) Display the executor log links with the job failure message on Spark UI and Console
Zhan Zhang created SPARK-17526: -- Summary: Display the executor log links with the job failure message on Spark UI and Console Key: SPARK-17526 URL: https://issues.apache.org/jira/browse/SPARK-17526 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Zhan Zhang Priority: Minor Display the executor log links with the job failure message on the Spark UI and console: "Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, HostName): java.lang.Exception: foo" To make this failure message more helpful, the driver log and the web UI should also include a link to the log of the executor on which the task failed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15378457#comment-15378457 ] Zhan Zhang commented on HBASE-15335: [~tedyu] The scaladoc warning seems to be a false positive, as I didn't see the map in the comments of method apply in object HBaseTableCatalog. > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, > HBASE-15335-6.patch, HBASE-15335-7.patch, HBASE-15335-8.patch, > HBASE-15335-9.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: HBASE-15335-9.patch > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, > HBASE-15335-6.patch, HBASE-15335-7.patch, HBASE-15335-8.patch, > HBASE-15335-9.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: HBASE-15335-8.patch > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, > HBASE-15335-6.patch, HBASE-15335-7.patch, HBASE-15335-8.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Anyone knows the hive repo for spark-2.0?
I saw the pom file having hive version as 1.2.1.spark2. But I cannot find the branch in https://github.com/pwendell/ Does anyone know where the repo is? Thanks. Zhan Zhang -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Anyone-knows-the-hive-repo-for-spark-2-0-tp18234.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Status: Patch Available (was: Open) > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, > HBASE-15335-6.patch, HBASE-15335-7.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Status: Open (was: Patch Available) > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, > HBASE-15335-6.patch, HBASE-15335-7.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Status: Open (was: Patch Available) > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, > HBASE-15335-6.patch, HBASE-15335-7.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Status: Patch Available (was: Open) > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, > HBASE-15335-6.patch, HBASE-15335-7.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: HBASE-15335-7.patch > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, > HBASE-15335-6.patch, HBASE-15335-7.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-16017) HBase TableOutputFormat has connection leak in getRecordWriter
[ https://issues.apache.org/jira/browse/HBASE-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-16017: --- Attachment: HBASE-16017-1.patch > HBase TableOutputFormat has connection leak in getRecordWriter > -- > > Key: HBASE-16017 > URL: https://issues.apache.org/jira/browse/HBASE-16017 > Project: HBase > Issue Type: Bug > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-16017-1.patch > > > Currently getRecordWriter will not release the connection until jvm > terminate, which is not a right assumption given that the function may be > invoked many times in the same jvm lifecycle. Inside of mapreduce, the issue > has already fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-16017) HBase TableOutputFormat has connection leak in getRecordWriter
[ https://issues.apache.org/jira/browse/HBASE-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-16017: --- Attachment: (was: HBASE-16017-1.patch) > HBase TableOutputFormat has connection leak in getRecordWriter > -- > > Key: HBASE-16017 > URL: https://issues.apache.org/jira/browse/HBASE-16017 > Project: HBase > Issue Type: Bug > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-16017-1.patch > > > Currently getRecordWriter will not release the connection until jvm > terminate, which is not a right assumption given that the function may be > invoked many times in the same jvm lifecycle. Inside of mapreduce, the issue > has already fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-16017) HBase TableOutputFormat has connection leak in getRecordWriter
[ https://issues.apache.org/jira/browse/HBASE-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-16017: --- Status: Patch Available (was: Open) > HBase TableOutputFormat has connection leak in getRecordWriter > -- > > Key: HBASE-16017 > URL: https://issues.apache.org/jira/browse/HBASE-16017 > Project: HBase > Issue Type: Bug > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-16017-1.patch > > > Currently getRecordWriter will not release the connection until jvm > terminate, which is not a right assumption given that the function may be > invoked many times in the same jvm lifecycle. Inside of mapreduce, the issue > has already fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-16017) HBase TableOutputFormat has connection leak in getRecordWriter
[ https://issues.apache.org/jira/browse/HBASE-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328354#comment-15328354 ] Zhan Zhang commented on HBASE-16017: [~te...@apache.org] Can you please take a look? It is a simple fix. To make the impact as small as possible, I added a new constructor, so that there is no change to any other modules. > HBase TableOutputFormat has connection leak in getRecordWriter > -- > > Key: HBASE-16017 > URL: https://issues.apache.org/jira/browse/HBASE-16017 > Project: HBase > Issue Type: Bug > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-16017-1.patch > > > Currently getRecordWriter will not release the connection until jvm > terminate, which is not a right assumption given that the function may be > invoked many times in the same jvm lifecycle. Inside of mapreduce, the issue > has already fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-16017) HBase TableOutputFormat has connection leak in getRecordWriter
[ https://issues.apache.org/jira/browse/HBASE-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-16017: --- Attachment: HBASE-16017-1.patch > HBase TableOutputFormat has connection leak in getRecordWriter > -- > > Key: HBASE-16017 > URL: https://issues.apache.org/jira/browse/HBASE-16017 > Project: HBase > Issue Type: Bug > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-16017-1.patch > > > Currently getRecordWriter will not release the connection until jvm > terminate, which is not a right assumption given that the function may be > invoked many times in the same jvm lifecycle. Inside of mapreduce, the issue > has already fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-16017) HBase TableOutputFormat has connection leak in getRecordWriter
Zhan Zhang created HBASE-16017: -- Summary: HBase TableOutputFormat has connection leak in getRecordWriter Key: HBASE-16017 URL: https://issues.apache.org/jira/browse/HBASE-16017 Project: HBase Issue Type: Bug Reporter: Zhan Zhang Assignee: Zhan Zhang Currently getRecordWriter will not release the connection until the JVM terminates, which is not a valid assumption given that the function may be invoked many times within the same JVM lifecycle. Inside MapReduce, the issue has already been fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
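As a rough, hedged sketch of the pattern the fix is after (not the actual patch), the writer should own its connection and release it in close() rather than leaving it open until JVM shutdown. Class and method names below are illustrative:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{BufferedMutator, Connection, ConnectionFactory, Put}

// Illustrative only: a writer that closes its own connection when it is closed,
// so each getRecordWriter-style call can create and later release its connection.
class ClosingHBaseWriter(conf: Configuration, table: String) {
  private val connection: Connection = ConnectionFactory.createConnection(conf)
  private val mutator: BufferedMutator = connection.getBufferedMutator(TableName.valueOf(table))

  def write(put: Put): Unit = mutator.mutate(put)

  def close(): Unit = {
    mutator.close()      // flush pending mutations
    connection.close()   // release the connection here, once per writer lifecycle
  }
}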
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: HBASE-15335-6.patch > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, > HBASE-15335-6.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: HBASE-15335-5.patch > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (SPARK-15848) Spark unable to read partitioned table in avro format and column name in upper case
[ https://issues.apache.org/jira/browse/SPARK-15848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-15848: --- Affects Version/s: 1.6.1 > Spark unable to read partitioned table in avro format and column name in > upper case > --- > > Key: SPARK-15848 > URL: https://issues.apache.org/jira/browse/SPARK-15848 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Zhan Zhang > > If external partitioned Hive tables created in Avro format. > Spark is returning "null" values if columns names are in Uppercase in the > Avro schema. > The same tables return proper data when queried in the Hive client. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15848) Spark unable to read partitioned table in avro format and column name in upper case
[ https://issues.apache.org/jira/browse/SPARK-15848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323195#comment-15323195 ] Zhan Zhang commented on SPARK-15848:
cat > file1.csv< file2.csv<
val tbl = sqlContext.table("default.avro_table_uppercase");
scala> tbl.show
+----------+----------+-----+----+
|student_id|subject_id|marks|year|
+----------+----------+-----+----+
|      null|      null|  100|2000|
|      null|      null|   20|2000|
|      null|      null|  160|2000|
|      null|      null|  963|2000|
|      null|      null|  142|2000|
|      null|      null|  430|2000|
|      null|      null|   91|2002|
|      null|      null|   28|2002|
|      null|      null|   16|2002|
|      null|      null|   96|2002|
|      null|      null|   14|2002|
|      null|      null|   43|2002|
+----------+----------+-----+----+
> Spark unable to read partitioned table in avro format and column name in > upper case > --- > > Key: SPARK-15848 > URL: https://issues.apache.org/jira/browse/SPARK-15848 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Zhan Zhang > > If external partitioned Hive tables created in Avro format. > Spark is returning "null" values if columns names are in Uppercase in the > Avro schema. > The same tables return proper data when queried in the Hive client. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15848) Spark unable to read partitioned table in avro format and column name in upper case
Zhan Zhang created SPARK-15848: -- Summary: Spark unable to read partitioned table in avro format and column name in upper case Key: SPARK-15848 URL: https://issues.apache.org/jira/browse/SPARK-15848 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhan Zhang For external partitioned Hive tables created in Avro format, Spark returns "null" values if column names are in uppercase in the Avro schema. The same tables return proper data when queried in the Hive client. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: HBASE-15335-4.patch > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15473) Documentation for the usage of hbase dataframe user api (JSON, Avro, etc)
[ https://issues.apache.org/jira/browse/HBASE-15473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15473: --- Assignee: (was: Zhan Zhang) > Documentation for the usage of hbase dataframe user api (JSON, Avro, etc) > - > > Key: HBASE-15473 > URL: https://issues.apache.org/jira/browse/HBASE-15473 > Project: HBase > Issue Type: Sub-task > Components: documentation, spark > Reporter: Zhan Zhang >Priority: Blocker > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang reassigned HBASE-14801: -- Assignee: Zhan Zhang > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch, HBASE-14801-2.patch, > HBASE-14801-3.patch, HBASE-14801-4.patch, HBASE-14801-5.patch, > HBASE-14801-6.patch, HBASE-14801-7.patch, HBASE-14801-8.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14801: --- Assignee: (was: Zhan Zhang) > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Attachments: HBASE-14801-1.patch, HBASE-14801-2.patch, > HBASE-14801-3.patch, HBASE-14801-4.patch, HBASE-14801-5.patch, > HBASE-14801-6.patch, HBASE-14801-7.patch, HBASE-14801-8.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-15441) dataset outer join seems to return incorrect result
[ https://issues.apache.org/jira/browse/SPARK-15441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297881#comment-15297881 ] Zhan Zhang commented on SPARK-15441: Currently new GenericInternalRow(right.output.length) is used as nullRow, but actually it cannot be used to identify the difference of row itself is null or all columns are null. Probably we can add a special row nullRow to represent that the InternalRow itself is null, so that Encoder can identify whether the object itself is null or not. > dataset outer join seems to return incorrect result > --- > > Key: SPARK-15441 > URL: https://issues.apache.org/jira/browse/SPARK-15441 > Project: Spark > Issue Type: Bug > Components: sq; >Reporter: Reynold Xin >Assignee: Wenchen Fan >Priority: Critical > > See notebook > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2836020637783173/5382278320999420/latest.html > {code} > import org.apache.spark.sql.functions > val left = List(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDS() > val right = List(("a", "x"), ("b", "y"), ("d", "z")).toDS() > // The last row _1 should be null, rather than (null, -1) > left.toDF("k", "v").as[(String, Int)].alias("left") > .joinWith(right.toDF("k", "u").as[(String, String)].alias("right"), > functions.col("left.k") === functions.col("right.k"), "right_outer") > .show() > {code} > The returned result currently is > {code} > +-+-+ > | _1| _2| > +-+-+ > |(a,2)|(a,x)| > |(a,1)|(a,x)| > |(b,3)|(b,y)| > |(null,-1)|(d,z)| > +-+-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: right outer joins on Datasets
The reason for "-1" is that the default value for Integer is -1 if the value is null def defaultValue(jt: String): String = jt match { ... case JAVA_INT => "-1" ... } -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/right-outer-joins-on-Datasets-tp17542p17651.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Status: Patch Available (was: Open) > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Status: Open (was: Patch Available) > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: HBASE-15335-3.patch > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-15473) Documentation for the usage of hbase dataframe user api (JSON, Avro, etc)
[ https://issues.apache.org/jira/browse/HBASE-15473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289993#comment-15289993 ] Zhan Zhang commented on HBASE-15473: [~WeiqingYang] is working on this. > Documentation for the usage of hbase dataframe user api (JSON, Avro, etc) > - > > Key: HBASE-15473 > URL: https://issues.apache.org/jira/browse/HBASE-15473 > Project: HBase > Issue Type: Sub-task > Components: documentation, spark > Reporter: Zhan Zhang > Assignee: Zhan Zhang >Priority: Blocker > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: HBASE-15335-2.patch > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: HBASE-15335-2.patch > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: HBASE-15335-2.patch > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: (was: HBASE-15335-2.patch) > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: (was: HBASE-15335-2.patch) > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-15825) Fix the null pointer in DynamicLogicExpressionSuite
[ https://issues.apache.org/jira/browse/HBASE-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284923#comment-15284923 ] Zhan Zhang commented on HBASE-15825: [~ted_yu] Thanks a lot > Fix the null pointer in DynamicLogicExpressionSuite > --- > > Key: HBASE-15825 > URL: https://issues.apache.org/jira/browse/HBASE-15825 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Fix For: 2.0.0 > > Attachments: HBASE-15825-1.patch > > > It only happens in test cases. Not sure why it is not caught. Will submit > patch soon -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15825) Fix the null pointer in DynamicLogicExpressionSuite
[ https://issues.apache.org/jira/browse/HBASE-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15825: --- Status: Patch Available (was: Open) > Fix the null pointer in DynamicLogicExpressionSuite > --- > > Key: HBASE-15825 > URL: https://issues.apache.org/jira/browse/HBASE-15825 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Fix For: 2.0.0 > > Attachments: HBASE-15825-1.patch > > > It only happens in test cases. Not sure why it is not caught. Will submit > patch soon -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-15825) Fix the null pointer in DynamicLogicExpressionSuite
[ https://issues.apache.org/jira/browse/HBASE-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15283172#comment-15283172 ] Zhan Zhang commented on HBASE-15825: [~te...@apache.org] [~jmhsieh] Can you please take a quick look? It is a straightforward fix. > Fix the null pointer in DynamicLogicExpressionSuite > --- > > Key: HBASE-15825 > URL: https://issues.apache.org/jira/browse/HBASE-15825 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Fix For: 2.0.0 > > Attachments: HBASE-15825-1.patch > > > It only happens in test cases. Not sure why it is not caught. Will submit > patch soon -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15825) Fix the null pointer in DynamicLogicExpressionSuite
[ https://issues.apache.org/jira/browse/HBASE-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15825: --- Attachment: HBASE-15825-1.patch > Fix the null pointer in DynamicLogicExpressionSuite > --- > > Key: HBASE-15825 > URL: https://issues.apache.org/jira/browse/HBASE-15825 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Fix For: 2.0.0 > > Attachments: HBASE-15825-1.patch > > > It only happens in test cases. Not sure why it is not caught. Will submit > patch soon -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-15333) [hbase-spark] Enhance dataframe filters to handle naively encoded short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15283159#comment-15283159 ] Zhan Zhang commented on HBASE-15333: Open jira HBASE-15825 to fix the test case. > [hbase-spark] Enhance dataframe filters to handle naively encoded short, > integer, long, float and double > > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Components: spark > Reporter: Zhan Zhang >Assignee: Zhan Zhang > Fix For: 2.0.0 > > Attachments: HBASE-15333-1.patch, HBASE-15333-10.patch, > HBASE-15333-2.patch, HBASE-15333-3.patch, HBASE-15333-4.patch, > HBASE-15333-5.patch, HBASE-15333-6.patch, HBASE-15333-7.patch, > HBASE-15333-8.patch, HBASE-15333-9.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-15825) Fix the null pointer in DynamicLogicExpressionSuite
Zhan Zhang created HBASE-15825: -- Summary: Fix the null pointer in DynamicLogicExpressionSuite Key: HBASE-15825 URL: https://issues.apache.org/jira/browse/HBASE-15825 Project: HBase Issue Type: Sub-task Reporter: Zhan Zhang It only happens in test cases. Not sure why it is not caught. Will submit patch soon -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-15333) [hbase-spark] Enhance dataframe filters to handle naively encoded short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15283153#comment-15283153 ] Zhan Zhang commented on HBASE-15333: Not sure why it is not caught during the systest. The failure happens only in test cases. Will open a followup jira to fix the issue. > [hbase-spark] Enhance dataframe filters to handle naively encoded short, > integer, long, float and double > > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Components: spark > Reporter: Zhan Zhang >Assignee: Zhan Zhang > Fix For: 2.0.0 > > Attachments: HBASE-15333-1.patch, HBASE-15333-10.patch, > HBASE-15333-2.patch, HBASE-15333-3.patch, HBASE-15333-4.patch, > HBASE-15333-5.patch, HBASE-15333-6.patch, HBASE-15333-7.patch, > HBASE-15333-8.patch, HBASE-15333-9.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15333: --- Attachment: HBASE-15333-10.patch > Enhance the filter to handle short, integer, long, float and double > --- > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15333-1.patch, HBASE-15333-10.patch, > HBASE-15333-2.patch, HBASE-15333-3.patch, HBASE-15333-4.patch, > HBASE-15333-5.patch, HBASE-15333-6.patch, HBASE-15333-7.patch, > HBASE-15333-8.patch, HBASE-15333-9.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280574#comment-15280574 ] Zhan Zhang commented on HBASE-15333: [~jmhsieh] Would you like to take a final look? Thanks. > Enhance the filter to handle short, integer, long, float and double > --- > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, > HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch, > HBASE-15333-6.patch, HBASE-15333-7.patch, HBASE-15333-8.patch, > HBASE-15333-9.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274815#comment-15274815 ] Zhan Zhang commented on HBASE-15333: I checked the warning, they are false positive. > Enhance the filter to handle short, integer, long, float and double > --- > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, > HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch, > HBASE-15333-6.patch, HBASE-15333-7.patch, HBASE-15333-8.patch, > HBASE-15333-9.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15333: --- Attachment: HBASE-15333-9.patch > Enhance the filter to handle short, integer, long, float and double > --- > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, > HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch, > HBASE-15333-6.patch, HBASE-15333-7.patch, HBASE-15333-8.patch, > HBASE-15333-9.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15333: --- Attachment: HBASE-15333-8.patch > Enhance the filter to handle short, integer, long, float and double > --- > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, > HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch, > HBASE-15333-6.patch, HBASE-15333-7.patch, HBASE-15333-8.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15333: --- Attachment: HBASE-15333-7.patch solve warnings > Enhance the filter to handle short, integer, long, float and double > --- > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, > HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch, > HBASE-15333-6.patch, HBASE-15333-7.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15333: --- Attachment: HBASE-15333-6.patch Address review comments. Restructure the encoder as a plugin so that it can be extended to support other formats. > Enhance the filter to handle short, integer, long, float and double > --- > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, > HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch, > HBASE-15333-6.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15267434#comment-15267434 ] Zhan Zhang commented on HBASE-15333: Thanks for the feedback; I will restructure the code and send it out for review this week. > Enhance the filter to handle short, integer, long, float and double > --- > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, > HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260928#comment-15260928 ] Zhan Zhang commented on HBASE-15333: [~jmhsieh] Thanks for reviewing the code. I want to discuss a few points in more detail before changing the code. Here are my comments:
1. DefaultSource.scala: It is not a replacement; instead it fixes the logic in the partition pruning and predicate pushdown (here we assume the data is naively encoded, the same as the current code base).
2. DefaultSource.scala:618: typo - will fix it.
3. Makes sense. Will do it.
4.1 Will move it to a separate class.
4.2 We are not assuming any specific encoding/decoding here, since we want to support the Java primitive types as already done in the existing codebase. You are definitely right that we may want more flexibility and different encoders/decoders. I think fixing the current naive byte handling takes priority and can be the first step.
5. If we don't change it and there is a PassFilter, it will crash the region server.
6. BoundRange.scala:26: Will format the doc in a more formal way. Typically the bound is inclusive, but the upper-level logic needs to take care of exclusive bounds for special cases.
7. Will change FilterOps to JavaBytesEncoder.
8. Will enhance the current test cases.
Overall, I think making the code base handle naive encoding correctly can be the first step, and at the framework level this does not prevent adding special encodings/decodings later, by me or other contributors. What do you think? Please let me know if you have any concerns. > Enhance the filter to handle short, integer, long, float and double > --- > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, > HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
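A small, hedged illustration of the byte-order problem under discussion, using the standard HBase Bytes utility (the values are just examples):

import org.apache.hadoop.hbase.util.Bytes

// -100 encodes as 0xFFFFFF9C and 1 as 0x00000001, so in unsigned lexicographic byte
// order -100 sorts *after* 1, even though -100 < 1 numerically. A range filter that
// compares raw bytes therefore misbehaves on naively encoded signed values.
val neg = Bytes.toBytes(-100)
val one = Bytes.toBytes(1)
println(Bytes.compareTo(neg, one) > 0)   // true: byte order disagrees with numeric order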
Re: How this unit test passed on master trunk?
There are multiple records for the DF scala> structDF.groupBy($"a").agg(min(struct($"record.*"))).show +---+-+ | a|min(struct(unresolvedstar()))| +---+-+ | 1|[1,1]| | 3|[3,1]| | 2|[2,1]| The meaning of .groupBy($"a").agg(min(struct($"record.*"))) is to get the min for all the records with the same $”a” For example: TestData2(1,1) :: TestData2(1,2) The result would be 1, (1, 1), since struct(1, 1) is less than struct(1, 2). Please check how the Ordering is implemented in InterpretedOrdering. The output itself does not have any ordering. I am not sure why the unit test and the real env have different environment. Xiao, I do see the difference between unit test and local cluster run. Do you know the reason? Thanks. Zhan Zhang On Apr 22, 2016, at 11:23 AM, Yong Zhang <java8...@hotmail.com<mailto:java8...@hotmail.com>> wrote: Hi, I was trying to find out why this unit test can pass in Spark code. in https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala for this unit test: test("Star Expansion - CreateStruct and CreateArray") { val structDf = testData2.select("a", "b").as("record") // CreateStruct and CreateArray in aggregateExpressions assert(structDf.groupBy($"a").agg(min(struct($"record.*"))).first() == Row(3, Row(3, 1))) assert(structDf.groupBy($"a").agg(min(array($"record.*"))).first() == Row(3, Seq(3, 1))) // CreateStruct and CreateArray in project list (unresolved alias) assert(structDf.select(struct($"record.*")).first() == Row(Row(1, 1))) assert(structDf.select(array($"record.*")).first().getAs[Seq[Int]](0) === Seq(1, 1)) // CreateStruct and CreateArray in project list (alias) assert(structDf.select(struct($"record.*").as("a")).first() == Row(Row(1, 1))) assert(structDf.select(array($"record.*").as("a")).first().getAs[Seq[Int]](0) === Seq(1, 1)) } >From my understanding, the data return in this case should be Row(1, Row(1, >1]), as that will be min of struct. In fact, if I run the spark-shell on my laptop, and I got the result I expected: ./bin/spark-shell Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91) Type in expressions to have them evaluated. Type :help for more information. scala> case class TestData2(a: Int, b: Int) defined class TestData2 scala> val testData2DF = sqlContext.sparkContext.parallelize(TestData2(1,1) :: TestData2(1,2) :: TestData2(2,1) :: TestData2(2,2) :: TestData2(3,1) :: TestData2(3,2) :: Nil, 2).toDF() scala> val structDF = testData2DF.select("a","b").as("record") scala> structDF.groupBy($"a").agg(min(struct($"record.*"))).first() res0: org.apache.spark.sql.Row = [1,[1,1]] scala> structDF.show +---+---+ | a| b| +---+---+ | 1| 1| | 1| 2| | 2| 1| | 2| 2| | 3| 1| | 3| 2| +---+---+ So from my spark, which I built on the master, I cannot get Row[3,[1,1]] back in this case. Why the unit test asserts that Row[3,[1,1]] should be the first, and it will pass? But I cannot reproduce that in my spark-shell? I am trying to understand how to interpret the meaning of "agg(min(struct($"record.*")))" Thanks Yong
Re: Save DataFrame to HBase
You can try this https://github.com/hortonworks/shc.git or here http://spark-packages.org/package/zhzhan/shc Currently it is in the process of merging into HBase. Thanks. Zhan Zhang On Apr 21, 2016, at 8:44 AM, Benjamin Kim <bbuil...@gmail.com<mailto:bbuil...@gmail.com>> wrote: Hi Ted, Can this module be used with an older version of HBase, such as 1.0 or 1.1? Where can I get the module from? Thanks, Ben On Apr 21, 2016, at 6:56 AM, Ted Yu <yuzhih...@gmail.com<mailto:yuzhih...@gmail.com>> wrote: The hbase-spark module in Apache HBase (coming with hbase 2.0 release) can do this. On Thu, Apr 21, 2016 at 6:52 AM, Benjamin Kim <bbuil...@gmail.com<mailto:bbuil...@gmail.com>> wrote: Has anyone found an easy way to save a DataFrame into HBase? Thanks, Ben - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org<mailto:user-unsubscr...@spark.apache.org> For additional commands, e-mail: user-h...@spark.apache.org<mailto:user-h...@spark.apache.org>
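For completeness, a hedged sketch of writing a DataFrame through the shc connector, following the conventions in its README; the catalog JSON, table name, column mappings, and the existing DataFrame df are illustrative assumptions, and the package coordinates may differ by release:

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Illustrative catalog mapping DataFrame columns to an HBase table named "table1".
val catalog = """{
  |"table":{"namespace":"default", "name":"table1"},
  |"rowkey":"key",
  |"columns":{
    |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
    |"col1":{"cf":"cf1", "col":"col1", "type":"int"}
  |}
|}""".stripMargin

// df is assumed to be an existing DataFrame with columns col0 and col1.
df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()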
Re: Spark SQL insert overwrite table not showing all the partition.
INSERT OVERWRITE will overwrite any existing data in the table or partition, unless IF NOT EXISTS is provided for a partition (as of Hive 0.9.0<https://issues.apache.org/jira/browse/HIVE-2612>).

Thanks.

Zhan Zhang

On Apr 21, 2016, at 3:20 PM, Bijay Kumar Pathak <bkpat...@mtu.edu<mailto:bkpat...@mtu.edu>> wrote:

Hi,

I have a job which writes to a Hive table with dynamic partitions. Inside the job, I write into the table two times, but I only see the partition from the last write, although I can see in the Spark UI that it is processing data for both partitions. Below is the query I am using to write to the table.

hive_c.sql("""INSERT OVERWRITE TABLE base_table PARTITION (date='{1}', date_2)
              SELECT * from temp_table
           """.format(date_val))

Thanks,
Bijay
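A rough sketch of two ways around this, assuming a table partitioned by (date, date_2); the table, staging-table and column names are made up for illustration and the SQL would need adapting to the real schema:

// Hypothetical value of the static partition column.
val dateVal = "2016-04-21"

// Option 1: combine the two intermediate results first, then overwrite once,
// so a single INSERT OVERWRITE carries every dynamic date_2 value.
hiveContext.sql(
  s"""INSERT OVERWRITE TABLE base_table PARTITION (date='$dateVal', date_2)
     |SELECT * FROM (
     |  SELECT * FROM temp_table_1
     |  UNION ALL
     |  SELECT * FROM temp_table_2
     |) unioned""".stripMargin)

// Option 2: make date_2 static as well, so each statement only replaces its own
// partition; the SELECT list must then exclude the partition columns.
hiveContext.sql(
  s"""INSERT OVERWRITE TABLE base_table PARTITION (date='$dateVal', date_2='x')
     |SELECT col_a, col_b FROM temp_table_1""".stripMargin)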
Re: Spark DataFrame sum of multiple columns
You can define your own UDF; the following is one example (taken from the Spark test suite, truncated as in the original):

val foo = udf((a: Int, b: String) => a.toString + b)

checkAnswer(
  // SELECT *, foo(key, value) FROM testData
  testData.select($"*", foo('key, 'value)).limit(3),

Thanks

Zhan Zhang

On Apr 21, 2016, at 8:51 PM, Naveen Kumar Pokala <npok...@spcapitaliq.com<mailto:npok...@spcapitaliq.com>> wrote:

Hi,

Do we have any way to perform row-level operations on Spark DataFrames?

For example, I have a DataFrame with columns from A, B, C, ..., Z. I want to add one more column, "New Column", with the sum of all column values.

A  B  C  D  ...  Z   New Column
1  2  4  3  ...  26  351

Can somebody help me on this?

Thanks,
Naveen
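A minimal sketch of one way to do this without writing a UDF at all, by folding the + operator over the column list; the DataFrame df and the excluded "id" column are hypothetical, and all summed columns are assumed to be numeric:

import org.apache.spark.sql.functions.col

// Builds the expression colA + colB + ... + colZ and attaches it as a new column.
val colsToSum = df.columns.filterNot(_ == "id")
val total     = colsToSum.map(col).reduce(_ + _)
val withSum   = df.withColumn("New Column", total)
withSum.show()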
Re: Why Spark having OutOfMemory Exception?
The data may not be large, but the driver needs to do a lot of bookkeeping. In your case, it is possible the driver control plane takes too much memory. I think you can find a Java developer to look at the core dump. Otherwise, it is hard to tell exactly which part is using all the memory.

Thanks.

Zhan Zhang

On Apr 20, 2016, at 1:38 AM, 李明伟 <kramer2...@126.com<mailto:kramer2...@126.com>> wrote:

Hi

The input data size is less than 10M. The task result size should be less, I think, because I am doing aggregation on the data.

At 2016-04-20 16:18:31, "Jeff Zhang" <zjf...@gmail.com<mailto:zjf...@gmail.com>> wrote:

Do you mean the input data size is 10M, or the task result size?

>>> But my way is to setup a forever loop to handle continued income data. Not
>>> sure if it is the right way to use spark

Not sure what this means. Do you use Spark Streaming, or do you run a batch job in the forever loop?

On Wed, Apr 20, 2016 at 3:55 PM, 李明伟 <kramer2...@126.com<mailto:kramer2...@126.com>> wrote:

Hi Jeff

The total size of my data is less than 10M. I already set the driver memory to 4GB.

On 2016-04-20 13:42:25, "Jeff Zhang" <zjf...@gmail.com<mailto:zjf...@gmail.com>> wrote:

It seems to be an OOM on the driver side when fetching task results. You can try to increase spark.driver.memory and spark.driver.maxResultSize.

On Tue, Apr 19, 2016 at 4:06 PM, 李明伟 <kramer2...@126.com<mailto:kramer2...@126.com>> wrote:

Hi Zhan Zhang

Please see the exception trace below. It is reporting a GC overhead limit error. I am not a Java or Scala developer, so it is hard for me to understand this information, and reading the core dump is too difficult for me. I am not sure if the way I am using Spark is correct. I understand that Spark can do batch or stream calculation, but my way is to set up a forever loop to handle continuously incoming data.
Not sure if it is the right way to use Spark.

16/04/19 15:54:55 ERROR Utils: Uncaught exception in thread task-result-getter-2
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
    at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
    at scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
    at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
    at org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
    at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
    at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
    at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
    at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
    at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
    at scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
    at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
    at sun.reflect.DelegatingMethod
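A rough sketch of the driver-side settings mentioned above, expressed through SparkConf; the values and application name are only illustrative, and in practice spark.driver.memory usually has to be given on the spark-submit command line (e.g. --driver-memory 4g) because the driver JVM is already running by the time application code sets it:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("aggregation-loop")            // hypothetical application name
  .set("spark.driver.memory", "4g")          // driver heap; must be set before the driver JVM starts
  .set("spark.driver.maxResultSize", "2g")   // cap on serialized task results collected back to the driver
val sc = new SparkContext(conf)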
Re: [GRAPHX] Graph Algorithms and Spark
You can take a look at this blog post from Databricks about GraphFrames: https://databricks.com/blog/2016/03/03/introducing-graphframes.html

Thanks.

Zhan Zhang

On Apr 21, 2016, at 12:53 PM, Robin East <robin.e...@xense.co.uk<mailto:robin.e...@xense.co.uk>> wrote:

Hi

Aside from LDA, which is implemented in MLlib, GraphX has the following built-in algorithms:

* PageRank/Personalised PageRank
* Connected Components
* Strongly Connected Components
* Triangle Count
* Shortest Paths
* Label Propagation

It also implements a version of the Pregel framework, a form of bulk-synchronous parallel processing that is the foundation of most of the above algorithms. We cover other algorithms in our book, and if you search on Google you will find a number of other examples.

---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action

On 21 Apr 2016, at 19:47, tgensol <thibaut.gensol...@gmail.com<mailto:thibaut.gensol...@gmail.com>> wrote:

Hi there,

I am working in a group at the University of Michigan, and we are trying to implement (and first find) some distributed graph algorithms. I know Spark, and I found GraphX. I read the docs, but I only found Latent Dirichlet Allocation algorithms working with GraphX, so I was wondering why.

Basically, the group wants to implement Minimal Spanning Tree, kNN, and shortest path at first. So my questions are:

Is GraphX stable enough for developing this kind of algorithm on it?
Do you know of algorithms like these working on top of GraphX? And if not, why do you think nobody has tried to do it? Is it too hard, or just because nobody needs it?

Maybe it is only my knowledge of GraphX that is weak, and it is not possible to implement these algorithms with GraphX.

Thanking you in advance,

Best regards,

Thibaut

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/GRAPHX-Graph-Algorithms-and-Spark-tp17301.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com<http://nabble.com/>.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org<mailto:dev-unsubscr...@spark.apache.org>
For additional commands, e-mail: dev-h...@spark.apache.org<mailto:dev-h...@spark.apache.org>
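As a small illustration of calling one of the built-in GraphX algorithms listed above, here is a sketch of connected components on a toy edge list; the vertex ids, edge attribute and the SparkContext sc are assumed only for the example:

import org.apache.spark.graphx.{Edge, Graph}

// Toy graph with two components: {1, 2, 3} and {5, 6}.
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(5L, 6L, "follows")))

val graph = Graph.fromEdges(edges, defaultValue = "user")

// Each vertex is labelled with the smallest vertex id in its component.
val components = graph.connectedComponents().vertices
components.collect().foreach { case (id, comp) =>
  println(s"vertex $id -> component $comp")
}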