Re: [gdal-dev] filter out wanted messages in a grib2 file
Thanks, Even! Can gdalinfo also report all the band info that GDALGetRasterCount() + GDALGetRasterBand() + GDALGetMetadata() + GDALGetDescription() provide when used together? > On Mar 19, 2019, at 12:44 PM, Even Rouault wrote: > >> On dimanche 10 mars 2019 21:19:30 CET Zhan Zhang - NOAA Affiliate wrote: >> I have a grib2 file which contains many messages, and those messages define >> different products on different surfaces (like a z axis). For >> instance, some messages define "soil temperature" (product name) on a >> surface called "depth below land surface" (surface name); and other >> messages define "geopotential height" (product name) on a "pressure >> surface" (surface name); etc. May I ask how I can filter out all those >> messages that define "soil temperature" (product name) on a surface called >> "depth below land surface" (surface name)? Are there some grib2 tools >> provided in the gdal APIs that I can use? Thanks! > > Zhan, > > Multiple GRIB messages in a single file are seen by GDAL as a multi-band > dataset. > So you need to iterate over the bands ( GDALGetRasterCount() + > GDALGetRasterBand() for the C API ) and fetch their metadata ( > GDALGetMetadata() > ) and description ( GDALGetDescription() ) to see if the band is of interest > to you. > > The output of gdalinfo should help you understand this. > > If using the Python API, you may also analyze the result of > gdal.Info('my.grb2', format = 'json'), > which is a Python dictionary, to retrieve the band you are interested in. > > Best regards, > > Even > > -- > Spatialys - Geospatial professional services > http://www.spatialys.com ___ gdal-dev mailing list gdal-dev@lists.osgeo.org https://lists.osgeo.org/mailman/listinfo/gdal-dev
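A minimal Python sketch of the band iteration Even describes, assuming the GDAL Python bindings are available; the metadata keys (GRIB_COMMENT) and the strings being matched are assumptions and should be checked against the actual gdalinfo output of the file:

  from osgeo import gdal

  ds = gdal.Open("my.grb2")
  wanted = []
  for i in range(1, ds.RasterCount + 1):
      band = ds.GetRasterBand(i)
      md = band.GetMetadata()        # per-band GRIB metadata dictionary
      desc = band.GetDescription()   # usually carries the level/surface text
      if ("soil temperature" in md.get("GRIB_COMMENT", "").lower()
              and "depth below land surface" in desc.lower()):
          wanted.append(i)           # band numbers of the matching messages
  print(wanted)

Each matching band can then be read with band.ReadAsArray() or extracted with gdal.Translate using its bandList option.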
Re: [gdal-dev] filter out wanted messages in a grib2 file
When I say "filter out", I mean "retrieve". On Sun, Mar 10, 2019 at 9:19 PM Zhan Zhang - NOAA Affiliate < zhan.j.zh...@noaa.gov> wrote: > I have a grib2 file which contains many messages, and those messages > define different products on different surfaces (like a z axis). For > instance, some messages define "soil temperature" (product name) on a > surface called "depth below land surface" (surface name); and other > messages define "geopotential height" (product name) on a "pressure > surface" (surface name); etc. May I ask how I can filter out all those > messages that define "soil temperature" (product name) on a surface called > "depth below land surface" (surface name)? Are there some grib2 tools > provided in the gdal APIs that I can use? Thanks! > ___ gdal-dev mailing list gdal-dev@lists.osgeo.org https://lists.osgeo.org/mailman/listinfo/gdal-dev
[gdal-dev] filter out wanted messages in a grib2 file
I have a grib2 file which contains many messages, and those messages define different products on different surfaces (like a z axis). For instance, some messages define "soil temperature" (product name) on a surface called "depth below land surface" (surface name); and other messages define "geopotential height" (product name) on a "pressure surface" (surface name); etc. May I ask how I can filter out all those messages that define "soil temperature" (product name) on a surface called "depth below land surface" (surface name)? Are there some grib2 tools provided in the gdal APIs that I can use? Thanks! ___ gdal-dev mailing list gdal-dev@lists.osgeo.org https://lists.osgeo.org/mailman/listinfo/gdal-dev
Re: [gdal-dev] grib driver
May I ask whether the gdal grib driver supports changing the grid in a GRIB2 message? For instance, given a GRIB2 message, I want to change from a polar stereographic grid projection to a lambert conformal grid projection, with the data section also changed accordingly, and then output this to a new GRIB2 message. I asked the MDL degrib group and they told me degrib cannot do this. It sounds likely that the gdal grib driver cannot do this either. However, I am not 100% sure and want to seek some comments here. Thanks! On Thu, Feb 28, 2019 at 10:27 AM Even Rouault wrote: > On jeudi 28 février 2019 09:29:41 CET Zhan Zhang - NOAA Affiliate wrote: > > I am new to the gdal apis and am interested in getting some knowledge of > > the grib driver. May I ask whether it basically provides the same > > functionality as degrib from MDL/NWS/NOAA? Thanks! --Zhan > > The GDAL GRIB driver strongly relies on degrib+g2clib. It exposes their > functionality through the general GDAL raster API, so as to be able to > manipulate a GRIB dataset as any other GDAL raster. > > Even > > -- > Spatialys - Geospatial professional services > http://www.spatialys.com > ___ gdal-dev mailing list gdal-dev@lists.osgeo.org https://lists.osgeo.org/mailman/listinfo/gdal-dev
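The thread does not confirm whether GDAL can write a reprojected GRIB2 message. As a rough sketch only, one could try warping through the Python bindings, assuming a GDAL build whose GRIB driver has write (CreateCopy) support and substituting a real Lambert conformal definition for the illustrative PROJ string below:

  from osgeo import gdal

  lcc = "+proj=lcc +lat_1=33 +lat_2=45 +lat_0=39 +lon_0=-96 +datum=WGS84"  # illustrative only
  # Warp to an in-memory dataset first, then let the GRIB driver's CreateCopy()
  # write the result as a new GRIB2 file.
  warped = gdal.Warp("", "in.grb2", dstSRS=lcc, format="MEM")
  gdal.Translate("out.grb2", warped, format="GRIB")

Whether the data section of the resulting message is encoded the way you need would still have to be verified.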
[gdal-dev] grib driver
I am new to the gdal apis and am interested in getting some knowledge of the grib driver. May I ask whether it basically provides the same functionality as degrib from MDL/NWS/NOAA? Thanks! --Zhan ___ gdal-dev mailing list gdal-dev@lists.osgeo.org https://lists.osgeo.org/mailman/listinfo/gdal-dev
[jira] [Created] (SPARK-23306) Race condition in TaskMemoryManager
Zhan Zhang created SPARK-23306: -- Summary: Race condition in TaskMemoryManager Key: SPARK-23306 URL: https://issues.apache.org/jira/browse/SPARK-23306 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.1 Reporter: Zhan Zhang There is a race condition in TaskMemoryManager, which may cause OOM. The memory released may be taken by another task because there is a gap between releaseMemory and acquireMemory (e.g., in UnifiedMemoryManager), causing an OOM if the current task is the only one that can perform the spill. It can happen to BytesToBytesMap, as it only spills the required bytes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21505) A dynamic join operator to improve the join reliability
[ https://issues.apache.org/jira/browse/SPARK-21505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236365#comment-16236365 ] Zhan Zhang commented on SPARK-21505: Any comments on this feature? Do you think the design is OK? If so, we are going to submit a PR. > A dynamic join operator to improve the join reliability > --- > > Key: SPARK-21505 > URL: https://issues.apache.org/jira/browse/SPARK-21505 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0, 2.3.0, 3.0.0 >Reporter: Lin >Priority: Major > Labels: features > > As we know, hash join is more efficient than sort merge join. But today hash > join is not so widely used because it may fail with an OutOfMemory (OOM) error > due to limited memory resources, data skew, statistics mis-estimation and so > on. For example, if we apply shuffle hash join on an unevenly distributed > dataset, some partitions might be so large that we cannot build a hash table > for a particular partition, causing an OOM error. When OOM happens, current > Spark technology will throw an exception, resulting in job failure. On the > other hand, if sort-merge join is used, there will be shuffle, sorting and > extra spill, causing degradation of the join. Considering the efficiency > of hash join, we want to propose a fallback mechanism to dynamically use hash > join or sort-merge join at runtime at the task level to provide a more reliable > join operation. > This new dynamic join operator internally implements the logic of HashJoin, > Iterator Reconstruct, Sort, and MergeJoin. We show the process of this > dynamic join method as follows: > HashJoin: We start by building a hash table on one side of the join partitions. > If the hash table is built successfully, it behaves the same as the current > ShuffledHashJoin operator. > Sort: If we fail to build the hash table due to the large partition size, we do > SortMergeJoin only on this partition, but we first need to rebuild the partition's > iterator. When OOM happens, a hash table corresponding to a partial part of this partition has > been built successfully (e.g. the first 4000 rows of the RDD), and the iterator of > this partition is now pointing to the 4001st row of the partition. We reuse this > hash table to reconstruct the iterator for the first 4000 rows and > concatenate it with the remaining rows of this partition so that we can rebuild this > partition completely. On this rebuilt partition, we apply sorting based on > key values. > MergeJoin: After getting two sorted iterators, we perform a regular merge join > against them and emit the records to downstream operators. > Iterator Reconstruct: BytesToBytesMap has to be spilled to disk to release > the memory for other operators, such as Sort, Join, etc. In addition, it has > to be converted to an Iterator, so that it can be concatenated with the remaining > items in the original iterator that is used to build the hash table. > Meta Data Population: Necessary metadata, such as sorting keys, jointype, > etc., has to be populated, so that it can be used by the potential Sort and > MergeJoin operators. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
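An illustrative sketch of the fallback flow described in the issue, written in plain Python rather than Spark code; the (key, value) row format and the row-count cutoff standing in for the OOM condition are assumptions made for the example:

{code}
def dynamic_join(build_rows, stream_rows, max_hash_rows=100000):
    build_iter = iter(build_rows)          # rows are (key, value) tuples
    consumed, table = [], {}
    ok = True
    for row in build_iter:                 # HashJoin: build phase
        consumed.append(row)
        table.setdefault(row[0], []).append(row[1])
        if len(consumed) > max_hash_rows:  # stand-in for the OOM condition
            ok = False
            break
    if ok:                                 # plain shuffled hash join (probe phase)
        return [(k, b, v) for k, v in stream_rows for b in table.get(k, [])]
    # Fallback: iterator reconstruction + sort + merge join on this partition.
    rebuilt = sorted(consumed + list(build_iter))   # rebuild the full build side
    probed = sorted(stream_rows)
    out, i = [], 0
    for k, v in probed:                    # lock-step walk over the sorted rows
        while i < len(rebuilt) and rebuilt[i][0] < k:
            i += 1
        j = i
        while j < len(rebuilt) and rebuilt[j][0] == k:
            out.append((k, rebuilt[j][1], v))
            j += 1
    return out
{code}

For example, dynamic_join([(1, 'a'), (2, 'b')], [(1, 'x'), (3, 'y')]) returns [(1, 'a', 'x')] regardless of which path is taken.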
[jira] [Commented] (SPARK-21492) Memory leak in SortMergeJoin
[ https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095425#comment-16095425 ] Zhan Zhang commented on SPARK-21492: root cause: In SortMergeJoin (inner/leftOuter/rightOuter), one side of the SortedIter may not be exhausted; that chunk of memory thus cannot be released, causing a memory leak and performance degradation. > Memory leak in SortMergeJoin > > > Key: SPARK-21492 > URL: https://issues.apache.org/jira/browse/SPARK-21492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Reporter: Zhan Zhang > > In SortMergeJoin, if the iterator is not exhausted, there will be a memory leak > caused by the Sort. The memory is not released until the task ends, and cannot > be used by other operators, causing a performance drop or OOM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21492) Memory leak in SortMergeJoin
Zhan Zhang created SPARK-21492: -- Summary: Memory leak in SortMergeJoin Key: SPARK-21492 URL: https://issues.apache.org/jira/browse/SPARK-21492 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Zhan Zhang In SortMergeJoin, if the iterator is not exhausted, there will be a memory leak caused by the Sort. The memory is not released until the task ends, and cannot be used by other operators, causing a performance drop or OOM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-20215) ReuseExchange is broken in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-20215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-20215: --- Comment: was deleted (was: Seems to be fixed in SPARK-20229) > ReuseExchange is broken in SparkSQL > -- > > Key: SPARK-20215 > URL: https://issues.apache.org/jira/browse/SPARK-20215 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Reporter: Zhan Zhang >Priority: Minor > > Currently if we have a query like: A join B union A join C... with the same > join key, table A will be scanned multiple times in SQL. It is because the > MetastoreRelation is not shared by the two joins, and the ExprId is different. > canonicalized in Expression will not be able to unify them, so the two Exchanges > will not be compatible and cannot be reused. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20215) ReuseExchange is broken in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-20215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968734#comment-15968734 ] Zhan Zhang commented on SPARK-20215: Seems to be fixed in SPARK-20229 > ReuseExchange is broken in SparkSQL > -- > > Key: SPARK-20215 > URL: https://issues.apache.org/jira/browse/SPARK-20215 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Reporter: Zhan Zhang >Priority: Minor > > Currently if we have a query like: A join B union A join C... with the same > join key, table A will be scanned multiple times in SQL. It is because the > MetastoreRelation is not shared by the two joins, and the ExprId is different. > canonicalized in Expression will not be able to unify them, so the two Exchanges > will not be compatible and cannot be reused. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20215) ReuseExchange is broken in SparkSQL
Zhan Zhang created SPARK-20215: -- Summary: ReuseExchange is broken in SparkSQL Key: SPARK-20215 URL: https://issues.apache.org/jira/browse/SPARK-20215 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Zhan Zhang Priority: Minor Currently if we have a query like: A join B union A join C... with the same join key, table A will be scanned multiple times in SQL. It is because the MetastoreRelation is not shared by the two joins, and the ExprId is different. canonicalized in Expression will not be able to unify them, so the two Exchanges will not be compatible and cannot be reused. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20006) Separate threshold for broadcast and shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-20006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931054#comment-15931054 ] Zhan Zhang edited comment on SPARK-20006 at 3/18/17 4:42 AM: - The default ShuffledHashJoin threshold can fallback to the broadcast one. A separate configuration does provide us opportunities to optimize the join dramatically. It would be great if CBO can automatically find the best strategy. But probably I miss something. Currently the CBO does not collect right statistics, especially for partitioned table. I have opened a JIRA for that issue as well. https://issues.apache.org/jira/browse/SPARK-19890 was (Author: zhzhan): The default ShuffledHashJoin threshold can fallback to the broadcast one. A separate configuration does provide us opportunities to optimize the join dramatically. It would be great if CBO can automatically find the best strategy. But probably I miss something. Currently the CBO does not collect right statistics, especially for partitioned table. https://issues.apache.org/jira/browse/SPARK-19890 > Separate threshold for broadcast and shuffled hash join > --- > > Key: SPARK-20006 > URL: https://issues.apache.org/jira/browse/SPARK-20006 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Reporter: Zhan Zhang >Priority: Minor > > Currently both canBroadcast and canBuildLocalHashMap use the same > configuration: AUTO_BROADCASTJOIN_THRESHOLD. > But the memory model may be different. For broadcast, currently the hash map > is always build on heap. For shuffledHashJoin, the hash map may be build on > heap(longHash), or off heap(other map if off heap is enabled). The same > configuration makes the configuration hard to tune (how to allocate memory > onheap/offheap). Propose to use different configuration. Please comments > whether it is reasonable. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20006) Separate threshold for broadcast and shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-20006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931054#comment-15931054 ] Zhan Zhang commented on SPARK-20006: The default ShuffledHashJoin threshold can fall back to the broadcast one. A separate configuration does provide us opportunities to optimize the join dramatically. It would be great if CBO could automatically find the best strategy. But probably I am missing something. Currently the CBO does not collect the right statistics, especially for partitioned tables. https://issues.apache.org/jira/browse/SPARK-19890 > Separate threshold for broadcast and shuffled hash join > --- > > Key: SPARK-20006 > URL: https://issues.apache.org/jira/browse/SPARK-20006 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Reporter: Zhan Zhang >Priority: Minor > > Currently both canBroadcast and canBuildLocalHashMap use the same > configuration: AUTO_BROADCASTJOIN_THRESHOLD. > But the memory model may be different. For broadcast, currently the hash map > is always built on heap. For shuffledHashJoin, the hash map may be built on > heap (longHash) or off heap (another map, if off heap is enabled). The same > configuration makes this hard to tune (how to allocate memory > on heap/off heap). Propose to use different configurations. Please comment on > whether this is reasonable. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20006) Separate threshold for broadcast and shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-20006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-20006: --- Description: Currently both canBroadcast and canBuildLocalHashMap use the same configuration: AUTO_BROADCASTJOIN_THRESHOLD. But the memory model may be different. For broadcast, currently the hash map is always build on heap. For shuffledHashJoin, the hash map may be build on heap(longHash), or off heap(other map if off heap is enabled). The same configuration makes the configuration hard to tune (how to allocate memory onheap/offheap). Propose to use different configuration. Please comments whether it is reasonable. was: Currently both canBroadcast and canBuildLocalHashMap use the same configuration: AUTO_BROADCASTJOIN_THRESHOLD. But the memory model may be different. For broadcast, currently the hash map is always build on heap. For shuffledHashJoin, the hash map may be build on heap(longHash), or off heap(other map if off heap is enabled). The same configuration makes the configuration hard to tune (how to allocate memory onheap/offheap). Propose to use different configuration. > Separate threshold for broadcast and shuffled hash join > --- > > Key: SPARK-20006 > URL: https://issues.apache.org/jira/browse/SPARK-20006 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Reporter: Zhan Zhang >Priority: Minor > > Currently both canBroadcast and canBuildLocalHashMap use the same > configuration: AUTO_BROADCASTJOIN_THRESHOLD. > But the memory model may be different. For broadcast, currently the hash map > is always build on heap. For shuffledHashJoin, the hash map may be build on > heap(longHash), or off heap(other map if off heap is enabled). The same > configuration makes the configuration hard to tune (how to allocate memory > onheap/offheap). Propose to use different configuration. Please comments > whether it is reasonable. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20006) Separate threshold for broadcast and shuffled hash join
Zhan Zhang created SPARK-20006: -- Summary: Separate threshold for broadcast and shuffled hash join Key: SPARK-20006 URL: https://issues.apache.org/jira/browse/SPARK-20006 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Zhan Zhang Priority: Minor Currently both canBroadcast and canBuildLocalHashMap use the same configuration: AUTO_BROADCASTJOIN_THRESHOLD. But the memory model may be different. For broadcast, currently the hash map is always built on heap. For shuffledHashJoin, the hash map may be built on heap (longHash) or off heap (another map, if off heap is enabled). The same configuration makes this hard to tune (how to allocate memory on heap/off heap). Propose to use different configurations. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
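For reference, the single threshold discussed in this issue is exposed to users as spark.sql.autoBroadcastJoinThreshold. A short PySpark illustration of tuning what exists today; the separate shuffled-hash-join threshold is only proposed here, so it is not shown:

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# One setting (in bytes) currently drives both canBroadcast and
# canBuildLocalHashMap; -1 disables broadcast joins entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)
# Shuffled hash join is only considered when sort merge join is not preferred.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
{code}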
[jira] [Created] (SPARK-19908) Direct buffer memory OOM should not cause stage retries.
Zhan Zhang created SPARK-19908: -- Summary: Direct buffer memory OOM should not cause stage retries. Key: SPARK-19908 URL: https://issues.apache.org/jira/browse/SPARK-19908 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 2.1.0 Reporter: Zhan Zhang Priority: Minor Currently if there is java.lang.OutOfMemoryError: Direct buffer memory, the exception will be changed to FetchFailedException, causing stage retries. org.apache.spark.shuffle.FetchFailedException: Direct buffer memory at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:40) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:731) at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:692) at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:854) at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:887) at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.OutOfMemoryError: Direct buffer memory at java.nio.Bits.reserveMemory(Bits.java:658) at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at 
io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511
[jira] [Updated] (SPARK-19890) Make MetastoreRelation statistics estimation more accurately
[ https://issues.apache.org/jira/browse/SPARK-19890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-19890: --- Description: Currently the MetastoreRelation statistics is retrieved on the analyze phase, and the size is based on the table scope. But for partitioned table, this statistics is not useful as table size may 100x+ larger than the input partition size. As a result, the join optimization techniques is not applicable. It would be great if we can postpone the statistics to the optimization phase to get partition information but before physical plan generation phase so that JoinSelection can choose better join methd (broadcast, shuffledjoin, or sortmerjoin). Although the metastorerelation does not associated with partitions, but through PhysicalOperation we can get the partition info for the table. Multiple plan can use the same meatstorerelation, but the estimation is still much better than table size. This way, retrieving statistics is straightforward. Another possible way is to have a another data structure associating the metastore relation and partitions with the plan to get most accurate estimation. was: Currently the MetastoreRelation statistics is retrieved on the analyze phase, and the size is based on the table scope. But for partitioned table, this statistics is not useful as table size may 100x+ larger than the input partition size. As a result, the join optimization techniques is not applicable. It would be great if we can postpone the statistics to the optimization phase to get partition information but before physical plan generation phase so that JoinSelection can choose better join methd (broadcast, shuffledjoin, or sortmerjoin). Although the metastorerelation does not associated with partitions, but through PhysicalOperation we can get the partition info for the table. Although multiple plan can use the same meatstorerelation, but the estimation still much better than table size. This way, retrieving statistics is straightforward. Another possible way is to have a another data structure associating the metastore relation and partitions with the plan to get most accurate estimation. > Make MetastoreRelation statistics estimation more accurately > > > Key: SPARK-19890 > URL: https://issues.apache.org/jira/browse/SPARK-19890 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Reporter: Zhan Zhang >Priority: Minor > > Currently the MetastoreRelation statistics is retrieved on the analyze phase, > and the size is based on the table scope. But for partitioned table, this > statistics is not useful as table size may 100x+ larger than the input > partition size. As a result, the join optimization techniques is not > applicable. > It would be great if we can postpone the statistics to the optimization phase > to get partition information but before physical plan generation phase so > that JoinSelection can choose better join methd (broadcast, shuffledjoin, or > sortmerjoin). > Although the metastorerelation does not associated with partitions, but > through PhysicalOperation we can get the partition info for the table. > Multiple plan can use the same meatstorerelation, but the estimation is still > much better than table size. This way, retrieving statistics is > straightforward. > Another possible way is to have a another data structure associating the > metastore relation and partitions with the plan to get most accurate > estimation. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19890) Make MetastoreRelation statistics estimation more accurately
Zhan Zhang created SPARK-19890: -- Summary: Make MetastoreRelation statistics estimation more accurately Key: SPARK-19890 URL: https://issues.apache.org/jira/browse/SPARK-19890 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Zhan Zhang Priority: Minor Currently the MetastoreRelation statistics are retrieved in the analyze phase, and the size is based on the table scope. But for a partitioned table, these statistics are not useful, as the table size may be 100x+ larger than the input partition size. As a result, the join optimization techniques are not applicable. It would be great if we could postpone the statistics to the optimization phase (but before the physical plan generation phase) to get partition information, so that JoinSelection can choose a better join method (broadcast, shuffled hash join, or sort merge join). Although the MetastoreRelation is not associated with partitions, through PhysicalOperation we can get the partition info for the table. Although multiple plans can use the same MetastoreRelation, the estimation is still much better than the table size. This way, retrieving statistics is straightforward. Another possible way is to have another data structure associating the MetastoreRelation and partitions with the plan to get the most accurate estimation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19839) Fix memory leak in BytesToBytesMap
[ https://issues.apache.org/jira/browse/SPARK-19839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897975#comment-15897975 ] Zhan Zhang commented on SPARK-19839: When BytesToBytesMap spills, its longArray should be released. Otherwise, it may not be released until the task completes. This array may take a significant amount of memory, which cannot be used by later operators, such as UnsafeShuffleExternalSorter, resulting in more frequent spills in the sorter. This patch releases the array, as the destructive iterator will not use this array anymore. > Fix memory leak in BytesToBytesMap > -- > > Key: SPARK-19839 > URL: https://issues.apache.org/jira/browse/SPARK-19839 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 > Reporter: Zhan Zhang > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19839) Fix memory leak in BytesToBytesMap
Zhan Zhang created SPARK-19839: -- Summary: Fix memory leak in BytesToBytesMap Key: SPARK-19839 URL: https://issues.apache.org/jira/browse/SPARK-19839 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Zhan Zhang -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19815) Not orderable should be applied to right key instead of left key
[ https://issues.apache.org/jira/browse/SPARK-19815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895522#comment-15895522 ] Zhan Zhang commented on SPARK-19815: I thought about the logic again. On the surface, the logic may be correct, since in the join the left and right keys should have the same type. Please close this JIRA. > Not orderable should be applied to right key instead of left key > > > Key: SPARK-19815 > URL: https://issues.apache.org/jira/browse/SPARK-19815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Reporter: Zhan Zhang >Priority: Minor > > When generating ShuffledHashJoinExec, the orderable condition should be > applied to the right key instead of the left key. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19815) Not orderable should be applied to right key instead of left key
[ https://issues.apache.org/jira/browse/SPARK-19815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-19815: --- Summary: Not orderable should be applied to right key instead of left key (was: Not order able should be applied to right key instead of left key) > Not orderable should be applied to right key instead of left key > > > Key: SPARK-19815 > URL: https://issues.apache.org/jira/browse/SPARK-19815 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Reporter: Zhan Zhang >Priority: Minor > > When generating ShuffledHashJoinExec, the orderable condition should be > applied to right key instead of left key. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19815) Not order able should be applied to right key instead of left key
Zhan Zhang created SPARK-19815: -- Summary: Not order able should be applied to right key instead of left key Key: SPARK-19815 URL: https://issues.apache.org/jira/browse/SPARK-19815 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Zhan Zhang Priority: Minor When generating ShuffledHashJoinExec, the orderable condition should be applied to right key instead of left key. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19354) Killed tasks are getting marked as FAILED
[ https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862588#comment-15862588 ] Zhan Zhang commented on SPARK-19354: This fix is actually critical. In production, we found that this behavior can cause job retry and failure especially speculation is enabled. /cc [~rxin] Specifically we observe that: When sorter spill to disk, the task is killed. Then a interruptedExecption is thrown. Then OOM will be thrown, which cause the unhandledexception in executor, and eventually shutdown the executor. It happens a lot in speculative tasks. With healthy tasks in the same executor marked as failed as well. Retries will happen. Even worse, such retries may fail again due to same reason, eventually causing job failure. 17/02/11 15:39:38 ERROR TaskMemoryManager: error while calling spill() on org.apache.spark.shuffle.sort.ShuffleExternalSorter@714b17b0 java.nio.channels.ClosedByInterruptException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202) at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269) at org.apache.spark.storage.DiskBlockObjectWriter.commitAndGet(DiskBlockObjectWriter.scala:178) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:186) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:254) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:171) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:245) at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:359) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:382) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:241) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:162) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.shuffle.sort.ShuffleExternalSorter@714b17b0 : null at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:180) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:245) at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:359) at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:382) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:241) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:162) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at 
org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) > Killed tasks are getting marked as FAILED > - > > Key: SPARK-19354 > URL: https://issues.apache.org/jira/browse/SPARK-19354 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Reporter: Devaraj K >Priority: Minor > > When we enable speculation, we can see there are multiple attempts running > for the same task when the first task progress is slow. If any of the task > attempt succeeds then the other attempts will be killed, during killing the > attempts those attempts are getting marked as failed due to the below error. > We need to handle this error and mark the at
[jira] [Commented] (SPARK-13450) SortMergeJoin will OOM when join rows have lot of same keys
[ https://issues.apache.org/jira/browse/SPARK-13450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15812550#comment-15812550 ] Zhan Zhang commented on SPARK-13450: ExternalAppendOnlyMap estimate the size of the data saved. In SortMergeJoin, I think we can leverage UnsafeExternalSorter to get more accurate and controllable behavior. > SortMergeJoin will OOM when join rows have lot of same keys > --- > > Key: SPARK-13450 > URL: https://issues.apache.org/jira/browse/SPARK-13450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.2, 2.1.0 >Reporter: Hong Shen > Attachments: heap-dump-analysis.png > > > When I run a sql with join, task throw java.lang.OutOfMemoryError and sql > failed. I have set spark.executor.memory 4096m. > SortMergeJoin use a ArrayBuffer[InternalRow] to store bufferedMatches, if > the join rows have a lot of same key, it will throw OutOfMemoryError. > {code} > /** Buffered rows from the buffered side of the join. This is empty if > there are no matches. */ > private[this] val bufferedMatches: ArrayBuffer[InternalRow] = new > ArrayBuffer[InternalRow] > {code} > Here is the stackTrace: > {code} > org.xerial.snappy.SnappyNative.arrayCopy(Native Method) > org.xerial.snappy.Snappy.arrayCopy(Snappy.java:84) > org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190) > org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163) > java.io.DataInputStream.readFully(DataInputStream.java:195) > java.io.DataInputStream.readLong(DataInputStream.java:416) > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71) > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79) > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136) > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123) > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:84) > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoin.scala:300) > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.bufferMatchingRows(SortMergeJoin.scala:329) > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoin.scala:229) > org.apache.spark.sql.execution.joins.SortMergeJoin$$anonfun$doExecute$1$$anon$1.advanceNext(SortMergeJoin.scala:105) > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88) > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > org.apache.spark.scheduler.Task.run(Task.scala:89) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15722901#comment-15722901 ] Zhan Zhang commented on HBASE-15335: [~tedyu] It seems that I do not have permission to reassign. Could you help on this? Thanks. > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, > HBASE-15335-6.patch, HBASE-15335-7.patch, HBASE-15335-8.patch, > HBASE-15335-9.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-18637) Stateful UDF should be considered as nondeterministic
[ https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706961#comment-15706961 ] Zhan Zhang commented on SPARK-18637: [~hvanhovell] It is an annotation. /** * UDFType annotations are used to describe properties of a UDF. This gives * important information to the optimizer. * If the UDF is not deterministic, or if it is stateful, it is necessary to * annotate it as such for correctness. * */ @Public @Evolving @Target(ElementType.TYPE) @Retention(RetentionPolicy.RUNTIME) @Inherited public @interface UDFType { /** * Certain optimizations should not be applied if UDF is not deterministic. * Deterministic UDF returns same result each time it is invoked with a * particular input. This determinism just needs to hold within the context of * a query. * * @return true if the UDF is deterministic */ boolean deterministic() default true; /** * If a UDF stores state based on the sequence of records it has processed, it * is stateful. A stateful UDF cannot be used in certain expressions such as * case statement and certain optimizations such as AND/OR short circuiting * don't apply for such UDFs, as they need to be invoked for each record. * row_sequence is an example of stateful UDF. A stateful UDF is considered to * be non-deterministic, irrespective of what deterministic() returns. * * @return true */ boolean stateful() default false; /** * A UDF is considered distinctLike if the UDF can be evaluated on just the * distinct values of a column. Examples include min and max UDFs. This * information is used by metadata-only optimizer. * * @return true if UDF is distinctLike */ boolean distinctLike() default false; /** * Using in analytical functions to specify that UDF implies an ordering * * @return true if the function implies order */ boolean impliesOrder() default false; } > Stateful UDF should be considered as nondeterministic > - > > Key: SPARK-18637 > URL: https://issues.apache.org/jira/browse/SPARK-18637 > Project: Spark > Issue Type: Bug > Components: SQL > Reporter: Zhan Zhang > > If the annotation UDFType of a udf is stateful, it shoudl be considered as > non-deterministic. Otherwise, the catalyst may optimize the plan and return > the wrong result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18637) Stateful UDF should be considered as nondeterministic
[ https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706961#comment-15706961 ] Zhan Zhang edited comment on SPARK-18637 at 11/29/16 11:52 PM: --- [~hvanhovell] It is an annotation. /** * UDFType annotations are used to describe properties of a UDF. This gives * important information to the optimizer. * If the UDF is not deterministic, or if it is stateful, it is necessary to * annotate it as such for correctness. * */ was (Author: zhzhan): [~hvanhovell] It is an annotation. /** * UDFType annotations are used to describe properties of a UDF. This gives * important information to the optimizer. * If the UDF is not deterministic, or if it is stateful, it is necessary to * annotate it as such for correctness. * */ @Public @Evolving @Target(ElementType.TYPE) @Retention(RetentionPolicy.RUNTIME) @Inherited public @interface UDFType { /** * Certain optimizations should not be applied if UDF is not deterministic. * Deterministic UDF returns same result each time it is invoked with a * particular input. This determinism just needs to hold within the context of * a query. * * @return true if the UDF is deterministic */ boolean deterministic() default true; /** * If a UDF stores state based on the sequence of records it has processed, it * is stateful. A stateful UDF cannot be used in certain expressions such as * case statement and certain optimizations such as AND/OR short circuiting * don't apply for such UDFs, as they need to be invoked for each record. * row_sequence is an example of stateful UDF. A stateful UDF is considered to * be non-deterministic, irrespective of what deterministic() returns. * * @return true */ boolean stateful() default false; /** * A UDF is considered distinctLike if the UDF can be evaluated on just the * distinct values of a column. Examples include min and max UDFs. This * information is used by metadata-only optimizer. * * @return true if UDF is distinctLike */ boolean distinctLike() default false; /** * Using in analytical functions to specify that UDF implies an ordering * * @return true if the function implies order */ boolean impliesOrder() default false; } > Stateful UDF should be considered as nondeterministic > - > > Key: SPARK-18637 > URL: https://issues.apache.org/jira/browse/SPARK-18637 > Project: Spark > Issue Type: Bug > Components: SQL > Reporter: Zhan Zhang > > If the annotation UDFType of a udf is stateful, it shoudl be considered as > non-deterministic. Otherwise, the catalyst may optimize the plan and return > the wrong result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18637) Stateful UDF should be considered as nondeterministic
[ https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-18637: --- Component/s: SQL > Stateful UDF should be considered as nondeterministic > - > > Key: SPARK-18637 > URL: https://issues.apache.org/jira/browse/SPARK-18637 > Project: Spark > Issue Type: Bug > Components: SQL > Reporter: Zhan Zhang > > If the annotation UDFType of a udf is stateful, it should be considered as > non-deterministic. Otherwise, the catalyst may optimize the plan and return > the wrong result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18637) Stateful UDF should be considered as nondeterministic
[ https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706905#comment-15706905 ] Zhan Zhang commented on SPARK-18637: Here is the comments from UDFType /** * If a UDF stores state based on the sequence of records it has processed, it * is stateful. A stateful UDF cannot be used in certain expressions such as * case statement and certain optimizations such as AND/OR short circuiting * don't apply for such UDFs, as they need to be invoked for each record. * row_sequence is an example of stateful UDF. A stateful UDF is considered to * be non-deterministic, irrespective of what deterministic() returns. * * @return true */ boolean stateful() default false; > Stateful UDF should be considered as nondeterministic > - > > Key: SPARK-18637 > URL: https://issues.apache.org/jira/browse/SPARK-18637 > Project: Spark > Issue Type: Bug > Reporter: Zhan Zhang > > If the annotation UDFType of a udf is stateful, it shoudl be considered as > non-deterministic. Otherwise, the catalyst may optimize the plan and return > the wrong result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18637) Stateful UDF should be considered as nondeterministic
Zhan Zhang created SPARK-18637: -- Summary: Stateful UDF should be considered as nondeterministic Key: SPARK-18637 URL: https://issues.apache.org/jira/browse/SPARK-18637 Project: Spark Issue Type: Bug Reporter: Zhan Zhang If the annotation UDFType of a udf is stateful, it should be considered as non-deterministic. Otherwise, the catalyst may optimize the plan and return the wrong result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18550) Make the queue capacity of LiveListenerBus configurable.
[ https://issues.apache.org/jira/browse/SPARK-18550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15688361#comment-15688361 ] Zhan Zhang commented on SPARK-18550: I was not aware it has been fixed already. Please help to close it. > Make the queue capacity of LiveListenerBus configurable. > > > Key: SPARK-18550 > URL: https://issues.apache.org/jira/browse/SPARK-18550 > Project: Spark > Issue Type: Bug > Components: Spark Core > Reporter: Zhan Zhang >Priority: Minor > > We meet issues that driver listener bus cannot catch up the speed of incoming > event. Current value is fixed as 1000. This value should be configurable per > job. Otherwise, when event is dropped, the UI is totally useless. > Bus: Dropping SparkListenerEvent because no remaining room in event queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18550) Make the queue capacity of LiveListenerBus configurable.
Zhan Zhang created SPARK-18550: -- Summary: Make the queue capacity of LiveListenerBus configurable. Key: SPARK-18550 URL: https://issues.apache.org/jira/browse/SPARK-18550 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Zhan Zhang Priority: Minor We have met issues where the driver listener bus cannot catch up with the speed of incoming events. The current value is fixed at 1000. This value should be configurable per job. Otherwise, when events are dropped, the UI is totally useless. Bus: Dropping SparkListenerEvent because no remaining room in event queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
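As the comment above notes, a configuration for this already exists. A hedged PySpark sketch of setting it; the property name differs across Spark versions (spark.scheduler.listenerbus.eventqueue.size around 2.0/2.1, spark.scheduler.listenerbus.eventqueue.capacity in later releases), so check the version in use:

{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("listener-bus-capacity-demo")
        # Property name is version dependent; ".capacity" applies to newer Spark.
        .set("spark.scheduler.listenerbus.eventqueue.capacity", "20000"))
sc = SparkContext(conf=conf)
{code}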
[jira] [Commented] (SPARK-17637) Packed scheduling for Spark tasks across executors
[ https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687474#comment-15687474 ] Zhan Zhang commented on SPARK-17637: [~hvanhovell] Thanks. PR is updated with conflicts resolved. > Packed scheduling for Spark tasks across executors > -- > > Key: SPARK-17637 > URL: https://issues.apache.org/jira/browse/SPARK-17637 > Project: Spark > Issue Type: Improvement > Components: Scheduler > Reporter: Zhan Zhang > Assignee: Zhan Zhang >Priority: Minor > > Currently Spark scheduler implements round robin scheduling for tasks to > executors. Which is great as it distributes the load evenly across the > cluster, but this leads to significant resource waste in some cases, > especially when dynamic allocation is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
SparkPlan/Shuffle stage reuse with Dataset/DataFrame
Hi Folks, We have some Dataset/DataFrame use cases that would benefit from reusing the SparkPlan and shuffle stage, for example the following case. Because the query optimization and SparkPlan are generated by Catalyst when the query is executed, the underlying RDD lineage is regenerated for dataset1, and thus the shuffle stage will be executed multiple times. val dataset1 = dataset.groupby.agg dataset1.registerTempTable("tmpTable") spark.sql("select * from tmpTable where condition").collect spark.sql("select * from tmpTable where condition1").collect On the one hand, we get an optimized query plan, but on the other hand, we cannot reuse the data generated by the shuffle stage. Currently, to reuse dataset1, we have to use persist to cache the data. It is helpful but sometimes not what we want, as it has some side effects. For example, we cannot release an executor that has an active cache in it, even if it is idle and dynamic allocation is enabled. In other words, we only want to reuse the shuffle data as much as possible, without caching, in a long pipeline with multiple shuffle stages. I am wondering whether it makes sense to add a new feature to Dataset/DataFrame to act as a barrier and prevent query optimization from happening across the barrier. For example, in the above case, we want Catalyst to take tmpTable as a barrier and stop optimization across it, so that we can reuse the underlying RDD lineage of dataset1. The prototype code to make it work is quite small, and we tried it in house with a new API, Dataset.cacheShuffle, to make this happen. But I want some feedback from the community before opening a JIRA, as in some sense it does stop the optimization earlier. Any comments? -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/SparkPlan-Shuffle-stage-reuse-with-Dataset-DataFrame-tp19502.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
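A small PySpark sketch of the persist-based workaround mentioned in the post (the proposed Dataset.cacheShuffle API is in-house and not shown); the column names, predicates and storage level are just placeholders:

  from pyspark.sql import SparkSession, functions as F
  from pyspark import StorageLevel

  spark = SparkSession.builder.getOrCreate()
  df = spark.range(0, 1000000).withColumn("k", F.col("id") % 100)

  agg = df.groupBy("k").agg(F.count("*").alias("cnt"))
  agg.persist(StorageLevel.MEMORY_AND_DISK)   # materialize once, reuse below
  agg.createOrReplaceTempView("tmpTable")

  spark.sql("select * from tmpTable where cnt > 10").collect()
  spark.sql("select * from tmpTable where k < 50").collect()

Without the persist call, each spark.sql query above re-runs the groupBy shuffle, which is exactly the re-computation the post is describing.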
[jira] [Commented] (SPARK-17637) Packed scheduling for Spark tasks across executors
[ https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15516851#comment-15516851 ] Zhan Zhang commented on SPARK-17637: [~jerryshao] The idea is straightforward. Instead of doing round robin over the executors with available cores, the new scheduling tries to allocate tasks to the executors with the fewest available cores. As a result, the executors with more free resources may not get new tasks allocated; with dynamic allocation enabled, those executors may then be released so that other jobs can obtain the resources they need from the underlying resource manager. The approach is not specifically bound to dynamic allocation, but dynamic allocation is an easy way to understand the gains of the new scheduler. In addition, the patch (soon to be sent out) also contains another scheduler that does exactly the opposite, allocating tasks to the executors with the most available cores in order to balance the workload across all executors. > Packed scheduling for Spark tasks across executors > -- > > Key: SPARK-17637 > URL: https://issues.apache.org/jira/browse/SPARK-17637 > Project: Spark > Issue Type: Improvement > Components: Scheduler > Reporter: Zhan Zhang >Priority: Minor > > Currently Spark scheduler implements round robin scheduling for tasks to > executors. Which is great as it distributes the load evenly across the > cluster, but this leads to significant resource waste in some cases, > especially when dynamic allocation is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
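To make the two policies concrete, here is a hedged, self-contained sketch; the names and data structures below are illustrative assumptions, not actual Spark scheduler internals:

// Packed: prefer the executor with the fewest free cores that can still fit the task,
// so lightly loaded executors stay empty and can be released under dynamic allocation.
// Balanced: the opposite policy, spreading tasks onto the executor with the most free cores.
case class ExecutorSlot(id: String, freeCores: Int)

def pickPacked(execs: Seq[ExecutorSlot], coresPerTask: Int): Option[ExecutorSlot] =
  execs.filter(_.freeCores >= coresPerTask).sortBy(_.freeCores).headOption

def pickBalanced(execs: Seq[ExecutorSlot], coresPerTask: Int): Option[ExecutorSlot] =
  execs.filter(_.freeCores >= coresPerTask).sortBy(-_.freeCores).headOption

val execs = Seq(ExecutorSlot("e1", 4), ExecutorSlot("e2", 1), ExecutorSlot("e3", 2))
println(pickPacked(execs, 1))    // Some(ExecutorSlot(e2,1)) -- pack onto the busiest executor that still fits
println(pickBalanced(execs, 1))  // Some(ExecutorSlot(e1,4)) -- spread onto the least busy executor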
[jira] [Commented] (SPARK-17637) Packed scheduling for Spark tasks across executors
[ https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15515247#comment-15515247 ] Zhan Zhang commented on SPARK-17637: cc [~rxin] A quick prototype shows that for a tested pipeline, the job can save around 45% regarding the reserved cpu and memory when the dynamic allocation is enabled. > Packed scheduling for Spark tasks across executors > -- > > Key: SPARK-17637 > URL: https://issues.apache.org/jira/browse/SPARK-17637 > Project: Spark > Issue Type: Improvement > Components: Scheduler > Reporter: Zhan Zhang >Priority: Minor > > Currently Spark scheduler implements round robin scheduling for tasks to > executors. Which is great as it distributes the load evenly across the > cluster, but this leads to significant resource waste in some cases, > especially when dynamic allocation is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17637) Packed scheduling for Spark tasks across executors
[ https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514105#comment-15514105 ] Zhan Zhang commented on SPARK-17637: The plan is to introduce a new configuration so that different scheduling algorithms can be used for the task scheduling. > Packed scheduling for Spark tasks across executors > -- > > Key: SPARK-17637 > URL: https://issues.apache.org/jira/browse/SPARK-17637 > Project: Spark > Issue Type: Improvement > Components: Scheduler > Reporter: Zhan Zhang >Priority: Minor > > Currently Spark scheduler implements round robin scheduling for tasks to > executors. Which is great as it distributes the load evenly across the > cluster, but this leads to significant resource waste in some cases, > especially when dynamic allocation is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
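Purely as an illustration of what such "a new configuration" could look like, a hypothetical sketch; the property name and values below are not actual Spark settings:

import org.apache.spark.SparkConf

// Hypothetical property and values, shown only to make the proposal concrete.
val conf = new SparkConf().set("spark.task.assigner", "packed") // or "balanced" / "roundrobin"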
[jira] [Created] (SPARK-17637) Packed scheduling for Spark tasks across executors
Zhan Zhang created SPARK-17637: -- Summary: Packed scheduling for Spark tasks across executors Key: SPARK-17637 URL: https://issues.apache.org/jira/browse/SPARK-17637 Project: Spark Issue Type: Improvement Components: Scheduler Reporter: Zhan Zhang Priority: Minor Currently the Spark scheduler implements round-robin scheduling of tasks to executors, which is great as it distributes the load evenly across the cluster, but it leads to significant resource waste in some cases, especially when dynamic allocation is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17526) Display the executor log links with the job failure message on Spark UI and Console
Zhan Zhang created SPARK-17526: -- Summary: Display the executor log links with the job failure message on Spark UI and Console Key: SPARK-17526 URL: https://issues.apache.org/jira/browse/SPARK-17526 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Zhan Zhang Priority: Minor Display the executor log links with the job failure message on the Spark UI and console: "Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, HostName): java.lang.Exception: foo" To make this failure message more helpful, the driver log and the web UI should also include a link to the log of the executor on which the task failed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15378457#comment-15378457 ] Zhan Zhang commented on HBASE-15335: [~tedyu] The scaladoc warning seems to be a false positive, as I didn't see the map in the comments of method apply in object HBaseTableCatalog. > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, > HBASE-15335-6.patch, HBASE-15335-7.patch, HBASE-15335-8.patch, > HBASE-15335-9.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: HBASE-15335-9.patch > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, > HBASE-15335-6.patch, HBASE-15335-7.patch, HBASE-15335-8.patch, > HBASE-15335-9.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: HBASE-15335-8.patch > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, > HBASE-15335-6.patch, HBASE-15335-7.patch, HBASE-15335-8.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Anyone knows the hive repo for spark-2.0?
I saw the pom file having hive version as 1.2.1.spark2. But I cannot find the branch in https://github.com/pwendell/ Does anyone know where the repo is? Thanks. Zhan Zhang -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Anyone-knows-the-hive-repo-for-spark-2-0-tp18234.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Status: Patch Available (was: Open) > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, > HBASE-15335-6.patch, HBASE-15335-7.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Status: Open (was: Patch Available) > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, > HBASE-15335-6.patch, HBASE-15335-7.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Status: Open (was: Patch Available) > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, > HBASE-15335-6.patch, HBASE-15335-7.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Status: Patch Available (was: Open) > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, > HBASE-15335-6.patch, HBASE-15335-7.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: HBASE-15335-7.patch > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, > HBASE-15335-6.patch, HBASE-15335-7.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-16017) HBase TableOutputFormat has connection leak in getRecordWriter
[ https://issues.apache.org/jira/browse/HBASE-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-16017: --- Attachment: HBASE-16017-1.patch > HBase TableOutputFormat has connection leak in getRecordWriter > -- > > Key: HBASE-16017 > URL: https://issues.apache.org/jira/browse/HBASE-16017 > Project: HBase > Issue Type: Bug > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-16017-1.patch > > > Currently getRecordWriter will not release the connection until jvm > terminate, which is not a right assumption given that the function may be > invoked many times in the same jvm lifecycle. Inside of mapreduce, the issue > has already fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-16017) HBase TableOutputFormat has connection leak in getRecordWriter
[ https://issues.apache.org/jira/browse/HBASE-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-16017: --- Attachment: (was: HBASE-16017-1.patch) > HBase TableOutputFormat has connection leak in getRecordWriter > -- > > Key: HBASE-16017 > URL: https://issues.apache.org/jira/browse/HBASE-16017 > Project: HBase > Issue Type: Bug > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-16017-1.patch > > > Currently getRecordWriter will not release the connection until jvm > terminate, which is not a right assumption given that the function may be > invoked many times in the same jvm lifecycle. Inside of mapreduce, the issue > has already fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-16017) HBase TableOutputFormat has connection leak in getRecordWriter
[ https://issues.apache.org/jira/browse/HBASE-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-16017: --- Status: Patch Available (was: Open) > HBase TableOutputFormat has connection leak in getRecordWriter > -- > > Key: HBASE-16017 > URL: https://issues.apache.org/jira/browse/HBASE-16017 > Project: HBase > Issue Type: Bug > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-16017-1.patch > > > Currently getRecordWriter will not release the connection until jvm > terminate, which is not a right assumption given that the function may be > invoked many times in the same jvm lifecycle. Inside of mapreduce, the issue > has already fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-16017) HBase TableOutputFormat has connection leak in getRecordWriter
[ https://issues.apache.org/jira/browse/HBASE-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328354#comment-15328354 ] Zhan Zhang commented on HBASE-16017: [~te...@apache.org] Can you please take a look? It is a simple fix. To make the impact as small as possible, I added a new constructor, so that there is no change to any other modules. > HBase TableOutputFormat has connection leak in getRecordWriter > -- > > Key: HBASE-16017 > URL: https://issues.apache.org/jira/browse/HBASE-16017 > Project: HBase > Issue Type: Bug > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-16017-1.patch > > > Currently getRecordWriter will not release the connection until jvm > terminate, which is not a right assumption given that the function may be > invoked many times in the same jvm lifecycle. Inside of mapreduce, the issue > has already fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-16017) HBase TableOutputFormat has connection leak in getRecordWriter
[ https://issues.apache.org/jira/browse/HBASE-16017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-16017: --- Attachment: HBASE-16017-1.patch > HBase TableOutputFormat has connection leak in getRecordWriter > -- > > Key: HBASE-16017 > URL: https://issues.apache.org/jira/browse/HBASE-16017 > Project: HBase > Issue Type: Bug > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-16017-1.patch > > > Currently getRecordWriter will not release the connection until jvm > terminate, which is not a right assumption given that the function may be > invoked many times in the same jvm lifecycle. Inside of mapreduce, the issue > has already fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-16017) HBase TableOutputFormat has connection leak in getRecordWriter
Zhan Zhang created HBASE-16017: -- Summary: HBase TableOutputFormat has connection leak in getRecordWriter Key: HBASE-16017 URL: https://issues.apache.org/jira/browse/HBASE-16017 Project: HBase Issue Type: Bug Reporter: Zhan Zhang Assignee: Zhan Zhang Currently getRecordWriter will not release the connection until the JVM terminates, which is not a valid assumption given that the function may be invoked many times within the same JVM lifecycle. Inside MapReduce, the issue has already been fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
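As a rough, hedged sketch of the pattern the fix is after (not the actual patch), the writer should own its connection and release it in close() rather than leaving it open until JVM shutdown. Class and method names below are illustrative:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{BufferedMutator, Connection, ConnectionFactory, Put}

// Illustrative only: a writer that closes its own connection when it is closed,
// so each getRecordWriter-style call can create and later release its connection.
class ClosingHBaseWriter(conf: Configuration, table: String) {
  private val connection: Connection = ConnectionFactory.createConnection(conf)
  private val mutator: BufferedMutator = connection.getBufferedMutator(TableName.valueOf(table))

  def write(put: Put): Unit = mutator.mutate(put)

  def close(): Unit = {
    mutator.close()      // flush pending mutations
    connection.close()   // release the connection here, once per writer lifecycle
  }
}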
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: HBASE-15335-6.patch > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch, > HBASE-15335-6.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: HBASE-15335-5.patch > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch, HBASE-15335-5.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (SPARK-15848) Spark unable to read partitioned table in avro format and column name in upper case
[ https://issues.apache.org/jira/browse/SPARK-15848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-15848: --- Affects Version/s: 1.6.1 > Spark unable to read partitioned table in avro format and column name in > upper case > --- > > Key: SPARK-15848 > URL: https://issues.apache.org/jira/browse/SPARK-15848 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Zhan Zhang > > If external partitioned Hive tables created in Avro format. > Spark is returning "null" values if columns names are in Uppercase in the > Avro schema. > The same tables return proper data when queried in the Hive client. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15848) Spark unable to read partitioned table in avro format and column name in upper case
[ https://issues.apache.org/jira/browse/SPARK-15848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323195#comment-15323195 ] Zhan Zhang commented on SPARK-15848:
cat > file1.csv< file2.csv<
val tbl = sqlContext.table("default.avro_table_uppercase");
scala> tbl.show
+----------+----------+-----+----+
|student_id|subject_id|marks|year|
+----------+----------+-----+----+
|      null|      null|  100|2000|
|      null|      null|   20|2000|
|      null|      null|  160|2000|
|      null|      null|  963|2000|
|      null|      null|  142|2000|
|      null|      null|  430|2000|
|      null|      null|   91|2002|
|      null|      null|   28|2002|
|      null|      null|   16|2002|
|      null|      null|   96|2002|
|      null|      null|   14|2002|
|      null|      null|   43|2002|
+----------+----------+-----+----+
> Spark unable to read partitioned table in avro format and column name in > upper case > --- > > Key: SPARK-15848 > URL: https://issues.apache.org/jira/browse/SPARK-15848 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Zhan Zhang > > If external partitioned Hive tables created in Avro format. > Spark is returning "null" values if columns names are in Uppercase in the > Avro schema. > The same tables return proper data when queried in the Hive client. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15848) Spark unable to read partitioned table in avro format and column name in upper case
Zhan Zhang created SPARK-15848: -- Summary: Spark unable to read partitioned table in avro format and column name in upper case Key: SPARK-15848 URL: https://issues.apache.org/jira/browse/SPARK-15848 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhan Zhang For external partitioned Hive tables created in Avro format, Spark returns "null" values if column names are in uppercase in the Avro schema. The same tables return proper data when queried in the Hive client. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: HBASE-15335-4.patch > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch, HBASE-15335-4.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15473) Documentation for the usage of hbase dataframe user api (JSON, Avro, etc)
[ https://issues.apache.org/jira/browse/HBASE-15473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15473: --- Assignee: (was: Zhan Zhang) > Documentation for the usage of hbase dataframe user api (JSON, Avro, etc) > - > > Key: HBASE-15473 > URL: https://issues.apache.org/jira/browse/HBASE-15473 > Project: HBase > Issue Type: Sub-task > Components: documentation, spark > Reporter: Zhan Zhang >Priority: Blocker > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang reassigned HBASE-14801: -- Assignee: Zhan Zhang > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-14801-1.patch, HBASE-14801-2.patch, > HBASE-14801-3.patch, HBASE-14801-4.patch, HBASE-14801-5.patch, > HBASE-14801-6.patch, HBASE-14801-7.patch, HBASE-14801-8.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14801) Enhance the Spark-HBase connector catalog with json format
[ https://issues.apache.org/jira/browse/HBASE-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-14801: --- Assignee: (was: Zhan Zhang) > Enhance the Spark-HBase connector catalog with json format > -- > > Key: HBASE-14801 > URL: https://issues.apache.org/jira/browse/HBASE-14801 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Attachments: HBASE-14801-1.patch, HBASE-14801-2.patch, > HBASE-14801-3.patch, HBASE-14801-4.patch, HBASE-14801-5.patch, > HBASE-14801-6.patch, HBASE-14801-7.patch, HBASE-14801-8.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-15441) dataset outer join seems to return incorrect result
[ https://issues.apache.org/jira/browse/SPARK-15441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297881#comment-15297881 ] Zhan Zhang commented on SPARK-15441: Currently new GenericInternalRow(right.output.length) is used as nullRow, but actually it cannot be used to identify the difference of row itself is null or all columns are null. Probably we can add a special row nullRow to represent that the InternalRow itself is null, so that Encoder can identify whether the object itself is null or not. > dataset outer join seems to return incorrect result > --- > > Key: SPARK-15441 > URL: https://issues.apache.org/jira/browse/SPARK-15441 > Project: Spark > Issue Type: Bug > Components: sq; >Reporter: Reynold Xin >Assignee: Wenchen Fan >Priority: Critical > > See notebook > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2836020637783173/5382278320999420/latest.html > {code} > import org.apache.spark.sql.functions > val left = List(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDS() > val right = List(("a", "x"), ("b", "y"), ("d", "z")).toDS() > // The last row _1 should be null, rather than (null, -1) > left.toDF("k", "v").as[(String, Int)].alias("left") > .joinWith(right.toDF("k", "u").as[(String, String)].alias("right"), > functions.col("left.k") === functions.col("right.k"), "right_outer") > .show() > {code} > The returned result currently is > {code} > +-+-+ > | _1| _2| > +-+-+ > |(a,2)|(a,x)| > |(a,1)|(a,x)| > |(b,3)|(b,y)| > |(null,-1)|(d,z)| > +-+-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: right outer joins on Datasets
The reason for "-1" is that the default value for Integer is -1 if the value is null def defaultValue(jt: String): String = jt match { ... case JAVA_INT => "-1" ... } -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/right-outer-joins-on-Datasets-tp17542p17651.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Status: Patch Available (was: Open) > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Status: Open (was: Patch Available) > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: HBASE-15335-3.patch > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch, > HBASE-15335-3.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-15473) Documentation for the usage of hbase dataframe user api (JSON, Avro, etc)
[ https://issues.apache.org/jira/browse/HBASE-15473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289993#comment-15289993 ] Zhan Zhang commented on HBASE-15473: [~WeiqingYang] is working on this. > Documentation for the usage of hbase dataframe user api (JSON, Avro, etc) > - > > Key: HBASE-15473 > URL: https://issues.apache.org/jira/browse/HBASE-15473 > Project: HBase > Issue Type: Sub-task > Components: documentation, spark > Reporter: Zhan Zhang > Assignee: Zhan Zhang >Priority: Blocker > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: HBASE-15335-2.patch > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: HBASE-15335-2.patch > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: HBASE-15335-2.patch > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: (was: HBASE-15335-2.patch) > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch, HBASE-15335-2.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15335) Add composite key support in row key
[ https://issues.apache.org/jira/browse/HBASE-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15335: --- Attachment: (was: HBASE-15335-2.patch) > Add composite key support in row key > > > Key: HBASE-15335 > URL: https://issues.apache.org/jira/browse/HBASE-15335 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15335-1.patch > > > Add composite key filter support in the connector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-15825) Fix the null pointer in DynamicLogicExpressionSuite
[ https://issues.apache.org/jira/browse/HBASE-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284923#comment-15284923 ] Zhan Zhang commented on HBASE-15825: [~ted_yu] Thanks a lot > Fix the null pointer in DynamicLogicExpressionSuite > --- > > Key: HBASE-15825 > URL: https://issues.apache.org/jira/browse/HBASE-15825 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Fix For: 2.0.0 > > Attachments: HBASE-15825-1.patch > > > It only happens in test cases. Not sure why it is not caught. Will submit > patch soon -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15825) Fix the null pointer in DynamicLogicExpressionSuite
[ https://issues.apache.org/jira/browse/HBASE-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15825: --- Status: Patch Available (was: Open) > Fix the null pointer in DynamicLogicExpressionSuite > --- > > Key: HBASE-15825 > URL: https://issues.apache.org/jira/browse/HBASE-15825 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Fix For: 2.0.0 > > Attachments: HBASE-15825-1.patch > > > It only happens in test cases. Not sure why it is not caught. Will submit > patch soon -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-15825) Fix the null pointer in DynamicLogicExpressionSuite
[ https://issues.apache.org/jira/browse/HBASE-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15283172#comment-15283172 ] Zhan Zhang commented on HBASE-15825: [~te...@apache.org] [~jmhsieh] Can you please take a quick look? It is a straightforward fix. > Fix the null pointer in DynamicLogicExpressionSuite > --- > > Key: HBASE-15825 > URL: https://issues.apache.org/jira/browse/HBASE-15825 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Fix For: 2.0.0 > > Attachments: HBASE-15825-1.patch > > > It only happens in test cases. Not sure why it is not caught. Will submit > patch soon -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15825) Fix the null pointer in DynamicLogicExpressionSuite
[ https://issues.apache.org/jira/browse/HBASE-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15825: --- Attachment: HBASE-15825-1.patch > Fix the null pointer in DynamicLogicExpressionSuite > --- > > Key: HBASE-15825 > URL: https://issues.apache.org/jira/browse/HBASE-15825 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Fix For: 2.0.0 > > Attachments: HBASE-15825-1.patch > > > It only happens in test cases. Not sure why it is not caught. Will submit > patch soon -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-15333) [hbase-spark] Enhance dataframe filters to handle naively encoded short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15283159#comment-15283159 ] Zhan Zhang commented on HBASE-15333: Open jira HBASE-15825 to fix the test case. > [hbase-spark] Enhance dataframe filters to handle naively encoded short, > integer, long, float and double > > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Components: spark > Reporter: Zhan Zhang >Assignee: Zhan Zhang > Fix For: 2.0.0 > > Attachments: HBASE-15333-1.patch, HBASE-15333-10.patch, > HBASE-15333-2.patch, HBASE-15333-3.patch, HBASE-15333-4.patch, > HBASE-15333-5.patch, HBASE-15333-6.patch, HBASE-15333-7.patch, > HBASE-15333-8.patch, HBASE-15333-9.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-15825) Fix the null pointer in DynamicLogicExpressionSuite
Zhan Zhang created HBASE-15825: -- Summary: Fix the null pointer in DynamicLogicExpressionSuite Key: HBASE-15825 URL: https://issues.apache.org/jira/browse/HBASE-15825 Project: HBase Issue Type: Sub-task Reporter: Zhan Zhang It only happens in test cases. Not sure why it is not caught. Will submit patch soon -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-15333) [hbase-spark] Enhance dataframe filters to handle naively encoded short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15283153#comment-15283153 ] Zhan Zhang commented on HBASE-15333: Not sure why it is not caught during the systest. The failure happens only in test cases. Will open a followup jira to fix the issue. > [hbase-spark] Enhance dataframe filters to handle naively encoded short, > integer, long, float and double > > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Components: spark > Reporter: Zhan Zhang >Assignee: Zhan Zhang > Fix For: 2.0.0 > > Attachments: HBASE-15333-1.patch, HBASE-15333-10.patch, > HBASE-15333-2.patch, HBASE-15333-3.patch, HBASE-15333-4.patch, > HBASE-15333-5.patch, HBASE-15333-6.patch, HBASE-15333-7.patch, > HBASE-15333-8.patch, HBASE-15333-9.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15333: --- Attachment: HBASE-15333-10.patch > Enhance the filter to handle short, integer, long, float and double > --- > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15333-1.patch, HBASE-15333-10.patch, > HBASE-15333-2.patch, HBASE-15333-3.patch, HBASE-15333-4.patch, > HBASE-15333-5.patch, HBASE-15333-6.patch, HBASE-15333-7.patch, > HBASE-15333-8.patch, HBASE-15333-9.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280574#comment-15280574 ] Zhan Zhang commented on HBASE-15333: [~jmhsieh] Would you like to take a final look? Thanks. > Enhance the filter to handle short, integer, long, float and double > --- > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, > HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch, > HBASE-15333-6.patch, HBASE-15333-7.patch, HBASE-15333-8.patch, > HBASE-15333-9.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274815#comment-15274815 ] Zhan Zhang commented on HBASE-15333: I checked the warning, they are false positive. > Enhance the filter to handle short, integer, long, float and double > --- > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, > HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch, > HBASE-15333-6.patch, HBASE-15333-7.patch, HBASE-15333-8.patch, > HBASE-15333-9.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15333: --- Attachment: HBASE-15333-9.patch > Enhance the filter to handle short, integer, long, float and double > --- > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, > HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch, > HBASE-15333-6.patch, HBASE-15333-7.patch, HBASE-15333-8.patch, > HBASE-15333-9.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15333: --- Attachment: HBASE-15333-8.patch > Enhance the filter to handle short, integer, long, float and double > --- > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, > HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch, > HBASE-15333-6.patch, HBASE-15333-7.patch, HBASE-15333-8.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15333: --- Attachment: HBASE-15333-7.patch solve warnings > Enhance the filter to handle short, integer, long, float and double > --- > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, > HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch, > HBASE-15333-6.patch, HBASE-15333-7.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated HBASE-15333: --- Attachment: HBASE-15333-6.patch Address review comments. Restructure the encoder as a plugin so that it can be extended to support other formats. > Enhance the filter to handle short, integer, long, float and double > --- > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, > HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch, > HBASE-15333-6.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15267434#comment-15267434 ] Zhan Zhang commented on HBASE-15333: Thanks for the feedback; I will restructure the code and send it out for review this week. > Enhance the filter to handle short, integer, long, float and double > --- > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, > HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-15333) Enhance the filter to handle short, integer, long, float and double
[ https://issues.apache.org/jira/browse/HBASE-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260928#comment-15260928 ] Zhan Zhang commented on HBASE-15333: [~jmhsieh] Thanks for reviewing the code. I want to discuss a few points in more detail before changing the code. Here are my comments:
1. DefaultSource.scala: It is not a replacement; instead it fixes the logic in the partition pruning and predicate pushdown (here we assume the data is naively encoded, the same as the current code base).
2. DefaultSource.scala:618: typo - will fix it.
3. Makes sense. Will do it.
4.1 Will move it to a separate class.
4.2 We are not assuming any specific encoding/decoding here, since we want to support the Java primitive types as already done in the existing codebase. You are definitely right that we may want more flexibility and different encoders/decoders. I think fixing the current naive byte handling takes priority and can be the first step.
5. If we don't change it and there is a PassFilter, it will crash the region server.
6. BoundRange.scala:26: Will format the doc in a more formal way. Typically the bound is inclusive, but the upper-level logic needs to take care of exclusive bounds for special cases.
7. Will change FilterOps to JavaBytesEncoder.
8. Will enhance the current test cases.
Overall, I think making the code base handle naive encoding correctly can be the first step, and at the framework level this does not prevent adding special encodings/decodings later, by me or other contributors. What do you think? Please let me know if you have any concerns. > Enhance the filter to handle short, integer, long, float and double > --- > > Key: HBASE-15333 > URL: https://issues.apache.org/jira/browse/HBASE-15333 > Project: HBase > Issue Type: Sub-task > Reporter: Zhan Zhang > Assignee: Zhan Zhang > Attachments: HBASE-15333-1.patch, HBASE-15333-2.patch, > HBASE-15333-3.patch, HBASE-15333-4.patch, HBASE-15333-5.patch > > > Currently, the range filter is based on the order of bytes. But for java > primitive type, such as short, int, long, double, float, etc, their order is > not consistent with their byte order, extra manipulation has to be in place > to take care of them correctly. > For example, for the integer range (-100, 100), the filter <= 1, the current > filter will return 0 and 1, and the right return value should be (-100, 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
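A small, hedged illustration of the byte-order problem under discussion, using the standard HBase Bytes utility (the values are just examples):

import org.apache.hadoop.hbase.util.Bytes

// -100 encodes as 0xFFFFFF9C and 1 as 0x00000001, so in unsigned lexicographic byte
// order -100 sorts *after* 1, even though -100 < 1 numerically. A range filter that
// compares raw bytes therefore misbehaves on naively encoded signed values.
val neg = Bytes.toBytes(-100)
val one = Bytes.toBytes(1)
println(Bytes.compareTo(neg, one) > 0)   // true: byte order disagrees with numeric order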
Re: How this unit test passed on master trunk?
There are multiple records for the DF scala> structDF.groupBy($"a").agg(min(struct($"record.*"))).show +---+-+ | a|min(struct(unresolvedstar()))| +---+-+ | 1|[1,1]| | 3|[3,1]| | 2|[2,1]| The meaning of .groupBy($"a").agg(min(struct($"record.*"))) is to get the min for all the records with the same $”a” For example: TestData2(1,1) :: TestData2(1,2) The result would be 1, (1, 1), since struct(1, 1) is less than struct(1, 2). Please check how the Ordering is implemented in InterpretedOrdering. The output itself does not have any ordering. I am not sure why the unit test and the real env have different environment. Xiao, I do see the difference between unit test and local cluster run. Do you know the reason? Thanks. Zhan Zhang On Apr 22, 2016, at 11:23 AM, Yong Zhang <java8...@hotmail.com<mailto:java8...@hotmail.com>> wrote: Hi, I was trying to find out why this unit test can pass in Spark code. in https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala for this unit test: test("Star Expansion - CreateStruct and CreateArray") { val structDf = testData2.select("a", "b").as("record") // CreateStruct and CreateArray in aggregateExpressions assert(structDf.groupBy($"a").agg(min(struct($"record.*"))).first() == Row(3, Row(3, 1))) assert(structDf.groupBy($"a").agg(min(array($"record.*"))).first() == Row(3, Seq(3, 1))) // CreateStruct and CreateArray in project list (unresolved alias) assert(structDf.select(struct($"record.*")).first() == Row(Row(1, 1))) assert(structDf.select(array($"record.*")).first().getAs[Seq[Int]](0) === Seq(1, 1)) // CreateStruct and CreateArray in project list (alias) assert(structDf.select(struct($"record.*").as("a")).first() == Row(Row(1, 1))) assert(structDf.select(array($"record.*").as("a")).first().getAs[Seq[Int]](0) === Seq(1, 1)) } >From my understanding, the data return in this case should be Row(1, Row(1, >1]), as that will be min of struct. In fact, if I run the spark-shell on my laptop, and I got the result I expected: ./bin/spark-shell Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91) Type in expressions to have them evaluated. Type :help for more information. scala> case class TestData2(a: Int, b: Int) defined class TestData2 scala> val testData2DF = sqlContext.sparkContext.parallelize(TestData2(1,1) :: TestData2(1,2) :: TestData2(2,1) :: TestData2(2,2) :: TestData2(3,1) :: TestData2(3,2) :: Nil, 2).toDF() scala> val structDF = testData2DF.select("a","b").as("record") scala> structDF.groupBy($"a").agg(min(struct($"record.*"))).first() res0: org.apache.spark.sql.Row = [1,[1,1]] scala> structDF.show +---+---+ | a| b| +---+---+ | 1| 1| | 1| 2| | 2| 1| | 2| 2| | 3| 1| | 3| 2| +---+---+ So from my spark, which I built on the master, I cannot get Row[3,[1,1]] back in this case. Why the unit test asserts that Row[3,[1,1]] should be the first, and it will pass? But I cannot reproduce that in my spark-shell? I am trying to understand how to interpret the meaning of "agg(min(struct($"record.*")))" Thanks Yong
Re: Save DataFrame to HBase
You can try this https://github.com/hortonworks/shc.git or here http://spark-packages.org/package/zhzhan/shc Currently it is in the process of merging into HBase. Thanks. Zhan Zhang On Apr 21, 2016, at 8:44 AM, Benjamin Kim <bbuil...@gmail.com<mailto:bbuil...@gmail.com>> wrote: Hi Ted, Can this module be used with an older version of HBase, such as 1.0 or 1.1? Where can I get the module from? Thanks, Ben On Apr 21, 2016, at 6:56 AM, Ted Yu <yuzhih...@gmail.com<mailto:yuzhih...@gmail.com>> wrote: The hbase-spark module in Apache HBase (coming with hbase 2.0 release) can do this. On Thu, Apr 21, 2016 at 6:52 AM, Benjamin Kim <bbuil...@gmail.com<mailto:bbuil...@gmail.com>> wrote: Has anyone found an easy way to save a DataFrame into HBase? Thanks, Ben - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org<mailto:user-unsubscr...@spark.apache.org> For additional commands, e-mail: user-h...@spark.apache.org<mailto:user-h...@spark.apache.org>
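For completeness, a hedged sketch of writing a DataFrame through the shc connector, following the conventions in its README; the catalog JSON, table name, column mappings, and the existing DataFrame df are illustrative assumptions, and the package coordinates may differ by release:

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Illustrative catalog mapping DataFrame columns to an HBase table named "table1".
val catalog = """{
  |"table":{"namespace":"default", "name":"table1"},
  |"rowkey":"key",
  |"columns":{
    |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
    |"col1":{"cf":"cf1", "col":"col1", "type":"int"}
  |}
|}""".stripMargin

// df is assumed to be an existing DataFrame with columns col0 and col1.
df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()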
Re: Spark SQL insert overwrite table not showing all the partition.
INSERT OVERWRITE will overwrite any existing data in the table or partition, unless IF NOT EXISTS is provided for a partition (as of Hive 0.9.0<https://issues.apache.org/jira/browse/HIVE-2612>).

Thanks.

Zhan Zhang

On Apr 21, 2016, at 3:20 PM, Bijay Kumar Pathak <bkpat...@mtu.edu<mailto:bkpat...@mtu.edu>> wrote:

Hi,

I have a job which writes to a Hive table with dynamic partitions. Inside the job, I write into the table two times, but I only see the partition from the last write, although I can see in the Spark UI that it is processing data for both partitions. Below is the query I am using to write to the table.

hive_c.sql("""INSERT OVERWRITE TABLE base_table PARTITION (date='{1}', date_2)
              SELECT * from temp_table
           """.format(date_val))

Thanks,
Bijay
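A rough sketch of two ways around this, assuming a table partitioned by (date, date_2); the table, staging-table and column names are made up for illustration and the SQL would need adapting to the real schema:

// Hypothetical value of the static partition column.
val dateVal = "2016-04-21"

// Option 1: combine the two intermediate results first, then overwrite once,
// so a single INSERT OVERWRITE carries every dynamic date_2 value.
hiveContext.sql(
  s"""INSERT OVERWRITE TABLE base_table PARTITION (date='$dateVal', date_2)
     |SELECT * FROM (
     |  SELECT * FROM temp_table_1
     |  UNION ALL
     |  SELECT * FROM temp_table_2
     |) unioned""".stripMargin)

// Option 2: make date_2 static as well, so each statement only replaces its own
// partition; the SELECT list must then exclude the partition columns.
hiveContext.sql(
  s"""INSERT OVERWRITE TABLE base_table PARTITION (date='$dateVal', date_2='x')
     |SELECT col_a, col_b FROM temp_table_1""".stripMargin)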
Re: Spark DataFrame sum of multiple columns
You can define your own UDF; the following is one example (taken from the Spark test suite, truncated as in the original):

val foo = udf((a: Int, b: String) => a.toString + b)

checkAnswer(
  // SELECT *, foo(key, value) FROM testData
  testData.select($"*", foo('key, 'value)).limit(3),

Thanks

Zhan Zhang

On Apr 21, 2016, at 8:51 PM, Naveen Kumar Pokala <npok...@spcapitaliq.com<mailto:npok...@spcapitaliq.com>> wrote:

Hi,

Do we have any way to perform row-level operations on Spark DataFrames?

For example, I have a DataFrame with columns from A, B, C, ..., Z. I want to add one more column, "New Column", with the sum of all column values.

A  B  C  D  ...  Z   New Column
1  2  4  3  ...  26  351

Can somebody help me on this?

Thanks,
Naveen
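A minimal sketch of one way to do this without writing a UDF at all, by folding the + operator over the column list; the DataFrame df and the excluded "id" column are hypothetical, and all summed columns are assumed to be numeric:

import org.apache.spark.sql.functions.col

// Builds the expression colA + colB + ... + colZ and attaches it as a new column.
val colsToSum = df.columns.filterNot(_ == "id")
val total     = colsToSum.map(col).reduce(_ + _)
val withSum   = df.withColumn("New Column", total)
withSum.show()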
Re: Why Spark having OutOfMemory Exception?
The data may not be large, but the driver needs to do a lot of bookkeeping. In your case, it is possible the driver control plane takes too much memory. I think you can find a Java developer to look at the core dump. Otherwise, it is hard to tell exactly which part is using all the memory.

Thanks.

Zhan Zhang

On Apr 20, 2016, at 1:38 AM, 李明伟 <kramer2...@126.com<mailto:kramer2...@126.com>> wrote:

Hi

The input data size is less than 10M. The task result size should be less, I think, because I am doing aggregation on the data.

At 2016-04-20 16:18:31, "Jeff Zhang" <zjf...@gmail.com<mailto:zjf...@gmail.com>> wrote:

Do you mean the input data size is 10M, or the task result size?

>>> But my way is to setup a forever loop to handle continued income data. Not
>>> sure if it is the right way to use spark

Not sure what this means. Do you use Spark Streaming, or do you run a batch job in the forever loop?

On Wed, Apr 20, 2016 at 3:55 PM, 李明伟 <kramer2...@126.com<mailto:kramer2...@126.com>> wrote:

Hi Jeff

The total size of my data is less than 10M. I already set the driver memory to 4GB.

On 2016-04-20 13:42:25, "Jeff Zhang" <zjf...@gmail.com<mailto:zjf...@gmail.com>> wrote:

It seems to be an OOM on the driver side when fetching task results. You can try to increase spark.driver.memory and spark.driver.maxResultSize.

On Tue, Apr 19, 2016 at 4:06 PM, 李明伟 <kramer2...@126.com<mailto:kramer2...@126.com>> wrote:

Hi Zhan Zhang

Please see the exception trace below. It is reporting a GC overhead limit error. I am not a Java or Scala developer, so it is hard for me to understand this information, and reading the core dump is too difficult for me. I am not sure if the way I am using Spark is correct. I understand that Spark can do batch or stream calculation, but my way is to set up a forever loop to handle continuously incoming data.
Not sure if it is the right way to use Spark.

16/04/19 15:54:55 ERROR Utils: Uncaught exception in thread task-result-getter-2
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
    at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
    at scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
    at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
    at org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
    at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
    at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
    at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
    at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
    at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
    at scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
    at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
    at sun.reflect.DelegatingMethod
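A rough sketch of the driver-side settings mentioned above, expressed through SparkConf; the values and application name are only illustrative, and in practice spark.driver.memory usually has to be given on the spark-submit command line (e.g. --driver-memory 4g) because the driver JVM is already running by the time application code sets it:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("aggregation-loop")            // hypothetical application name
  .set("spark.driver.memory", "4g")          // driver heap; must be set before the driver JVM starts
  .set("spark.driver.maxResultSize", "2g")   // cap on serialized task results collected back to the driver
val sc = new SparkContext(conf)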
Re: [GRAPHX] Graph Algorithms and Spark
You can take a look at this blog post from Databricks about GraphFrames: https://databricks.com/blog/2016/03/03/introducing-graphframes.html

Thanks.

Zhan Zhang

On Apr 21, 2016, at 12:53 PM, Robin East <robin.e...@xense.co.uk<mailto:robin.e...@xense.co.uk>> wrote:

Hi

Aside from LDA, which is implemented in MLlib, GraphX has the following built-in algorithms:

* PageRank/Personalised PageRank
* Connected Components
* Strongly Connected Components
* Triangle Count
* Shortest Paths
* Label Propagation

It also implements a version of the Pregel framework, a form of bulk-synchronous parallel processing that is the foundation of most of the above algorithms. We cover other algorithms in our book, and if you search on Google you will find a number of other examples.

---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action

On 21 Apr 2016, at 19:47, tgensol <thibaut.gensol...@gmail.com<mailto:thibaut.gensol...@gmail.com>> wrote:

Hi there,

I am working in a group at the University of Michigan, and we are trying to implement (and first find) some distributed graph algorithms. I know Spark, and I found GraphX. I read the docs, but I only found Latent Dirichlet Allocation algorithms working with GraphX, so I was wondering why.

Basically, the group wants to implement Minimal Spanning Tree, kNN, and shortest path at first. So my questions are:

Is GraphX stable enough for developing this kind of algorithm on it?
Do you know of algorithms like these working on top of GraphX? And if not, why do you think nobody has tried to do it? Is it too hard, or just because nobody needs it?

Maybe it is only my knowledge of GraphX that is weak, and it is not possible to implement these algorithms with GraphX.

Thanking you in advance,

Best regards,

Thibaut

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/GRAPHX-Graph-Algorithms-and-Spark-tp17301.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com<http://nabble.com/>.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org<mailto:dev-unsubscr...@spark.apache.org>
For additional commands, e-mail: dev-h...@spark.apache.org<mailto:dev-h...@spark.apache.org>
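As a small illustration of calling one of the built-in GraphX algorithms listed above, here is a sketch of connected components on a toy edge list; the vertex ids, edge attribute and the SparkContext sc are assumed only for the example:

import org.apache.spark.graphx.{Edge, Graph}

// Toy graph with two components: {1, 2, 3} and {5, 6}.
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(5L, 6L, "follows")))

val graph = Graph.fromEdges(edges, defaultValue = "user")

// Each vertex is labelled with the smallest vertex id in its component.
val components = graph.connectedComponents().vertices
components.collect().foreach { case (id, comp) =>
  println(s"vertex $id -> component $comp")
}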