[jira] [Commented] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory

2019-04-25 Thread Mike Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825798#comment-16825798
 ] 

Mike Chan commented on SPARK-13510:
---

Thanks man

> Shuffle may throw FetchFailedException: Direct buffer memory
> 
>
> Key: SPARK-13510
> URL: https://issues.apache.org/jira/browse/SPARK-13510
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Hong Shen
>Priority: Major
> Attachments: spark-13510.diff
>
>
> In our cluster, when I test spark-1.6.0 with a SQL query, it throws an 
> exception and fails.
> {code}
> 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request 
> for 1 blocks (915.4 MB) from 10.196.134.220:7337
> 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch 
> from 10.196.134.220:7337 (executor id 122)
> 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 
> to /10.196.134.220:7337
> 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in 
> connection from /10.196.134.220:7337
> java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:658)
>   at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645)
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:212)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
>   at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
>   at 
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:744)
> 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 
> requests outstanding when connection from /10.196.134.220:7337 is closed
> 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block 
> shuffle_3_81_2, and will not retry (0 retries)
> {code}
>   The reason is that when shuffling a big block (like 1 GB), the task will 
> allocate the same amount of memory, so it easily throws 
> "FetchFailedException: Direct buffer memory".
>   If I add -Dio.netty.noUnsafe=true to spark.executor.extraJavaOptions, it 
> throws 
> {code}
> java.lang.OutOfMemoryError: Java heap space
> at 
> io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607)
> at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:215)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
> {code}
>   
>   In the MapReduce shuffle, it first checks whether the block can be cached 
> in memory, but Spark doesn't. 
>   If the block is larger than what we can cache in memory, we should write 
> it to disk.
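For readers hitting the same direct-buffer OOM, a minimal PySpark sketch of the knob referenced above: passing JVM options to executors through spark.executor.extraJavaOptions. The -XX:MaxDirectMemorySize value and the application name are illustrative assumptions, not settings taken from this ticket.

{code}
from pyspark.sql import SparkSession

# Sketch only: set an explicit direct-memory cap for executor JVMs so Netty's
# pooled direct buffers have known headroom during large shuffle fetches.
# The 4g figure is an assumption; size it against your executor memory.
spark = (
    SparkSession.builder
    .appName("direct-buffer-headroom-sketch")  # hypothetical app name
    .config("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=4g")
    # Alternative tried in this ticket: -Dio.netty.noUnsafe=true, which moves
    # the allocation onto the Java heap (and, as shown above, can OOM there).
    .getOrCreate()
)
{code}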






[jira] [Commented] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory

2019-04-24 Thread Mike Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825721#comment-16825721
 ] 

Mike Chan commented on SPARK-13510:
---

Hi [~belvey], I'm having a similar issue and our 
"spark.maxRemoteBlockSizeFetchToMem" is at 188. From the forums I can tell 
this parameter should be set below 2 GB. How do you set your parameter? Should 
it be "2g" or 2 * 1024 * 1024 * 1024 = 2147483648? I'm on Spark 2.3 on Azure.

> Shuffle may throw FetchFailedException: Direct buffer memory
> 
>
> Key: SPARK-13510
> URL: https://issues.apache.org/jira/browse/SPARK-13510
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Hong Shen
>Priority: Major
> Attachments: spark-13510.diff
>
>
> In our cluster, when I test spark-1.6.0 with a SQL query, it throws an 
> exception and fails.
> {code}
> 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request 
> for 1 blocks (915.4 MB) from 10.196.134.220:7337
> 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch 
> from 10.196.134.220:7337 (executor id 122)
> 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 
> to /10.196.134.220:7337
> 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in 
> connection from /10.196.134.220:7337
> java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:658)
>   at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645)
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:212)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
>   at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
>   at 
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:744)
> 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 
> requests outstanding when connection from /10.196.134.220:7337 is closed
> 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block 
> shuffle_3_81_2, and will not retry (0 retries)
> {code}
>   The reason is that when shuffling a big block (like 1 GB), the task will 
> allocate the same amount of memory, so it easily throws 
> "FetchFailedException: Direct buffer memory".
>   If I add -Dio.netty.noUnsafe=true to spark.executor.extraJavaOptions, it 
> throws 
> {code}
> java.lang.OutOfMemoryError: Java heap space
> at 
> io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607)
> at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:215)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
> {code}
>   
>   In the MapReduce shuffle, it first checks whether the block can be cached 
> in memory, but Spark doesn't. 
>   If the block is larger than what we can cache in memory, we should write 
> it to disk.






[jira] [Commented] (SPARK-27505) autoBroadcastJoinThreshold including bigger table

2019-04-22 Thread Mike Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823637#comment-16823637
 ] 

Mike Chan commented on SPARK-27505:
---

Would you mind sharing any info on a self-reproducer? I tried to Google it 
myself but nothing came through. Thank you.

> autoBroadcastJoinThreshold including bigger table
> -
>
> Key: SPARK-27505
> URL: https://issues.apache.org/jira/browse/SPARK-27505
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Hive table with Spark 2.3.1 on Azure, using Azure 
> storage as storage layer
>Reporter: Mike Chan
>Priority: Major
> Attachments: explain_plan.txt
>
>
> I'm working on a case where, whenever a certain table is picked up for a 
> broadcast join, the query eventually fails with a remote block error. 
>  
> First, we set spark.sql.autoBroadcastJoinThreshold to 10 MB, namely 
> 10485760.
> (screenshot: the autoBroadcastJoinThreshold setting)
>  
> Then we proceeded to run the query. In the SQL plan, we found that one table 
> that is 25 MB in size is broadcast as well.
>  
> (screenshot: the SQL plan showing the 25 MB table being broadcast)
>  
> Also, in desc extended the table is 24452111 bytes. It is a Hive table. We 
> always run into an error when this table is broadcast. Below is a sample 
> error:
>  
> Caused by: java.io.IOException: org.apache.spark.SparkException: corrupt 
> remote block broadcast_477_piece0 of broadcast_477: 298778625 != -992055931 
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1350) at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) 
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>  
> Also attached the physical plan if you're interested. One thing to note: if 
> I turn autoBroadcastJoinThreshold down to 5 MB, this query gets executed 
> successfully and default.product is NOT broadcast.
> However, when I change to another query that selects even fewer columns 
> than the previous one, this table still gets broadcast even at 5 MB and 
> fails with the same error. I even changed it to 1 MB and still got the 
> same result.
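A minimal PySpark sketch of the checks implied above, assuming (not confirmed by this ticket) that a stale or missing size estimate is why a ~24 MB Hive table slips under the 10 MB threshold; the table name default.product comes from the description, while the workflow itself is a general suggestion.

{code}
# Assumes an active SparkSession bound to `spark`.
# 10 MB threshold, as described above.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)

# Refresh the Hive table statistics the optimizer uses for its size estimate.
spark.sql("ANALYZE TABLE default.product COMPUTE STATISTICS")

# Inspect the Statistics row that 'desc extended' reports (24452111 bytes here).
spark.sql("DESCRIBE EXTENDED default.product") \
    .filter("col_name = 'Statistics'") \
    .show(truncate=False)

# Blunt workaround if the bad broadcast persists: disable auto-broadcast joins.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
{code}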






[jira] [Updated] (SPARK-27505) autoBroadcastJoinThreshold including bigger table

2019-04-19 Thread Mike Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Chan updated SPARK-27505:
--
Issue Type: Bug  (was: Question)

> autoBroadcastJoinThreshold including bigger table
> -
>
> Key: SPARK-27505
> URL: https://issues.apache.org/jira/browse/SPARK-27505
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Hive table with Spark 2.3.1 on Azure, using Azure 
> storage as storage layer
>Reporter: Mike Chan
>Priority: Major
> Attachments: explain_plan.txt
>
>
> I'm working on a case where, whenever a certain table is picked up for a 
> broadcast join, the query eventually fails with a remote block error. 
>  
> First, we set spark.sql.autoBroadcastJoinThreshold to 10 MB, namely 
> 10485760.
> (screenshot: the autoBroadcastJoinThreshold setting)
>  
> Then we proceeded to run the query. In the SQL plan, we found that one table 
> that is 25 MB in size is broadcast as well.
>  
> (screenshot: the SQL plan showing the 25 MB table being broadcast)
>  
> Also, in desc extended the table is 24452111 bytes. It is a Hive table. We 
> always run into an error when this table is broadcast. Below is a sample 
> error:
>  
> Caused by: java.io.IOException: org.apache.spark.SparkException: corrupt 
> remote block broadcast_477_piece0 of broadcast_477: 298778625 != -992055931 
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1350) at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) 
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>  
> Also attached the physical plan if you're interested. One thing to note: if 
> I turn autoBroadcastJoinThreshold down to 5 MB, this query gets executed 
> successfully and default.product is NOT broadcast.
> However, when I change to another query that selects even fewer columns 
> than the previous one, this table still gets broadcast even at 5 MB and 
> fails with the same error. I even changed it to 1 MB and still got the 
> same result.






[jira] [Commented] (SPARK-27505) autoBroadcastJoinThreshold including bigger table

2019-04-18 Thread Mike Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820951#comment-16820951
 ] 

Mike Chan commented on SPARK-27505:
---

Table desc extended result:

Statistics: 24452111 bytes

> autoBroadcastJoinThreshold including bigger table
> -
>
> Key: SPARK-27505
> URL: https://issues.apache.org/jira/browse/SPARK-27505
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Hive table with Spark 2.3.1 on Azure, using Azure 
> storage as storage layer
>Reporter: Mike Chan
>Priority: Major
> Attachments: explain_plan.txt
>
>
> I'm working on a case where, whenever a certain table is picked up for a 
> broadcast join, the query eventually fails with a remote block error. 
>  
> First, we set spark.sql.autoBroadcastJoinThreshold to 10 MB, namely 
> 10485760.
> (screenshot: the autoBroadcastJoinThreshold setting)
>  
> Then we proceeded to run the query. In the SQL plan, we found that one table 
> that is 25 MB in size is broadcast as well.
>  
> (screenshot: the SQL plan showing the 25 MB table being broadcast)
>  
> Also, in desc extended the table is 24452111 bytes. It is a Hive table. We 
> always run into an error when this table is broadcast. Below is a sample 
> error:
>  
> Caused by: java.io.IOException: org.apache.spark.SparkException: corrupt 
> remote block broadcast_477_piece0 of broadcast_477: 298778625 != -992055931 
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1350) at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) 
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>  
> Also attached the physical plan if you're interested. One thing to note: if 
> I turn autoBroadcastJoinThreshold down to 5 MB, this query gets executed 
> successfully and default.product is NOT broadcast.
> However, when I change to another query that selects even fewer columns 
> than the previous one, this table still gets broadcast even at 5 MB and 
> fails with the same error. I even changed it to 1 MB and still got the 
> same result.






[jira] [Updated] (SPARK-27505) autoBroadcastJoinThreshold including bigger table

2019-04-18 Thread Mike Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Chan updated SPARK-27505:
--
Attachment: explain_plan.txt

> autoBroadcastJoinThreshold including bigger table
> -
>
> Key: SPARK-27505
> URL: https://issues.apache.org/jira/browse/SPARK-27505
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Hive table with Spark 2.3.1 on Azure, using Azure 
> storage as storage layer
>Reporter: Mike Chan
>Priority: Major
> Attachments: explain_plan.txt
>
>
> I'm working on a case where, whenever a certain table is picked up for a 
> broadcast join, the query eventually fails with a remote block error. 
>  
> First, we set spark.sql.autoBroadcastJoinThreshold to 10 MB, namely 
> 10485760.
> (screenshot: the autoBroadcastJoinThreshold setting)
>  
> Then we proceeded to run the query. In the SQL plan, we found that one table 
> that is 25 MB in size is broadcast as well.
>  
> (screenshot: the SQL plan showing the 25 MB table being broadcast)
>  
> Also, in desc extended the table is 24452111 bytes. It is a Hive table. We 
> always run into an error when this table is broadcast. Below is a sample 
> error:
>  
> Caused by: java.io.IOException: org.apache.spark.SparkException: corrupt 
> remote block broadcast_477_piece0 of broadcast_477: 298778625 != -992055931 
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1350) at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) 
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>  
> Also attached the physical plan if you're interested. One thing to note: if 
> I turn autoBroadcastJoinThreshold down to 5 MB, this query gets executed 
> successfully and default.product is NOT broadcast.
> However, when I change to another query that selects even fewer columns 
> than the previous one, this table still gets broadcast even at 5 MB and 
> fails with the same error. I even changed it to 1 MB and still got the 
> same result.






[jira] [Created] (SPARK-27505) autoBroadcastJoinThreshold including bigger table

2019-04-18 Thread Mike Chan (JIRA)
Mike Chan created SPARK-27505:
-

 Summary: autoBroadcastJoinThreshold including bigger table
 Key: SPARK-27505
 URL: https://issues.apache.org/jira/browse/SPARK-27505
 Project: Spark
  Issue Type: Question
  Components: PySpark
Affects Versions: 2.3.1
 Environment: Hive table with Spark 2.3.1 on Azure, using Azure storage 
as storage layer
Reporter: Mike Chan


I'm working on a case where, whenever a certain table is picked up for a 
broadcast join, the query eventually fails with a remote block error. 
 
First, we set spark.sql.autoBroadcastJoinThreshold to 10 MB, namely 10485760.
(screenshot: the autoBroadcastJoinThreshold setting)
 
Then we proceeded to run the query. In the SQL plan, we found that one table 
that is 25 MB in size is broadcast as well.
 
(screenshot: the SQL plan showing the 25 MB table being broadcast)
 
Also, in desc extended the table is 24452111 bytes. It is a Hive table. We 
always run into an error when this table is broadcast. Below is a sample error:
 
Caused by: java.io.IOException: org.apache.spark.SparkException: corrupt remote 
block broadcast_477_piece0 of broadcast_477: 298778625 != -992055931 at 
org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1350) at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
 at 
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
 at 
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) 
at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)

 
Also attached the physical plan if you're interested. One thing to note: if I 
turn autoBroadcastJoinThreshold down to 5 MB, this query gets executed 
successfully and default.product is NOT broadcast.
However, when I change to another query that selects even fewer columns than 
the previous one, this table still gets broadcast even at 5 MB and fails with 
the same error. I even changed it to 1 MB and still got the same result.






[jira] [Commented] (SPARK-25422) flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated (encryption = on) (with replication as stream)

2019-04-17 Thread Mike Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820707#comment-16820707
 ] 

Mike Chan commented on SPARK-25422:
---

Could this problem potentially hit Spark 2.3.1 as well? I have a new cluster 
at this version and always hit a corrupt remote block error when 1 specific 
table is involved. 

> flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated 
> (encryption = on) (with replication as stream)
> 
>
> Key: SPARK-25422
> URL: https://issues.apache.org/jira/browse/SPARK-25422
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> stacktrace
> {code}
>  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 7, localhost, executor 1): java.io.IOException: 
> org.apache.spark.SparkException: corrupt remote block broadcast_0_piece0 of 
> broadcast_0: 1651574976 != 1165629262
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1320)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>   at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:84)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:367)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1347)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:373)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: corrupt remote block 
> broadcast_0_piece0 of broadcast_0: 1651574976 != 1165629262
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:167)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:151)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:231)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1313)
>   ... 13 more
> {code}






[jira] [Resolved] (SPARK-27264) spark sql released all executor but the job is not done

2019-03-26 Thread Mike Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Chan resolved SPARK-27264.
---
Resolution: Invalid

Before making a further attempt, I want to ensure broadcast join is enabled on 
the cluster. Thank you.

> spark sql released all executor but the job is not done
> ---
>
> Key: SPARK-27264
> URL: https://issues.apache.org/jira/browse/SPARK-27264
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Azure HDinsight spark 2.4 on Azure storage SQL: Read and 
> Join some data and finally write result to a Hive metastore; query executed 
> on jupyterhub; while the pre-migration cluster is a jupyter (non-hub)
>Reporter: Mike Chan
>Priority: Major
>
> I have a Spark SQL job that used to execute in < 10 mins but now runs for 3 
> hours after a cluster migration, and I need to deep dive into what it's 
> actually doing. I'm new to Spark, so please don't mind if I'm asking 
> something unrelated.
> I increased spark.executor.memory but had no luck. Env: Azure HDInsight 
> Spark 2.4 on Azure storage. SQL: read and join some data and finally write 
> the result to a Hive metastore.
> The spark.sql ends with the code below: 
> .write.mode("overwrite").saveAsTable("default.mikemiketable")
> Application behavior: Within the first 15 mins, it loads and completes most 
> tasks (199/200); then only 1 executor process is left alive, continually 
> doing shuffle reads / writes. Because only 1 executor is left, we need to 
> wait 3 hours until this application finishes. 
> [!https://i.stack.imgur.com/6hqvh.png!|https://i.stack.imgur.com/6hqvh.png]
> Only 1 executor left alive: 
> [!https://i.stack.imgur.com/55162.png!|https://i.stack.imgur.com/55162.png]
> Not sure what the executor is doing: 
> [!https://i.stack.imgur.com/TwhuX.png!|https://i.stack.imgur.com/TwhuX.png]
> From time to time, we can tell the shuffle read increased: 
> [!https://i.stack.imgur.com/WhF9A.png!|https://i.stack.imgur.com/WhF9A.png]
> Therefore I increased spark.executor.memory to 20g, but nothing changed. 
> From Ambari and YARN I can tell the cluster has many resources left. 
> [!https://i.stack.imgur.com/pngQA.png!|https://i.stack.imgur.com/pngQA.png]
> Release of almost all executors: 
> [!https://i.stack.imgur.com/pA134.png!|https://i.stack.imgur.com/pA134.png]
> Any guidance is greatly appreciated.






[jira] [Commented] (SPARK-27264) spark sql released all executor but the job is not done

2019-03-26 Thread Mike Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16801437#comment-16801437
 ] 

Mike Chan commented on SPARK-27264:
---

I have repartitioned the data before the transformation and still found the 
performance unreasonably slow. I'll probably troubleshoot why broadcast join 
is disabled before making another attempt. Thank you.
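As a concrete illustration of the repartition-before-write idea mentioned here, a minimal sketch using the saveAsTable call from the ticket; df, the partition count, and the "join_key" column are illustrative assumptions.

{code}
# Sketch: spread rows across more, evenly sized partitions before the final
# write so the job is not funneled through a single straggler executor.
# df, 200, and "join_key" are placeholders, not values from this ticket.
(df.repartition(200, "join_key")
   .write
   .mode("overwrite")
   .saveAsTable("default.mikemiketable"))
{code}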

> spark sql released all executor but the job is not done
> ---
>
> Key: SPARK-27264
> URL: https://issues.apache.org/jira/browse/SPARK-27264
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Azure HDinsight spark 2.4 on Azure storage SQL: Read and 
> Join some data and finally write result to a Hive metastore; query executed 
> on jupyterhub; while the pre-migration cluster is a jupyter (non-hub)
>Reporter: Mike Chan
>Priority: Major
>
> I have a Spark SQL job that used to execute in < 10 mins but now runs for 3 
> hours after a cluster migration, and I need to deep dive into what it's 
> actually doing. I'm new to Spark, so please don't mind if I'm asking 
> something unrelated.
> I increased spark.executor.memory but had no luck. Env: Azure HDInsight 
> Spark 2.4 on Azure storage. SQL: read and join some data and finally write 
> the result to a Hive metastore.
> The spark.sql ends with the code below: 
> .write.mode("overwrite").saveAsTable("default.mikemiketable")
> Application behavior: Within the first 15 mins, it loads and completes most 
> tasks (199/200); then only 1 executor process is left alive, continually 
> doing shuffle reads / writes. Because only 1 executor is left, we need to 
> wait 3 hours until this application finishes. 
> [!https://i.stack.imgur.com/6hqvh.png!|https://i.stack.imgur.com/6hqvh.png]
> Only 1 executor left alive: 
> [!https://i.stack.imgur.com/55162.png!|https://i.stack.imgur.com/55162.png]
> Not sure what the executor is doing: 
> [!https://i.stack.imgur.com/TwhuX.png!|https://i.stack.imgur.com/TwhuX.png]
> From time to time, we can tell the shuffle read increased: 
> [!https://i.stack.imgur.com/WhF9A.png!|https://i.stack.imgur.com/WhF9A.png]
> Therefore I increased spark.executor.memory to 20g, but nothing changed. 
> From Ambari and YARN I can tell the cluster has many resources left. 
> [!https://i.stack.imgur.com/pngQA.png!|https://i.stack.imgur.com/pngQA.png]
> Release of almost all executors: 
> [!https://i.stack.imgur.com/pA134.png!|https://i.stack.imgur.com/pA134.png]
> Any guidance is greatly appreciated.






[jira] [Commented] (SPARK-27264) spark sql released all executor but the job is not done

2019-03-24 Thread Mike Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16800357#comment-16800357
 ] 

Mike Chan commented on SPARK-27264:
---

[~ajithshetty] Would you mind providing some references for the relevant 
snippet in Spark? Thank you.

 

> spark sql released all executor but the job is not done
> ---
>
> Key: SPARK-27264
> URL: https://issues.apache.org/jira/browse/SPARK-27264
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Azure HDinsight spark 2.4 on Azure storage SQL: Read and 
> Join some data and finally write result to a Hive metastore; query executed 
> on jupyterhub; while the pre-migration cluster is a jupyter (non-hub)
>Reporter: Mike Chan
>Priority: Major
>
> I have a Spark SQL job that used to execute in < 10 mins but now runs for 3 
> hours after a cluster migration, and I need to deep dive into what it's 
> actually doing. I'm new to Spark, so please don't mind if I'm asking 
> something unrelated.
> I increased spark.executor.memory but had no luck. Env: Azure HDInsight 
> Spark 2.4 on Azure storage. SQL: read and join some data and finally write 
> the result to a Hive metastore.
> The spark.sql ends with the code below: 
> .write.mode("overwrite").saveAsTable("default.mikemiketable")
> Application behavior: Within the first 15 mins, it loads and completes most 
> tasks (199/200); then only 1 executor process is left alive, continually 
> doing shuffle reads / writes. Because only 1 executor is left, we need to 
> wait 3 hours until this application finishes. 
> [!https://i.stack.imgur.com/6hqvh.png!|https://i.stack.imgur.com/6hqvh.png]
> Only 1 executor left alive: 
> [!https://i.stack.imgur.com/55162.png!|https://i.stack.imgur.com/55162.png]
> Not sure what the executor is doing: 
> [!https://i.stack.imgur.com/TwhuX.png!|https://i.stack.imgur.com/TwhuX.png]
> From time to time, we can tell the shuffle read increased: 
> [!https://i.stack.imgur.com/WhF9A.png!|https://i.stack.imgur.com/WhF9A.png]
> Therefore I increased spark.executor.memory to 20g, but nothing changed. 
> From Ambari and YARN I can tell the cluster has many resources left. 
> [!https://i.stack.imgur.com/pngQA.png!|https://i.stack.imgur.com/pngQA.png]
> Release of almost all executors: 
> [!https://i.stack.imgur.com/pA134.png!|https://i.stack.imgur.com/pA134.png]
> Any guidance is greatly appreciated.






[jira] [Updated] (SPARK-27264) spark sql released all executor but the job is not done

2019-03-24 Thread Mike Chan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Chan updated SPARK-27264:
--
Environment: Azure HDinsight spark 2.4 on Azure storage SQL: Read and Join 
some data and finally write result to a Hive metastore; query executed on 
jupyterhub; while the pre-migration cluster is a jupyter (non-hub)  (was: Azure 
HDinsight spark 2.4 on Azure storage SQL: Read and Join some data and finally 
write result to a Hive metastore)

> spark sql released all executor but the job is not done
> ---
>
> Key: SPARK-27264
> URL: https://issues.apache.org/jira/browse/SPARK-27264
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Azure HDinsight spark 2.4 on Azure storage SQL: Read and 
> Join some data and finally write result to a Hive metastore; query executed 
> on jupyterhub; while the pre-migration cluster is a jupyter (non-hub)
>Reporter: Mike Chan
>Priority: Major
>
> I have a Spark SQL job that used to execute in < 10 mins but now runs for 3 
> hours after a cluster migration, and I need to deep dive into what it's 
> actually doing. I'm new to Spark, so please don't mind if I'm asking 
> something unrelated.
> I increased spark.executor.memory but had no luck. Env: Azure HDInsight 
> Spark 2.4 on Azure storage. SQL: read and join some data and finally write 
> the result to a Hive metastore.
> The spark.sql ends with the code below: 
> .write.mode("overwrite").saveAsTable("default.mikemiketable")
> Application behavior: Within the first 15 mins, it loads and completes most 
> tasks (199/200); then only 1 executor process is left alive, continually 
> doing shuffle reads / writes. Because only 1 executor is left, we need to 
> wait 3 hours until this application finishes. 
> [!https://i.stack.imgur.com/6hqvh.png!|https://i.stack.imgur.com/6hqvh.png]
> Only 1 executor left alive: 
> [!https://i.stack.imgur.com/55162.png!|https://i.stack.imgur.com/55162.png]
> Not sure what the executor is doing: 
> [!https://i.stack.imgur.com/TwhuX.png!|https://i.stack.imgur.com/TwhuX.png]
> From time to time, we can tell the shuffle read increased: 
> [!https://i.stack.imgur.com/WhF9A.png!|https://i.stack.imgur.com/WhF9A.png]
> Therefore I increased spark.executor.memory to 20g, but nothing changed. 
> From Ambari and YARN I can tell the cluster has many resources left. 
> [!https://i.stack.imgur.com/pngQA.png!|https://i.stack.imgur.com/pngQA.png]
> Release of almost all executors: 
> [!https://i.stack.imgur.com/pA134.png!|https://i.stack.imgur.com/pA134.png]
> Any guidance is greatly appreciated.






[jira] [Created] (SPARK-27264) spark sql released all executor but the job is not done

2019-03-24 Thread Mike Chan (JIRA)
Mike Chan created SPARK-27264:
-

 Summary: spark sql released all executor but the job is not done
 Key: SPARK-27264
 URL: https://issues.apache.org/jira/browse/SPARK-27264
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 2.4.0
 Environment: Azure HDinsight spark 2.4 on Azure storage SQL: Read and 
Join some data and finally write result to a Hive metastore
Reporter: Mike Chan


I have a Spark SQL job that used to execute in < 10 mins but now runs for 3 
hours after a cluster migration, and I need to deep dive into what it's 
actually doing. I'm new to Spark, so please don't mind if I'm asking something 
unrelated.

I increased spark.executor.memory but had no luck. Env: Azure HDInsight Spark 
2.4 on Azure storage. SQL: read and join some data and finally write the 
result to a Hive metastore.

The spark.sql ends with the code below: 
.write.mode("overwrite").saveAsTable("default.mikemiketable")

Application behavior: Within the first 15 mins, it loads and completes most 
tasks (199/200); then only 1 executor process is left alive, continually doing 
shuffle reads / writes. Because only 1 executor is left, we need to wait 3 
hours until this application finishes. 
[!https://i.stack.imgur.com/6hqvh.png!|https://i.stack.imgur.com/6hqvh.png]

Only 1 executor left alive: 
[!https://i.stack.imgur.com/55162.png!|https://i.stack.imgur.com/55162.png]

Not sure what the executor is doing: 
[!https://i.stack.imgur.com/TwhuX.png!|https://i.stack.imgur.com/TwhuX.png]

From time to time, we can tell the shuffle read increased: 
[!https://i.stack.imgur.com/WhF9A.png!|https://i.stack.imgur.com/WhF9A.png]

Therefore I increased spark.executor.memory to 20g, but nothing changed. 
From Ambari and YARN I can tell the cluster has many resources left. 
[!https://i.stack.imgur.com/pngQA.png!|https://i.stack.imgur.com/pngQA.png]

Release of almost all executors: 
[!https://i.stack.imgur.com/pA134.png!|https://i.stack.imgur.com/pA134.png]

Any guidance is greatly appreciated.
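The 199/200 pattern matches the default spark.sql.shuffle.partitions of 200, which often means one shuffle partition holds most of the data (key skew). A minimal sketch, with a hypothetical dataframe df and join key column, of how one might check for that before the write:

{code}
from pyspark.sql import functions as F

# Sketch: count rows per join key; one dominant key usually explains a single
# long-running task. df and "join_key" are hypothetical placeholders.
(df.groupBy("join_key")
   .count()
   .orderBy(F.desc("count"))
   .show(20, truncate=False))

# Optionally raise shuffle parallelism from its default of 200 (assumption:
# this helps only when the data is not skewed onto a single key).
spark.conf.set("spark.sql.shuffle.partitions", 400)
{code}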


