[jira] [Commented] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825798#comment-16825798 ]

Mike Chan commented on SPARK-13510:
-----------------------------------

Thanks man

> Shuffle may throw FetchFailedException: Direct buffer memory
> -------------------------------------------------------------
>
>                 Key: SPARK-13510
>                 URL: https://issues.apache.org/jira/browse/SPARK-13510
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Hong Shen
>            Priority: Major
>         Attachments: spark-13510.diff
>
> In our cluster, when I tested spark-1.6.0 with a SQL query, it threw an exception and failed.
> {code}
> 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request for 1 blocks (915.4 MB) from 10.196.134.220:7337
> 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch from 10.196.134.220:7337 (executor id 122)
> 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 to /10.196.134.220:7337
> 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in connection from /10.196.134.220:7337
> java.lang.OutOfMemoryError: Direct buffer memory
>         at java.nio.Bits.reserveMemory(Bits.java:658)
>         at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
>         at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
>         at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645)
>         at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
>         at io.netty.buffer.PoolArena.allocate(PoolArena.java:212)
>         at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
>         at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
>         at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
>         at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
>         at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
>         at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
>         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>         at java.lang.Thread.run(Thread.java:744)
> 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from /10.196.134.220:7337 is closed
> 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block shuffle_3_81_2, and will not retry (0 retries)
> {code}
> The reason is that when shuffling a big block (like 1 GB), the task will allocate the same amount of memory, so it easily throws "FetchFailedException: Direct buffer memory".
> If I add -Dio.netty.noUnsafe=true to spark.executor.extraJavaOptions, it throws
> {code}
> java.lang.OutOfMemoryError: Java heap space
>         at io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607)
>         at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237)
>         at io.netty.buffer.PoolArena.allocate(PoolArena.java:215)
>         at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
> {code}
>
> In MapReduce shuffle, the framework first checks whether the block can be cached in memory, but Spark doesn't.
> If the block is larger than what we can cache in memory, we should write it to disk.
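A note for readers hitting the same direct-buffer OOM on large shuffle fetches: later Spark releases (2.2+) expose spark.maxRemoteBlockSizeFetchToMem, which streams remote blocks above a threshold to disk instead of materializing them in memory, essentially the write-to-disk behavior proposed above. A minimal PySpark sketch; the 200m threshold is an illustrative choice, not a tested recommendation.

{code}
# Hedged sketch: stream large remote shuffle blocks to disk instead of
# buffering them fully in (direct) memory. The config exists in Spark 2.2+.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-fetch-to-disk")
    # Blocks bigger than this are fetched to disk rather than memory.
    # 200m is illustrative; the value must stay below 2 GB.
    .config("spark.maxRemoteBlockSizeFetchToMem", "200m")
    .getOrCreate()
)
{code}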
[jira] [Commented] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825721#comment-16825721 ]

Mike Chan commented on SPARK-13510:
-----------------------------------

Hi [~belvey], I'm having a similar issue, and our "spark.maxRemoteBlockSizeFetchToMem" is at 188. From the forums I can tell this parameter should be set below 2GB. How do you set your parameter? Should it be "2g" or 2 * 1024 * 1024 * 1024 = 2147483648? I'm on Spark 2.3 on Azure.
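A hedged note on the question above: Spark parses size properties as byte counts, with plain numbers treated as bytes and suffixed strings like "2g" also accepted, so the two forms below should be equivalent. The values are illustrative, and the setting must stay below 2 GB because an in-memory block is backed by an int-indexed buffer.

{code}
# Hedged sketch: two equivalent ways to express the same size.
from pyspark import SparkConf

conf = SparkConf()
# Plain number: interpreted as bytes (here 1 GB).
conf.set("spark.maxRemoteBlockSizeFetchToMem", str(1024 * 1024 * 1024))
# Suffixed form: the same 1 GB, easier to read.
conf.set("spark.maxRemoteBlockSizeFetchToMem", "1g")
{code}

Note that 2 * 1024 * 1024 * 1024 = 2147483648 is exactly Int.MaxValue + 1, so "2g" itself would already be out of range; pick something strictly smaller.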
[jira] [Commented] (SPARK-27505) autoBroadcastJoinThreshold including bigger table
[ https://issues.apache.org/jira/browse/SPARK-27505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823637#comment-16823637 ]

Mike Chan commented on SPARK-27505:
-----------------------------------

Would you mind sharing any info on a self-reproducer? I tried to google it myself but nothing came through. Thank you.

> autoBroadcastJoinThreshold including bigger table
> -------------------------------------------------
>
>                 Key: SPARK-27505
>                 URL: https://issues.apache.org/jira/browse/SPARK-27505
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1
>         Environment: Hive table with Spark 2.3.1 on Azure, using Azure storage as storage layer
>            Reporter: Mike Chan
>            Priority: Major
>         Attachments: explain_plan.txt
>
> I'm on a case where, when a certain table is exposed to a broadcast join, the query eventually fails with a remote block error.
>
> First, we set spark.sql.autoBroadcastJoinThreshold to 10MB, namely 10485760.
> [screenshot: autoBroadcastJoinThreshold set to 10485760]
>
> Then we proceed to run the query. In the SQL plan, we found that one table that is 25MB in size is broadcast as well.
> [screenshot: SQL plan showing the 25MB table being broadcast]
>
> Also, in desc extended the table is 24452111 bytes. It is a Hive table. We always run into an error when this table is broadcast. Below is a sample error:
>
> Caused by: java.io.IOException: org.apache.spark.SparkException: corrupt remote block broadcast_477_piece0 of broadcast_477: 298778625 != -992055931
>         at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1350)
>         at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
>         at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>         at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
>         at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>
> The physical plan is also attached if you're interested. One thing to note: if I turn down autoBroadcastJoinThreshold to 5MB, this query executes successfully and default.product is NOT broadcast.
> However, when I change to another query that queries even fewer columns than the previous one, even at 5MB this table still gets broadcast and fails with the same error. I even changed it to 1MB and still the same.
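One thing worth checking here (hedged, since the report doesn't show catalog statistics): the planner estimates table sizes from catalog stats rather than raw file size, so stale Hive statistics can make a 25MB table appear to fit under the threshold. A small PySpark sketch; default.product is the table named in the description, everything else is illustrative.

{code}
# Hedged sketch: inspect/adjust the broadcast threshold and refresh stats.
spark.conf.get("spark.sql.autoBroadcastJoinThreshold")            # current value, in bytes
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760)  # 10 MB

# Workaround while the bad broadcast is being debugged: disable it entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# Refresh Hive statistics so the planner's size estimates match reality.
spark.sql("ANALYZE TABLE default.product COMPUTE STATISTICS")
{code}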
[jira] [Updated] (SPARK-27505) autoBroadcastJoinThreshold including bigger table
[ https://issues.apache.org/jira/browse/SPARK-27505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Chan updated SPARK-27505:
------------------------------
    Issue Type: Bug  (was: Question)
[jira] [Commented] (SPARK-27505) autoBroadcastJoinThreshold including bigger table
[ https://issues.apache.org/jira/browse/SPARK-27505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820951#comment-16820951 ]

Mike Chan commented on SPARK-27505:
-----------------------------------

Table desc extended result: Statistics | 24452111 bytes
[jira] [Updated] (SPARK-27505) autoBroadcastJoinThreshold including bigger table
[ https://issues.apache.org/jira/browse/SPARK-27505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Chan updated SPARK-27505:
------------------------------
    Attachment: explain_plan.txt
[jira] [Created] (SPARK-27505) autoBroadcastJoinThreshold including bigger table
Mike Chan created SPARK-27505:
---------------------------------

             Summary: autoBroadcastJoinThreshold including bigger table
                 Key: SPARK-27505
                 URL: https://issues.apache.org/jira/browse/SPARK-27505
             Project: Spark
          Issue Type: Question
          Components: PySpark
    Affects Versions: 2.3.1
         Environment: Hive table with Spark 2.3.1 on Azure, using Azure storage as storage layer
            Reporter: Mike Chan
[jira] [Commented] (SPARK-25422) flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated (encryption = on) (with replication as stream)
[ https://issues.apache.org/jira/browse/SPARK-25422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820707#comment-16820707 ]

Mike Chan commented on SPARK-25422:
-----------------------------------

Could this problem potentially hit Spark 2.3.1 as well? I have a new cluster at this version and always hit a corrupt remote block when one specific table is involved.

> flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated (encryption = on) (with replication as stream)
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-25422
>                 URL: https://issues.apache.org/jira/browse/SPARK-25422
>             Project: Spark
>          Issue Type: Test
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Imran Rashid
>            Priority: Major
>             Fix For: 2.4.0
>
> stacktrace
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7, localhost, executor 1): java.io.IOException: org.apache.spark.SparkException: corrupt remote block broadcast_0_piece0 of broadcast_0: 1651574976 != 1165629262
>         at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1320)
>         at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
>         at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>         at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
>         at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>         at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:84)
>         at org.apache.spark.scheduler.Task.run(Task.scala:121)
>         at org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:367)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1347)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:373)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: corrupt remote block broadcast_0_piece0 of broadcast_0: 1651574976 != 1165629262
>         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:167)
>         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
>         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
>         at scala.collection.immutable.List.foreach(List.scala:392)
>         at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:151)
>         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:231)
>         at scala.Option.getOrElse(Option.scala:121)
>         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
>         at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1313)
>         ... 13 more
> {code}
[jira] [Resolved] (SPARK-27264) spark sql released all executor but the job is not done
[ https://issues.apache.org/jira/browse/SPARK-27264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Chan resolved SPARK-27264.
-------------------------------
    Resolution: Invalid

Before making a further attempt, I want to ensure broadcast join is enabled on the cluster. Thank you.

> spark sql released all executor but the job is not done
> --------------------------------------------------------
>
>                 Key: SPARK-27264
>                 URL: https://issues.apache.org/jira/browse/SPARK-27264
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 2.4.0
>         Environment: Azure HDInsight Spark 2.4 on Azure storage; SQL: read and join some data and finally write the result to a Hive metastore; query executed on JupyterHub, while the pre-migration cluster used plain Jupyter (non-hub)
>            Reporter: Mike Chan
>            Priority: Major
>
> I have a Spark SQL job that used to execute in under 10 minutes but now runs for 3 hours after a cluster migration, and I need to deep dive into what it's actually doing. I'm new to Spark, so please don't mind if I'm asking something unrelated. I increased spark.executor.memory but no luck.
> Env: Azure HDInsight Spark 2.4 on Azure storage. SQL: read and join some data and finally write the result to a Hive metastore.
> The spark.sql ends with the code below:
> .write.mode("overwrite").saveAsTable("default.mikemiketable")
> Application behavior: within the first 15 minutes, it loads and completes most tasks (199/200); it then leaves only 1 executor process alive, which continually shuffle-reads/writes data. Because only 1 executor is left, we need to wait 3 hours until this application finishes.
> [!https://i.stack.imgur.com/6hqvh.png!|https://i.stack.imgur.com/6hqvh.png]
> Left only 1 executor alive:
> [!https://i.stack.imgur.com/55162.png!|https://i.stack.imgur.com/55162.png]
> Not sure what the executor is doing:
> [!https://i.stack.imgur.com/TwhuX.png!|https://i.stack.imgur.com/TwhuX.png]
> From time to time, we can tell the shuffle read increased:
> [!https://i.stack.imgur.com/WhF9A.png!|https://i.stack.imgur.com/WhF9A.png]
> Therefore I increased spark.executor.memory to 20g, but nothing changed. From Ambari and YARN I can tell the cluster has many resources left.
> [!https://i.stack.imgur.com/pngQA.png!|https://i.stack.imgur.com/pngQA.png]
> Release of almost all executors:
> [!https://i.stack.imgur.com/pA134.png!|https://i.stack.imgur.com/pA134.png]
> Any guidance is greatly appreciated.
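A hedged diagnostic for the 199/200 pattern described above: 200 is the default value of spark.sql.shuffle.partitions, and 199 fast tasks plus one long-running task usually points at a single skewed shuffle partition rather than a memory shortage, which would also explain why raising spark.executor.memory changed nothing. A PySpark sketch; df and join_key are illustrative names.

{code}
# Hedged sketch: confirm the partition count and look for a dominant key.
from pyspark.sql import functions as F

spark.conf.get("spark.sql.shuffle.partitions")  # typically '200'

# If one key dwarfs the rest, its partition does almost all the work:
df.groupBy("join_key").count().orderBy(F.desc("count")).show(10)
{code}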
[jira] [Commented] (SPARK-27264) spark sql released all executor but the job is not done
[ https://issues.apache.org/jira/browse/SPARK-27264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801437#comment-16801437 ]

Mike Chan commented on SPARK-27264:
-----------------------------------

I repartitioned the data before the transformation and still found the performance unreasonably slow. I'll probably troubleshoot why broadcast join is disabled before making another attempt. Thank you.
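If plain repartitioning doesn't help, the hot key itself may be the bottleneck: every row with that key still lands in one partition. Salting spreads a skewed key across many partitions. A hedged PySpark sketch of the idea, where df, join_key, and value are illustrative names; the two-stage aggregation only works for decomposable aggregates like sum.

{code}
# Hedged sketch: salt a skewed key so it spreads over many shuffle partitions.
from pyspark.sql import functions as F

SALT_BUCKETS = 32
salted = df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

result = (
    salted.groupBy("join_key", "salt")            # stage 1: partial sums per salt
          .agg(F.sum("value").alias("partial"))
          .groupBy("join_key")                    # stage 2: combine the partials
          .agg(F.sum("partial").alias("total"))
)
result.write.mode("overwrite").saveAsTable("default.mikemiketable")
{code}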
[jira] [Commented] (SPARK-27264) spark sql released all executor but the job is not done
[ https://issues.apache.org/jira/browse/SPARK-27264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16800357#comment-16800357 ]

Mike Chan commented on SPARK-27264:
-----------------------------------

[~ajithshetty] Would you mind providing some references for the snippet in Spark? Thank you.
[jira] [Updated] (SPARK-27264) spark sql released all executor but the job is not done
[ https://issues.apache.org/jira/browse/SPARK-27264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Chan updated SPARK-27264:
------------------------------
    Environment: Azure HDinsight spark 2.4 on Azure storage SQL: Read and Join some data and finally write result to a Hive metastore; query executed on jupyterhub; while the pre-migration cluster is a jupyter (non-hub)  (was: Azure HDinsight spark 2.4 on Azure storage SQL: Read and Join some data and finally write result to a Hive metastore)
[jira] [Created] (SPARK-27264) spark sql released all executor but the job is not done
Mike Chan created SPARK-27264:
---------------------------------

             Summary: spark sql released all executor but the job is not done
                 Key: SPARK-27264
                 URL: https://issues.apache.org/jira/browse/SPARK-27264
             Project: Spark
          Issue Type: Question
          Components: SQL
    Affects Versions: 2.4.0
         Environment: Azure HDinsight spark 2.4 on Azure storage SQL: Read and Join some data and finally write result to a Hive metastore
            Reporter: Mike Chan