[jira] [Updated] (SPARK-30002) Reuse SparkSession in pyspark via Gateway
[ https://issues.apache.org/jira/browse/SPARK-30002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kaushal Prajapati updated SPARK-30002: -- Description: In PySpark, we create the SparkContext from user-supplied Spark configurations or the defaults, and it is launched internally through the py4j gateway. If I have launched the py4j gateway from another application, then to communicate with that same gateway I have to set the following configuration: {code:java} export PYSPARK_GATEWAY_PORT=12345 export PYSPARK_GATEWAY_SECRET=*** {code} But when PySpark then creates its own SparkContext after the communication has been set up, it does not check whether a SparkContext is already available in the same JVM. Current code snippet: {code:java} def _initialize_context(self, jconf): """ Initialize SparkContext in function to allow subclass specific initialization """ return self._jvm.JavaSparkContext(jconf){code} After changing it to the following, it works fine for me. {code:java} def _initialize_context(self, jconf): """ Initialize SparkContext in function to allow subclass specific initialization """ return self._jvm.JavaSparkContext(self._jvm.org.apache.spark.SparkContext.getOrCreate(jconf)){code} This looks like a good opportunity for improvement. was: In PySpark, we create the SparkContext from user-supplied Spark configurations or the defaults, and it is launched internally through the py4j gateway. If I have launched the py4j gateway from another application, then to communicate with that same gateway I have to set the following configuration: {code:java} export PYSPARK_GATEWAY_PORT=12345 export PYSPARK_GATEWAY_SECRET=*** {code} But when PySpark then creates its own SparkContext after the communication has been set up, it does not check whether a SparkContext is already available in the same JVM.
Current code snippet: {code:java} def _initialize_context(self, jconf): """ Initialize SparkContext in function to allow subclass specific initialization """ return self._jvm.JavaSparkContext(jconf){code} If we change it to the following, it works fine for me. {code:java} def _initialize_context(self, jconf): """ Initialize SparkContext in function to allow subclass specific initialization """ return self._jvm.JavaSparkContext(self._jvm.org.apache.spark.SparkContext.getOrCreate(jconf)){code} This looks like a good opportunity for improvement. > Reuse SparkSession in pyspark via Gateway > - > > Key: SPARK-30002 > URL: https://issues.apache.org/jira/browse/SPARK-30002 > Project: Spark > Issue Type: Improvement > Components: PySpark > Affects Versions: 2.4.0, 2.4.4 > Reporter: Kaushal Prajapati > Priority: Minor
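The proposed fix leans on the JVM-side SparkContext.getOrCreate, which returns the already-active context instead of constructing a second one. As a rough plain-Python illustration of that pattern (SparkContextStub is a made-up stand-in for this sketch, not Spark code):

```python
# Minimal illustration of the getOrCreate pattern the proposed fix relies on.
# SparkContextStub stands in for the JVM-side org.apache.spark.SparkContext.

class SparkContextStub:
    _active = None  # the single active context, if any

    def __init__(self, conf):
        self.conf = conf

    @classmethod
    def get_or_create(cls, conf):
        # Reuse the active context when one exists; otherwise create it.
        if cls._active is None:
            cls._active = cls(conf)
        return cls._active

first = SparkContextStub.get_or_create({"spark.app.name": "demo"})
second = SparkContextStub.get_or_create({"spark.app.name": "ignored"})
assert first is second  # the second call reuses the existing context
```

This is why routing the constructor through getOrCreate makes PySpark safe to attach to a JVM that already hosts a SparkContext: the second construction attempt becomes a lookup.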
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30002) Reuse SparkSession in pyspark via Gateway
Kaushal Prajapati created SPARK-30002: - Summary: Reuse SparkSession in pyspark via Gateway Key: SPARK-30002 URL: https://issues.apache.org/jira/browse/SPARK-30002 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.4.4, 2.4.0 Reporter: Kaushal Prajapati
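The gateway handoff described in this issue works through two environment variables. As a small sketch (the port 12345 and the secret value are placeholders from the ticket text, not real credentials):

```python
import os

# Sketch of how an external application hands its py4j gateway to PySpark:
# export the port and auth secret before PySpark initializes.
os.environ["PYSPARK_GATEWAY_PORT"] = "12345"
os.environ["PYSPARK_GATEWAY_SECRET"] = "replace-with-the-gateway-secret"

# In 2.4-era PySpark, pyspark.java_gateway.launch_gateway() checks for
# PYSPARK_GATEWAY_PORT and, when it is set, connects to that existing
# gateway instead of spawning a new JVM.
port = int(os.environ["PYSPARK_GATEWAY_PORT"])
secret = os.environ["PYSPARK_GATEWAY_SECRET"]
assert 0 < port < 65536 and secret
```

Note that this only reuses the py4j connection; the change proposed in this ticket is what makes the Python side also reuse the SparkContext living in that JVM.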
[jira] [Reopened] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kaushal Prajapati reopened SPARK-22458: --- It works with the same configs on Spark 2.1, so why not on Spark 2.2? Why does Spark 2.2 demand extra memory, and how much extra? > OutOfDirectMemoryError with Spark 2.2 > - > > Key: SPARK-22458 > URL: https://issues.apache.org/jira/browse/SPARK-22458 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL, YARN > Affects Versions: 2.2.0 > Reporter: Kaushal Prajapati > Priority: Blocker > > We had been using Spark 2.1 for the last 6 months to successfully run multiple Spark jobs, each about 15 hours long over 50+ TB of source data, with the configurations below. > {quote}spark.master yarn > spark.driver.cores 10 > spark.driver.maxResultSize 5g > spark.driver.memory 20g > spark.executor.cores 5 > spark.executor.extraJavaOptions *-XX:+UseG1GC -Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 *-XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 > spark.driver.extraJavaOptions *-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 > spark.executor.instances 30 > spark.executor.memory 30g > *spark.kryoserializer.buffer.max 512m* > spark.network.timeout 12000s > spark.serializer org.apache.spark.serializer.KryoSerializer > spark.shuffle.io.preferDirectBufs false > spark.sql.catalogImplementation hive > spark.sql.shuffle.partitions 5000 > spark.yarn.driver.memoryOverhead 1536 > spark.yarn.executor.memoryOverhead 4096 > spark.core.connection.ack.wait.timeout 600s > spark.scheduler.maxRegisteredResourcesWaitingTime 15s > spark.sql.hive.filesourcePartitionFileCacheSize 524288000 > spark.dynamicAllocation.executorIdleTimeout 3s > spark.dynamicAllocation.enabled true > spark.hadoop.yarn.timeline-service.enabled false > spark.shuffle.service.enabled true >
spark.yarn.am.extraJavaOptions *-Dhdp.version=2.5.3.0-37 -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*{quote} > Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to pick up some fixes in the latest version, but we started facing a DirectBuffer OutOfMemoryError and executors exceeding their memoryOverhead limits. We tweaked multiple properties to fix this, but the issue still persists. Relevant information is shared below; please let me know if any other details are required. > > Snapshot of the DirectMemory error stack trace: > {code:java} > 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), shuffleId=7, mapId=141, reduceId=3372, message= > org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) of direct memory (used: 1073699840, max: 1073741824) > at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source) > at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) > at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeSta
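A quick sanity check of the numbers in the stack trace above (plain arithmetic from the error message and the quoted configuration, nothing assumed beyond them):

```python
# Arithmetic from the FetchFailedException: netty's direct-memory pool is
# essentially full, so even a 64 KiB fetch buffer cannot be allocated.
max_direct = 1_073_741_824   # "max" in the error message: exactly 1 GiB
used       = 1_073_699_840   # "used" in the error message
requested  = 65_536          # bytes the shuffle fetch tried to allocate

headroom = max_direct - used
assert max_direct == 1024 ** 3   # the ceiling is exactly 1 GiB
assert headroom == 41_984        # only ~41 KiB free
assert headroom < requested      # hence the allocation failure

# Container-side budget from the reported configuration: direct buffers
# (-XX:MaxDirectMemorySize=2048m) must fit inside the YARN memory overhead
# (spark.yarn.executor.memoryOverhead=4096, in MiB) on top of the 30g heap.
overhead_mib   = 4096
direct_cap_mib = 2048
assert direct_cap_mib <= overhead_mib
```

The 1 GiB ceiling in the message is notably smaller than the 2048m MaxDirectMemorySize configured for executors, which is part of what the reporter is asking about.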
[jira] [Commented] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240537#comment-16240537 ] Kaushal Prajapati commented on SPARK-22458: --- Yes, but we have tried multiple times; the same code works with Spark 2.1 but not with Spark 2.2. > OutOfDirectMemoryError with Spark 2.2 > - > > Key: SPARK-22458 > URL: https://issues.apache.org/jira/browse/SPARK-22458 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL, YARN > Affects Versions: 2.2.0 > Reporter: Kaushal Prajapati > Priority: Blocker
[jira] [Updated] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kaushal Prajapati updated SPARK-22458: -- Description: We had been using Spark 2.1 for the last 6 months to successfully run multiple Spark jobs, each about 15 hours long over 50+ TB of source data, with the configurations below. {quote}spark.master yarn spark.driver.cores 10 spark.driver.maxResultSize 5g spark.driver.memory 20g spark.executor.cores 5 spark.executor.extraJavaOptions *-XX:+UseG1GC -Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 *-XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 spark.driver.extraJavaOptions *-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 spark.executor.instances 30 spark.executor.memory 30g *spark.kryoserializer.buffer.max 512m* spark.network.timeout 12000s spark.serializer org.apache.spark.serializer.KryoSerializer spark.shuffle.io.preferDirectBufs false spark.sql.catalogImplementation hive spark.sql.shuffle.partitions 5000 spark.yarn.driver.memoryOverhead 1536 spark.yarn.executor.memoryOverhead 4096 spark.core.connection.ack.wait.timeout 600s spark.scheduler.maxRegisteredResourcesWaitingTime 15s spark.sql.hive.filesourcePartitionFileCacheSize 524288000 spark.dynamicAllocation.executorIdleTimeout 3s spark.dynamicAllocation.enabled true spark.hadoop.yarn.timeline-service.enabled false spark.shuffle.service.enabled true spark.yarn.am.extraJavaOptions *-Dhdp.version=2.5.3.0-37 -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*{quote} Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to pick up some fixes in the latest version, but we started facing a DirectBuffer OutOfMemoryError and executors exceeding their memoryOverhead limits. We tweaked multiple properties to fix this, but the issue still persists.
Relevant information is shared below; please let me know if any other details are required. Snapshot of the DirectMemory error stack trace: {code:java} 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), shuffleId=7, mapId=141, reduceId=3372, message= org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) of direct memory (used: 1073699840, max: 1073741824) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(Unsafe
[jira] [Updated] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kaushal Prajapati updated SPARK-22458: --
[jira] [Updated] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kaushal Prajapati updated SPARK-22458: --
[jira] [Updated] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kaushal Prajapati updated SPARK-22458: --
[jira] [Updated] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kaushal Prajapati updated SPARK-22458: -- Description: We were using Spark 2.1 from last 6 months to execute multiple spark jobs that is running 15 hour long for 50+ TB of source data with below configurations successfully. {{spark.master yarn spark.driver.cores10 spark.driver.maxResultSize5g spark.driver.memory 20g spark.executor.cores 5 spark.executor.extraJavaOptions -XX:+UseG1GC *-Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 *-XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 spark.driver.extraJavaOptions* -Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 spark.executor.instances 30 spark.executor.memory 30g *spark.kryoserializer.buffer.max 512m* spark.network.timeout 12000s spark.serializer org.apache.spark.serializer.KryoSerializer spark.shuffle.io.preferDirectBufs false spark.sql.catalogImplementation hive spark.sql.shuffle.partitions 5000 spark.yarn.driver.memoryOverhead 1536 spark.yarn.executor.memoryOverhead4096 spark.core.connection.ack.wait.timeout600s spark.scheduler.maxRegisteredResourcesWaitingTime 15s spark.sql.hive.filesourcePartitionFileCacheSize 524288000 spark.dynamicAllocation.executorIdleTimeout 3s spark.dynamicAllocation.enabled true spark.hadoop.yarn.timeline-service.enabledfalse spark.shuffle.service.enabled true spark.yarn.am.extraJavaOptions-Dhdp.version=2.5.3.0-37 * -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m* }} Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to get some fixes using latest version. But we started facing DirectBuffer outOfMemory error and exceeding memory limits for executor memoryOverhead issue. To fix that we started tweaking multiple properties but still issue persists. 
Relevant information is shared below; please let me know if any other details are required. Snapshot of the DirectMemory error stack trace:

{code:java}
10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), shuffleId=7, mapId=141, reduceId=3372, message=
org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) of direct memory (used: 1073699840, max: 1073741824)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166)
{code}
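The failure above is a Netty direct-memory exhaustion during shuffle fetch: each fetch allocates 65536-byte off-heap chunks, and allocation fails once the amount in use plus one more chunk would exceed the cap (here 1073741824 bytes, matching the 1024 MB set via -Dio.netty.maxDirectMemory). A minimal sketch of that arithmetic, in plain Java with no Spark or Netty involved (all names here are illustrative, not Spark internals):

```java
public class DirectChunkMath {
    static final int CHUNK = 65536;       // the 65536-byte allocation in the stack trace
    static final long MAX = 1073741824L;  // "max: 1073741824" = 1 GiB direct-memory cap

    // True when one more chunk still fits under the cap, i.e. the
    // opposite of the "failed to allocate" condition in the error.
    static boolean canAllocate(long used) {
        return used + CHUNK <= MAX;
    }

    public static void main(String[] args) {
        System.out.println(canAllocate(0L));           // plenty of headroom: true
        // "used: 1073699840" from the log: only ~41 KB left, so a 64 KiB chunk fails
        System.out.println(canAllocate(1073699840L));  // false
    }
}
```

Raising -Dio.netty.maxDirectMemory (and -XX:MaxDirectMemorySize) moves this cap; spark.shuffle.io.preferDirectBufs=false is reported to reduce, but not eliminate, Netty's direct-buffer usage.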
[jira] [Created] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2
Kaushal Prajapati created SPARK-22458: - Summary: OutOfDirectMemoryError with Spark 2.2 Key: SPARK-22458 URL: https://issues.apache.org/jira/browse/SPARK-22458 Project: Spark Issue Type: Bug Components: Shuffle, SQL, YARN Affects Versions: 2.2.0 Reporter: Kaushal Prajapati Priority: Blocker

We have been using Spark 2.1 for the last 6 months to run multiple Spark jobs, each running about 15 hours over 50+ TB of source data, successfully with the configuration below:

{code}
spark.master                                       yarn
spark.driver.cores                                 10
spark.driver.maxResultSize                         5g
spark.driver.memory                                20g
spark.executor.cores                               5
spark.executor.extraJavaOptions                    -XX:+UseG1GC -Dio.netty.maxDirectMemory=1024 -XX:MaxGCPauseMillis=6 -XX:MaxDirectMemorySize=2048m -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
spark.driver.extraJavaOptions                      -Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
spark.executor.instances                           30
spark.executor.memory                              30g
spark.kryoserializer.buffer.max                    512m
spark.network.timeout                              12000s
spark.serializer                                   org.apache.spark.serializer.KryoSerializer
spark.shuffle.io.preferDirectBufs                  false
spark.sql.catalogImplementation                    hive
spark.sql.shuffle.partitions                       5000
spark.yarn.driver.memoryOverhead                   1536
spark.yarn.executor.memoryOverhead                 4096
spark.core.connection.ack.wait.timeout             600s
spark.scheduler.maxRegisteredResourcesWaitingTime  15s
spark.sql.hive.filesourcePartitionFileCacheSize    524288000
spark.dynamicAllocation.executorIdleTimeout        3s
spark.dynamicAllocation.enabled                    true
spark.hadoop.yarn.timeline-service.enabled         false
spark.shuffle.service.enabled                      true
spark.yarn.am.extraJavaOptions                     -Dhdp.version=2.5.3.0-37 -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m
{code}

Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to pick up some fixes in the latest version. But we started facing a DirectBuffer OutOfMemoryError and "executor exceeding memory limits" (memoryOverhead) failures.
To fix that we have been tweaking multiple properties, but the issue persists. Relevant information is shared below; please let me know if any other details are required. Snapshot of the DirectMemory error stack trace:

{code:java}
10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), shuffleId=7, mapId=141, reduceId=3372, message=
org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) of direct memory (used: 1073699840, max: 1073741824)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166)
{code}
[jira] [Commented] (SPARK-22259) hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [5, 28, 21, 12]
[ https://issues.apache.org/jira/browse/SPARK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202052#comment-16202052 ] Kaushal Prajapati commented on SPARK-22259: --- It seems the file you are using is not a Parquet file. > hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file. > expected magic number at tail [80, 65, 82, 49] but found [5, 28, 21, 12] > -- > > Key: SPARK-22259 > URL: https://issues.apache.org/jira/browse/SPARK-22259 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 > Environment: spark1.6.1,YARN2.6.0 >Reporter: Liu Dinghua > > my codes with errors are as follow: > sqlContext.read.parquet("hdfs://HdfsHA/logrep/1/sspstatistic/") > java.io.IOException: Could not read footer: java.lang.RuntimeException: > hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file (too > small) > at > org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:247) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$27.apply(ParquetRelation.scala:786) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$27.apply(ParquetRelation.scala:775) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.RuntimeException: > hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file (too > small) > at > org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:412) > at > org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:237) > at > org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:233) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 3 more > 17/10/12 10:41:04 INFO scheduler.TaskSetManager: Starting task 1.1 in stage > 1.0 (TID 4, slave05, partition 1,PROCESS_LOCAL, 968186 bytes) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
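For context on the comment above: the "magic number at tail [80, 65, 82, 49]" in the error is the ASCII bytes of "PAR1", the 4-byte magic that every valid Parquet file must end with. A minimal sketch of the check the reader performs (this is a hypothetical helper for illustration, not Spark's or parquet-mr's actual code):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ParquetMagicCheck {
    // Parquet files end with the 4-byte magic "PAR1" ([80, 65, 82, 49]).
    static boolean hasParquetMagic(byte[] fileBytes) {
        if (fileBytes.length < 4) return false; // the "not a Parquet file (too small)" case
        byte[] tail = Arrays.copyOfRange(fileBytes, fileBytes.length - 4, fileBytes.length);
        return Arrays.equals(tail, "PAR1".getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        System.out.println(hasParquetMagic("header...PAR1".getBytes(StandardCharsets.US_ASCII))); // true
        System.out.println(hasParquetMagic(new byte[]{5, 28, 21, 12})); // false: the bytes from this report
    }
}
```

A tail of [5, 28, 21, 12], as reported, therefore means the `_metadata` file is simply not Parquet data.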
[jira] [Commented] (SPARK-19908) Direct buffer memory OOM should not cause stage retries.
[ https://issues.apache.org/jira/browse/SPARK-19908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095203#comment-16095203 ] Kaushal Prajapati commented on SPARK-19908: --- [~bojanbabic] This error occurs when your Kryo serialization buffer exceeds the default limit (64m). Increase the maximum buffer size with this property, setting it above the default, e.g. {noformat} spark.kryoserializer.buffer.max=512m {noformat} > Direct buffer memory OOM should not cause stage retries. > > > Key: SPARK-19908 > URL: https://issues.apache.org/jira/browse/SPARK-19908 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: Zhan Zhang >Priority: Minor > > Currently if there is java.lang.OutOfMemoryError: Direct buffer memory, the > exception will be changed to FetchFailedException, causing stage retries. > org.apache.spark.shuffle.FetchFailedException: Direct buffer memory > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:40) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:731) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:692) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:854) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:887) > at > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:658) > at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) > 
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) > at > io.netty.buffer.AbstractByteB
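The core complaint in SPARK-19908, visible in the "Caused by" section above, is the wrapping step: a fatal, non-retryable java.lang.OutOfMemoryError raised while fetching a shuffle block is surfaced as a FetchFailedException, which the scheduler treats as a retryable fetch failure. A minimal sketch of that wrapping pattern (FetchFailedLike is a hypothetical stand-in, not Spark's FetchFailedException):

```java
public class FetchWrapDemo {
    // Stand-in for the retryable exception type the scheduler reacts to.
    static class FetchFailedLike extends RuntimeException {
        FetchFailedLike(Throwable cause) { super("FetchFailed", cause); }
    }

    static void fetchBlock() {
        try {
            // Simulated allocation failure, like Netty's direct-buffer OOM above.
            throw new OutOfMemoryError("Direct buffer memory");
        } catch (Throwable t) {
            // The original cause is preserved, but the *type* changes,
            // so the scheduler sees a retryable fetch failure instead of a fatal OOM.
            throw new FetchFailedLike(t);
        }
    }

    public static void main(String[] args) {
        try {
            fetchBlock();
        } catch (FetchFailedLike e) {
            System.out.println(e.getCause().getMessage());
        }
    }
}
```

Because the OOM survives only as the cause, each stage retry re-runs the same allocation and typically fails the same way, which is why the report argues the error should not trigger retries.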
[jira] [Commented] (SPARK-21316) Dataset Union output is not consistent with the column sequence
[ https://issues.apache.org/jira/browse/SPARK-21316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16076337#comment-16076337 ] Kaushal Prajapati commented on SPARK-21316: --- [~dongjoon] It works when the column names are specified in alphabetical order. But what if my column names and schema are not in that order? In that case the output changes again. In such cases it also becomes too cumbersome to track the column sequence and its alphabetical order. If I rename 'name' to 'a' and 'age' to 'b', the above code works because the columns are then in alphabetical order. Moreover, when both my datasets have the same schema (Person.class), why should the column order even be considered while taking the union? In my understanding it should not be. {code:java} ds1.select("name","age").as(Encoders.bean(Person.class)).union(ds2).show(); {code} In the above snippet, I'm creating a dataset of rows via column selection and then converting it back to the Person.class schema, so the order should not matter. > Dataset Union output is not consistent with the column sequence > --- > > Key: SPARK-21316 > URL: https://issues.apache.org/jira/browse/SPARK-21316 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.1.0 >Reporter: Kaushal Prajapati > Labels: patch > Original Estimate: 168h > Remaining Estimate: 168h > > if i take union of 2 datasets with similar schema, the output should remain > same even if i change the sequence of columns while creating the dataset. > i am attaching the code snippet for details. 
> {code:java} > public class Person{ > public String name; > public String age; > public Person(String name, String age) { > this.name = name; > this.age = age; > } > public String getName() {return name;} > public void setName(String name) {this.name = name;} > public String getAge() {return age;} > public void setAge(String age) {this.age = age;} > } > {code} > {code:java} > public class Test { > public static void main(String arg[]) throws Exception { > SparkSession spark = SparkConnection.getSpark(); > List list1 = new ArrayList<>(); > list1.add(new Person("kaushal", "25")); > list1.add(new Person("aman", "26")); > List list2 = new ArrayList<>(); > list2.add(new Person("sapan", "25")); > list2.add(new Person("yati", "26")); > Dataset ds1 = spark.createDataset(list1, > Encoders.bean(Person.class)); > Dataset ds2 = spark.createDataset(list2, > Encoders.bean(Person.class)); > ds1.show(); > ds2.show(); > > ds1.select("name","age").as(Encoders.bean(Person.class)).union(ds2).show(); > } > } > {code} > output :- > {code:java} > +---+---+ > |age| name| > +---+---+ > | 25|kaushal| > | 26| aman| > +---+---+ > +---+-+ > |age| name| > +---+-+ > | 25|sapan| > | 26| yati| > +---+-+ > +---+-+ > | name| age| > +---+-+ > |kaushal| 25| > | aman| 26| > | 25|sapan| > | 26| yati| > +---+-+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
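The output above shows the reported behavior: Dataset.union resolves columns by position, not by name. The bean encoder orders columns alphabetically (age, name) while the select produces (name, age), so the union interleaves values into the wrong columns. A minimal illustration of union-by-position without Spark (the String[] row representation is a stand-in for a Dataset row):

```java
import java.util.ArrayList;
import java.util.List;

public class PositionalUnionDemo {
    // Rows are concatenated column-by-position; no reordering by column name happens.
    static List<String[]> unionByPosition(List<String[]> a, List<String[]> b) {
        List<String[]> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    public static void main(String[] args) {
        // ds1 columns: (name, age); ds2 columns: (age, name), as the bean encoder produced.
        List<String[]> ds1 = new ArrayList<>();
        ds1.add(new String[]{"kaushal", "25"});
        List<String[]> ds2 = new ArrayList<>();
        ds2.add(new String[]{"25", "sapan"});

        String[] mixed = unionByPosition(ds1, ds2).get(1);
        // Read under ds1's header (name, age): the age "25" has landed in the name column.
        System.out.println(mixed[0] + "," + mixed[1]); // 25,sapan
    }
}
```

On Spark 2.3 and later, Dataset.unionByName resolves columns by name and avoids this mismatch; on 2.1, reordering the select to match the encoder's column order is the workaround, as discussed above.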
[jira] [Created] (SPARK-21316) Dataset Union output is not consistent with the column sequence
Kaushal Prajapati created SPARK-21316: - Summary: Dataset Union output is not consistent with the column sequence Key: SPARK-21316 URL: https://issues.apache.org/jira/browse/SPARK-21316 Project: Spark Issue Type: Bug Components: Optimizer, SQL Affects Versions: 2.1.0 Reporter: Kaushal Prajapati Priority: Critical if i take union of 2 datasets with similar schema, the output should remain same even if i change the sequence of columns while creating the dataset. i am attaching the code snippet for details. {code:java} public class Person{ public String name; public String age; public Person(String name, String age) { this.name = name; this.age = age; } public String getName() {return name;} public void setName(String name) {this.name = name;} public String getAge() {return age;} public void setAge(String age) {this.age = age;} } {code} {code:java} public class Test { public static void main(String arg[]) throws Exception { SparkSession spark = SparkConnection.getSpark(); List list1 = new ArrayList<>(); list1.add(new Person("kaushal", "25")); list1.add(new Person("aman", "26")); List list2 = new ArrayList<>(); list2.add(new Person("sapan", "25")); list2.add(new Person("yati", "26")); Dataset ds1 = spark.createDataset(list1, Encoders.bean(Person.class)); Dataset ds2 = spark.createDataset(list2, Encoders.bean(Person.class)); ds1.show(); ds2.show(); ds1.select("name","age").as(Encoders.bean(Person.class)).union(ds2).show(); } } {code} output :- {code:java} +---+---+ |age| name| +---+---+ | 25|kaushal| | 26| aman| +---+---+ +---+-+ |age| name| +---+-+ | 25|sapan| | 26| yati| +---+-+ +---+-+ | name| age| +---+-+ |kaushal| 25| | aman| 26| | 25|sapan| | 26| yati| +---+-+ {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19908) Direct buffer memory OOM should not cause stage retries.
[ https://issues.apache.org/jira/browse/SPARK-19908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068106#comment-16068106 ] Kaushal Prajapati commented on SPARK-19908: --- [~zhanzhang] could you please share some example code that produces this error? > Direct buffer memory OOM should not cause stage retries. > > > Key: SPARK-19908 > URL: https://issues.apache.org/jira/browse/SPARK-19908 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: Zhan Zhang >Priority: Minor > > Currently if there is java.lang.OutOfMemoryError: Direct buffer memory, the > exception will be changed to FetchFailedException, causing stage retries. > org.apache.spark.shuffle.FetchFailedException: Direct buffer memory > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:40) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:731) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:692) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:854) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:887) > at > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:658) > at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) > at 
io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) > at > io.netty.buffer.AbstractByteBufAll
[jira] [Commented] (SPARK-19289) UnCache Dataset using Name
[ https://issues.apache.org/jira/browse/SPARK-19289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833983#comment-15833983 ] Kaushal Prajapati commented on SPARK-19289: --- [~srowen], yes, you are right that names are not necessarily unique. Still, if we could list all datasets the way we list RDDs with *sc.getPersistentRDDs*, it would be useful: if I maintain unique names for the Datasets in an application I can easily uncache one, and if I have a group of datasets sharing a name I can uncache them all with this feature. > UnCache Dataset using Name > -- > > Key: SPARK-19289 > URL: https://issues.apache.org/jira/browse/SPARK-19289 > Project: Spark > Issue Type: Wish > Components: Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Kaushal Prajapati >Priority: Minor > Labels: features > > We can Cache and Uncache any table using its name in Spark Sql. > {code} > df.createTempView("myTable") > sqlContext.cacheTable("myTable") > sqlContext.uncacheTable("myTable") > {code} > Likewise if it is possible to have some kind of uniqueness for names in > DataSets and an abstraction like the same that we have for tables. It would > be very useful > {code} > scala> val df = sc.range(1,1000).toDF > df: org.apache.spark.sql.DataFrame = [value: bigint] > scala> df.setName("MyDataset") > res0: df.type = MyDataset > scala> df.cache > res1: df.type = MyDataset > sqlContext.getDataSet("MyDataset") > sqlContext.uncacheDataSet("MyDataset") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19289) UnCache Dataset using Name
[ https://issues.apache.org/jira/browse/SPARK-19289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kaushal Prajapati updated SPARK-19289: -- External issue ID: Related to - https://issues.apache.org/jira/browse/SPARK-8480 (was: https://issues.apache.org/jira/browse/SPARK-8480) > UnCache Dataset using Name > -- > > Key: SPARK-19289 > URL: https://issues.apache.org/jira/browse/SPARK-19289 > Project: Spark > Issue Type: Wish > Components: Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Kaushal Prajapati >Priority: Minor > Labels: features > > We can Cache and Uncache any table using its name in Spark Sql. > {code} > df.createTempView("myTable") > sqlContext.cacheTable("myTable") > sqlContext.uncacheTable("myTable") > {code} > Likewise if it is possible to have some kind of uniqueness for names in > DataSets and an abstraction like the same that we have for tables. It would > be very useful > {code} > scala> val df = sc.range(1,1000).toDF > df: org.apache.spark.sql.DataFrame = [value: bigint] > scala> df.setName("MyDataset") > res0: df.type = MyDataset > scala> df.cache > res1: df.type = MyDataset > sqlContext.getDataSet("MyDataset") > sqlContext.uncacheDataSet("MyDataset") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19289) UnCache Dataset using Name
Kaushal Prajapati created SPARK-19289: - Summary: UnCache Dataset using Name Key: SPARK-19289 URL: https://issues.apache.org/jira/browse/SPARK-19289 Project: Spark Issue Type: Wish Components: Spark Core, SQL Affects Versions: 2.1.0 Reporter: Kaushal Prajapati Priority: Minor

We can cache and uncache any table using its name in Spark SQL:

{code}
df.createTempView("myTable")
sqlContext.cacheTable("myTable")
sqlContext.uncacheTable("myTable")
{code}

Likewise, it would be very useful to have some kind of unique naming for Datasets, with an abstraction similar to the one we have for tables:

{code}
scala> val df = sc.range(1,1000).toDF
df: org.apache.spark.sql.DataFrame = [value: bigint]

scala> df.setName("MyDataset")
res0: df.type = MyDataset

scala> df.cache
res1: df.type = MyDataset

sqlContext.getDataSet("MyDataset")
sqlContext.uncacheDataSet("MyDataset")
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
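The wish above amounts to a name-to-dataset registry, roughly what a hypothetical sqlContext.uncacheDataSet("name") would need internally. A minimal sketch of that idea in plain Java (everything here is illustrative: a List<String> stands in for a cached Dataset, and none of these names are Spark APIs):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NamedCacheDemo {
    // Registry mapping a user-assigned name to a cached dataset,
    // analogous to what sc.getPersistentRDDs provides for RDDs.
    static final Map<String, List<String>> cached = new HashMap<>();

    static void cacheAs(String name, List<String> dataset) {
        cached.put(name, dataset);
    }

    // Remove (i.e. "unpersist") the dataset registered under the given name.
    static boolean uncache(String name) {
        return cached.remove(name) != null;
    }

    public static void main(String[] args) {
        cacheAs("MyDataset", List.of("row1", "row2"));
        System.out.println(uncache("MyDataset")); // true: was cached, now removed
        System.out.println(uncache("MyDataset")); // false: already removed
    }
}
```

A real implementation would have to address the non-uniqueness concern raised in the comments, e.g. by mapping one name to a set of datasets and unpersisting them all.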
[jira] [Commented] (SPARK-8480) Add setName for Dataframe
[ https://issues.apache.org/jira/browse/SPARK-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827980#comment-15827980 ]

Kaushal Prajapati commented on SPARK-8480:
--

Yes [~emlyn], it's correct that the name is not unique, but it would help if we had some kind of uniqueness for names, with an abstraction similar to the one we have for tables:
{code}
df.createTempView("myTable")
sqlContext.cacheTable("myTable")
sqlContext.uncacheTable("myTable")
{code}
The same for Datasets would be very useful:
{code}
scala> val df = sc.range(1,1000).toDF
df: org.apache.spark.sql.DataFrame = [value: bigint]
scala> df.setName("MyDataset")
res0: df.type = MyDataset
scala> df.cache
res1: df.type = MyDataset
sqlContext.getDataSet("MyDataset")
sqlContext.uncacheDataSet("MyDataset")
{code}

> Add setName for Dataframe
> -
>
> Key: SPARK-8480
> URL: https://issues.apache.org/jira/browse/SPARK-8480
> Project: Spark
> Issue Type: Wish
> Components: SQL
> Affects Versions: 1.4.0
> Reporter: Peter Rudenko
> Priority: Minor
>
> Rdd has a setName method, so in the Spark UI it's easier to understand what a cache is for (e.g. "data for LogisticRegression model", etc.). It would be nice to have the same method for Dataframe, since the cache page displays a logical schema, which could be quite big.
[jira] [Commented] (SPARK-8480) Add setName for Dataframe
[ https://issues.apache.org/jira/browse/SPARK-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827515#comment-15827515 ]

Kaushal Prajapati commented on SPARK-8480:
--

For example, using SparkContext we can list all cached RDDs:
{code:title=Code|borderStyle=solid}
scala> val rdd = sc.range(1,1000)
scala> rdd.setName("myRdd")
scala> rdd.cache
scala> rdd.count
scala> sc.getPersistentRDDs.foreach(println)
(9,kaushal MapPartitionsRDD[9] at rdd at <console>:27)
(11,myRdd MapPartitionsRDD[11] at range at <console>:27)

sc.getPersistentRDDs.filter(_._2.name == "myRdd").foreach(_._2.unpersist())

scala> sc.getPersistentRDDs.foreach(println)
(9,kaushal MapPartitionsRDD[9] at rdd at <console>:27)
{code}
This lets us unpersist any RDD by a valid name. Likewise for Datasets: if we were able to list all cached Datasets with their corresponding names, it would be a good option to unpersist any Dataset by a particular name.

> Add setName for Dataframe
> -
>
> Key: SPARK-8480
> URL: https://issues.apache.org/jira/browse/SPARK-8480
> Project: Spark
> Issue Type: Wish
> Components: SQL
> Affects Versions: 1.4.0
> Reporter: Peter Rudenko
> Priority: Minor
>
> Rdd has a setName method, so in the Spark UI it's easier to understand what a cache is for (e.g. "data for LogisticRegression model", etc.). It would be nice to have the same method for Dataframe, since the cache page displays a logical schema, which could be quite big.
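The `getPersistentRDDs` pattern in the comment above (filter the id-to-RDD map by name, then unpersist the matches) can be mimicked with any id-to-object map. A minimal sketch in plain Python, where the hypothetical `FakeRDD` class and `persistent` dict stand in for Spark RDDs and `sc.getPersistentRDDs`:

```python
class FakeRDD:
    """Stand-in for a named, cached Spark RDD."""
    def __init__(self, name):
        self.name = name
        self.persisted = True
    def unpersist(self):
        self.persisted = False

# Stand-in for sc.getPersistentRDDs: map of RDD id -> cached RDD.
persistent = {9: FakeRDD("kaushal"), 11: FakeRDD("myRdd")}

# Same shape as:
#   sc.getPersistentRDDs.filter(_._2.name == "myRdd").foreach(_._2.unpersist())
for rdd in (r for r in persistent.values() if r.name == "myRdd"):
    rdd.unpersist()

# Only still-persisted entries remain, as in the second foreach(println).
persistent = {k: r for k, r in persistent.items() if r.persisted}
print(sorted(r.name for r in persistent.values()))  # ['kaushal']
```

The catch the comment implies: nothing makes `name` unique, so filtering by name may unpersist several RDDs at once, which is why the ticket asks for enforced uniqueness before a name-based `uncacheDataSet` can be trusted.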
[jira] [Commented] (SPARK-8480) Add setName for Dataframe
[ https://issues.apache.org/jira/browse/SPARK-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15825745#comment-15825745 ]

Kaushal Prajapati commented on SPARK-8480:
--

Nice feature, but it would be better to also have an option to unpersist a DataFrame or Dataset using the same name.

> Add setName for Dataframe
> -
>
> Key: SPARK-8480
> URL: https://issues.apache.org/jira/browse/SPARK-8480
> Project: Spark
> Issue Type: Wish
> Components: SQL
> Affects Versions: 1.4.0
> Reporter: Peter Rudenko
> Priority: Minor
>
> Rdd has a setName method, so in the Spark UI it's easier to understand what a cache is for (e.g. "data for LogisticRegression model", etc.). It would be nice to have the same method for Dataframe, since the cache page displays a logical schema, which could be quite big.
[jira] [Updated] (SPARK-13111) Spark UI is showing negative number of sessions
[ https://issues.apache.org/jira/browse/SPARK-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kaushal Prajapati updated SPARK-13111:
--
Attachment: spark_ui_session.jpg

> Spark UI is showing negative number of sessions
> ---
>
> Key: SPARK-13111
> URL: https://issues.apache.org/jira/browse/SPARK-13111
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 1.5.0
> Reporter: Kaushal Prajapati
> Attachments: spark_ui_session.jpg
>
> The number of sessions is shown as negative under Spark SQL -> JDBC/ODBC Server.
[jira] [Created] (SPARK-13111) Spark UI is showing negative number of sessions
Kaushal Prajapati created SPARK-13111:
-
Summary: Spark UI is showing negative number of sessions
Key: SPARK-13111
URL: https://issues.apache.org/jira/browse/SPARK-13111
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 1.5.0
Reporter: Kaushal Prajapati

The number of sessions is shown as negative under Spark SQL -> JDBC/ODBC Server.
[jira] [Commented] (SPARK-9345) Failure to cleanup on exceptions causes persistent I/O problems later on
[ https://issues.apache.org/jira/browse/SPARK-9345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642648#comment-14642648 ]

Kaushal Prajapati commented on SPARK-9345:
--

I think this problem is related to the hive-site.xml file; the application context is not able to find that file or its properties.

> Failure to cleanup on exceptions causes persistent I/O problems later on
> -
>
> Key: SPARK-9345
> URL: https://issues.apache.org/jira/browse/SPARK-9345
> Project: Spark
> Issue Type: Bug
> Affects Versions: 1.3.1, 1.4.0, 1.4.1
> Environment: Ubuntu on AWS
> Reporter: Simeon Simeonov
> Priority: Minor
>
> When using spark-shell in local mode, I've observed the following behavior on a number of nodes:
> # Some operation generates an exception related to Spark SQL processing via {{HiveContext}}.
> # From that point on, nothing could be written to Hive with {{saveAsTable}}.
> # Another identically-configured version of Spark on the same machine may not exhibit the problem initially but, with enough exceptions, it will start exhibiting the problem also.
> # A new identically-configured installation of the same version on the same machine would exhibit the problem.
> The error is always related to inability to create a temporary folder on HDFS:
> {code}
> 15/07/25 16:03:35 ERROR InsertIntoHadoopFsRelation: Aborting task.
> java.io.IOException: Mkdirs failed to create file:/user/hive/warehouse/test/_temporary/0/_temporary/attempt_201507251603_0001_m_01_0 (exists=false, cwd=file:/home/ubuntu)
>   at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
>   at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
>   at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:154)
>   at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:279)
>   at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
>   at org.apache.spark.sql.parquet.ParquetOutputWriter.<init>(newParquet.scala:83)
>   at org.apache.spark.sql.parquet.ParquetRelation2$$anon$4.newInstance(newParquet.scala:229)
>   at org.apache.spark.sql.sources.DefaultWriterContainer.initWriters(commands.scala:470)
>   at org.apache.spark.sql.sources.BaseWriterContainer.executorSideSetup(commands.scala:360)
>   at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:172)
>   at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
>   at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> ...
> {code}
> The behavior does not seem related to HDFS as it persists even if the HDFS volume is reformatted.
> The behavior is difficult to reproduce reliably but consistently observable with sufficient Spark SQL experimentation (dozens of exceptions arising from Spark SQL processing). The likelihood of this happening goes up substantially if some Spark SQL operation runs out of memory, which suggests that the problem is related to cleanup.
> In this gist ([https://gist.github.com/ssimeonov/72a64947bc33628d2d11]) you can see how on the same machine, identically configured 1.3.1 and 1.4.1 versions sharing the same HDFS system and Hive metastore behave differently. 1.3.1 can write to Hive. 1.4.1 cannot. The behavior started happening on 1.4.1 after an out of memory exception on a large job.
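The failure mode reported above (an exception leaves a `_temporary` staging directory behind, and the debris breaks later writes) is the classic argument for releasing staging directories in a finally block. A general-purpose sketch of that pattern in plain Python, not the actual Spark fix; `write_with_cleanup` and `bad_write` are hypothetical names:

```python
import os
import shutil
import tempfile

def write_with_cleanup(write_fn):
    """Create a _temporary staging dir, run the write, and always
    remove the staging dir -- even when write_fn raises -- so a
    failed task cannot leave debris that breaks later writes."""
    staging = tempfile.mkdtemp(prefix="_temporary-")
    try:
        return write_fn(staging)
    finally:
        # Runs on success and on exception alike.
        shutil.rmtree(staging, ignore_errors=True)

# A failing write still cleans up its staging directory.
leaked = []
def bad_write(staging):
    leaked.append(staging)
    raise IOError("simulated task failure")

try:
    write_with_cleanup(bad_write)
except IOError:
    pass
print(os.path.exists(leaked[0]))  # False
```

The `ignore_errors=True` choice mirrors best-effort cleanup: a failed delete should be logged rather than allowed to mask the original task exception.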