[jira] [Updated] (SPARK-30002) Reuse SparkSession in pyspark via Gateway

2019-11-22 Thread Kaushal Prajapati (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaushal Prajapati updated SPARK-30002:
--
Description: 
In PySpark, the SparkContext is created from the user-supplied Spark configuration (or the defaults), and it is launched internally through the py4j gateway.

Let's say I have launched the py4j gateway from another application; to communicate with that same gateway, I have to set the following configuration:

 
{code:java}
export PYSPARK_GATEWAY_PORT=12345
export PYSPARK_GATEWAY_SECRET=***
{code}
 

So when PySpark tries to create its own SparkContext after the communication has been set up, it does not check whether a SparkContext is already available in the same JVM.

Current code snippet:

 
{code:python}
def _initialize_context(self, jconf):
    """
    Initialize SparkContext in function to allow subclass specific
    initialization
    """
    return self._jvm.JavaSparkContext(jconf)
{code}
 

 

After changing it to the following, it works fine for me.
{code:python}
def _initialize_context(self, jconf):
    """
    Initialize SparkContext in function to allow subclass specific
    initialization
    """
    return self._jvm.JavaSparkContext(
        self._jvm.org.apache.spark.SparkContext.getOrCreate(jconf))
{code}
 

It looks like a good use case for improvement.
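For illustration, a minimal driver-side sketch of the intended flow, assuming an external JVM application has already created a SparkContext and started a secured py4j gateway, and assuming the getOrCreate-based change above is in place (the port and secret values are placeholders):
{code:python}
import os

from pyspark import SparkConf, SparkContext

# Placeholders: these must match the gateway started by the external JVM application.
os.environ["PYSPARK_GATEWAY_PORT"] = "12345"
os.environ["PYSPARK_GATEWAY_SECRET"] = "***"

# PySpark attaches to the existing py4j gateway instead of spawning a new JVM.
# With the getOrCreate-based _initialize_context, the returned JavaSparkContext
# wraps the SparkContext that already lives in that JVM rather than trying to
# create a second one.
sc = SparkContext(conf=SparkConf())
print(sc.applicationId)
{code}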


> Reuse SparkSession in pyspark via Gateway
> -
>
> Key: SPARK-30002
> URL: https://issues.apache.org/jira/browse/SPARK-30002
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0, 2.4.4
>Reporter: Kaushal Prajapati
>Priority: Minor
>






[jira] [Created] (SPARK-30002) Reuse SparkSession in pyspark via Gateway

2019-11-22 Thread Kaushal Prajapati (Jira)
Kaushal Prajapati created SPARK-30002:
-

 Summary: Reuse SparkSession in pyspark via Gateway
 Key: SPARK-30002
 URL: https://issues.apache.org/jira/browse/SPARK-30002
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.4, 2.4.0
Reporter: Kaushal Prajapati








[jira] [Reopened] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2

2017-11-06 Thread Kaushal Prajapati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaushal Prajapati reopened SPARK-22458:
---

It works with the same configs on Spark 2.1, so why not on Spark 2.2? Why does Spark 2.2 demand extra memory, and how much extra?

> OutOfDirectMemoryError with Spark 2.2
> -
>
> Key: SPARK-22458
> URL: https://issues.apache.org/jira/browse/SPARK-22458
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL, YARN
>Affects Versions: 2.2.0
>Reporter: Kaushal Prajapati
>Priority: Blocker
>

[jira] [Commented] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2

2017-11-06 Thread Kaushal Prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240537#comment-16240537
 ] 

Kaushal Prajapati commented on SPARK-22458:
---

Yes, but we tried multiple times; the same code works with Spark 2.1 but not with Spark 2.2.

> OutOfDirectMemoryError with Spark 2.2
> -
>
> Key: SPARK-22458
> URL: https://issues.apache.org/jira/browse/SPARK-22458
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL, YARN
>Affects Versions: 2.2.0
>Reporter: Kaushal Prajapati
>Priority: Blocker
>

[jira] [Updated] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2

2017-11-06 Thread Kaushal Prajapati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaushal Prajapati updated SPARK-22458:
--
Description: 
We had been using Spark 2.1 for the last 6 months to run multiple Spark jobs, each running for about 15 hours over 50+ TB of source data, successfully with the configurations below.


{noformat}
spark.master                                       yarn
spark.driver.cores                                 10
spark.driver.maxResultSize                         5g
spark.driver.memory                                20g
spark.executor.cores                               5
spark.executor.extraJavaOptions                    -XX:+UseG1GC -Dio.netty.maxDirectMemory=1024 -XX:MaxGCPauseMillis=6 -XX:MaxDirectMemorySize=2048m -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
spark.driver.extraJavaOptions                      -Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
spark.executor.instances                           30
spark.executor.memory                              30g
spark.kryoserializer.buffer.max                    512m
spark.network.timeout                              12000s
spark.serializer                                   org.apache.spark.serializer.KryoSerializer
spark.shuffle.io.preferDirectBufs                  false
spark.sql.catalogImplementation                    hive
spark.sql.shuffle.partitions                       5000
spark.yarn.driver.memoryOverhead                   1536
spark.yarn.executor.memoryOverhead                 4096
spark.core.connection.ack.wait.timeout             600s
spark.scheduler.maxRegisteredResourcesWaitingTime  15s
spark.sql.hive.filesourcePartitionFileCacheSize    524288000
spark.dynamicAllocation.executorIdleTimeout        3s
spark.dynamicAllocation.enabled                    true
spark.hadoop.yarn.timeline-service.enabled         false
spark.shuffle.service.enabled                      true
spark.yarn.am.extraJavaOptions                     -Dhdp.version=2.5.3.0-37 -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m
{noformat}
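For reference, a minimal PySpark sketch of how the direct-memory-related properties above would be set programmatically (the values are the ones from the list above, shown purely for illustration, not as a recommendation):
{code:python}
from pyspark.sql import SparkSession

# Values copied from the configuration list above, purely for illustration.
executor_opts = ("-XX:+UseG1GC -Dio.netty.maxDirectMemory=1024 "
                 "-XX:MaxDirectMemorySize=2048m")
# Note: driver JVM options normally have to be supplied before the driver JVM
# starts (spark-defaults.conf or spark-submit); shown here only for the name.
driver_opts = "-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m"

spark = (SparkSession.builder
         .appName("direct-memory-settings-example")
         .config("spark.executor.extraJavaOptions", executor_opts)
         .config("spark.driver.extraJavaOptions", driver_opts)
         .config("spark.yarn.executor.memoryOverhead", "4096")
         .config("spark.shuffle.io.preferDirectBufs", "false")
         .getOrCreate())
{code}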


Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to pick up some fixes from the latest version, but we started facing DirectBuffer OutOfMemory errors and executors exceeding their memoryOverhead limits. To fix that we started tweaking multiple properties, but the issue still persists. The relevant information is shared below.

Please let me know if any other details are required.

Snapshot of the DirectMemory error stack trace:

{code:java}
10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in 
stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): 
FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), 
shuffleId=7, mapId=141, reduceId=3372, message=
org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) 
of direct memory (used: 1073699840, max: 1073741824)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(Unsafe

[jira] [Created] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2

2017-11-06 Thread Kaushal Prajapati (JIRA)
Kaushal Prajapati created SPARK-22458:
-

 Summary: OutOfDirectMemoryError with Spark 2.2
 Key: SPARK-22458
 URL: https://issues.apache.org/jira/browse/SPARK-22458
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, SQL, YARN
Affects Versions: 2.2.0
Reporter: Kaushal Prajapati
Priority: Blocker



[jira] [Commented] (SPARK-22259) hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [5, 28, 21, 12]

2017-10-12 Thread Kaushal Prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202052#comment-16202052
 ] 

Kaushal Prajapati commented on SPARK-22259:
---

It seems like the file you are reading is not a Parquet file.
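For example, one quick way to confirm this is to check for the 4-byte Parquet magic "PAR1" (bytes [80, 65, 82, 49]) at the head and tail of the file; a rough sketch for a local copy of the file (the path is a placeholder, and an HDFS client would be needed to read an hdfs:// URI directly):
{code:python}
def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic 'PAR1'."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # seek to the last 4 bytes
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"

# Placeholder path; copy the _metadata file locally first, or read the bytes
# through an HDFS client for hdfs://HdfsHA/... paths.
print(looks_like_parquet("/tmp/sspstatistic/_metadata"))
{code}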

>  hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file. 
> expected magic number at tail [80, 65, 82, 49] but found [5, 28, 21, 12]
> --
>
> Key: SPARK-22259
> URL: https://issues.apache.org/jira/browse/SPARK-22259
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
> Environment: spark1.6.1,YARN2.6.0
>Reporter: Liu Dinghua
>
> my code with errors is as follows:
> sqlContext.read.parquet("hdfs://HdfsHA/logrep/1/sspstatistic/")
> java.io.IOException: Could not read footer: java.lang.RuntimeException: 
> hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file (too 
> small)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:247)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$27.apply(ParquetRelation.scala:786)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$27.apply(ParquetRelation.scala:775)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: 
> hdfs://HdfsHA/logrep/1/sspstatistic/_metadata is not a Parquet file (too 
> small)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:412)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:237)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:233)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> 17/10/12 10:41:04 INFO scheduler.TaskSetManager: Starting task 1.1 in stage 
> 1.0 (TID 4, slave05, partition 1,PROCESS_LOCAL, 968186 bytes)






[jira] [Commented] (SPARK-19908) Direct buffer memory OOM should not cause stage retries.

2017-07-20 Thread Kaushal Prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095203#comment-16095203
 ] 

Kaushal Prajapati commented on SPARK-19908:
---

[~bojanbabic] This error occurs when the data being serialized exceeds the Kryo buffer's default maximum (64m). Increase the limit by setting this property to a larger value (64m is the default):
{noformat}
spark.kryoserializer.buffer.max=64m
{noformat}
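For example, in PySpark it could be raised like this (512m is just an illustrative value; it must cover the largest object you serialize and stay below the 2048m hard limit):
{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.serializer",
                 "org.apache.spark.serializer.KryoSerializer")
         .config("spark.kryoserializer.buffer.max", "512m")  # illustrative value
         .getOrCreate())
{code}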


> Direct buffer memory OOM should not cause stage retries.
> 
>
> Key: SPARK-19908
> URL: https://issues.apache.org/jira/browse/SPARK-19908
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Priority: Minor
>
> Currently if there is  java.lang.OutOfMemoryError: Direct buffer memory, the 
> exception will be changed to FetchFailedException, causing stage retries.
> org.apache.spark.shuffle.FetchFailedException: Direct buffer memory
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:40)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:731)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:692)
>   at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:854)
>   at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:887)
>   at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:658)
>   at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645)
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:212)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
>   at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
>   at 
> io.netty.buffer.AbstractByteB

[jira] [Commented] (SPARK-21316) Dataset Union output is not consistent with the column sequence

2017-07-06 Thread Kaushal Prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16076337#comment-16076337
 ] 

Kaushal Prajapati commented on SPARK-21316:
---

[~dongjoon] It works when the column names are specified in alphabetical order.

What if my column names and the schema are not in sync with that order? In that case the output changes again. In such cases it also becomes too cumbersome to keep track of the column sequence and its alphabetical order. If I rename 'name' to 'a' and 'age' to 'b', the above code works because the columns are then in alphabetical order.

Adding to this, when both of my datasets have the same schema (Person.class), why should the column order even be considered while taking the union? In my understanding it should not be.

{code:java}
ds1.select("name","age").as(Encoders.bean(Person.class)).union(ds2).show();
{code}
In the above snippet I am creating a dataset of rows via column selection and then converting it back to the Person.class schema, so the order should not matter.
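For comparison, Spark 2.3.0 and later expose unionByName, which resolves columns by name rather than by position; a small PySpark sketch (with illustrative data) of the difference:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([("kaushal", "25"), ("aman", "26")], ["name", "age"])
df2 = spark.createDataFrame([("25", "sapan"), ("26", "yati")], ["age", "name"])

# Positional union: the second frame's 'age' values land in the 'name' column.
df1.union(df2).show()

# Name-based union (Spark 2.3.0+): columns are matched by name, so the rows
# line up regardless of the column order in each DataFrame.
df1.unionByName(df2).show()
{code}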

> Dataset Union output is not consistent with the column sequence
> ---
>
> Key: SPARK-21316
> URL: https://issues.apache.org/jira/browse/SPARK-21316
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0
>Reporter: Kaushal Prajapati
>  Labels: patch
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>






[jira] [Created] (SPARK-21316) Dataset Union output is not consistent with the column sequence

2017-07-05 Thread Kaushal Prajapati (JIRA)
Kaushal Prajapati created SPARK-21316:
-

 Summary: Dataset Union output is not consistent with the column 
sequence
 Key: SPARK-21316
 URL: https://issues.apache.org/jira/browse/SPARK-21316
 Project: Spark
  Issue Type: Bug
  Components: Optimizer, SQL
Affects Versions: 2.1.0
Reporter: Kaushal Prajapati
Priority: Critical


If I take a union of two datasets with the same schema, the output should 
remain the same even if I change the sequence of columns while creating the 
dataset.

I am attaching a code snippet for details.

{code:java}
public class Person{
  public String name;
  public String age;

  public Person(String name, String age) {
    this.name = name;
    this.age = age;
  }

  public String getName() {return name;}
  public void setName(String name) {this.name = name;}
  public String getAge() {return age;}
  public void setAge(String age) {this.age = age;}
}
{code}


{code:java}
public class Test {
  public static void main(String arg[]) throws Exception {
    SparkSession spark = SparkConnection.getSpark();

    List<Person> list1 = new ArrayList<>();
    list1.add(new Person("kaushal", "25"));
    list1.add(new Person("aman", "26"));

    List<Person> list2 = new ArrayList<>();
    list2.add(new Person("sapan", "25"));
    list2.add(new Person("yati", "26"));

    Dataset<Person> ds1 = spark.createDataset(list1, Encoders.bean(Person.class));
    Dataset<Person> ds2 = spark.createDataset(list2, Encoders.bean(Person.class));
    ds1.show();
    ds2.show();
    ds1.select("name", "age").as(Encoders.bean(Person.class)).union(ds2).show();
  }
}
{code}

output :-

{code:java}
+---+---+
|age|   name|
+---+---+
| 25|kaushal|
| 26|   aman|
+---+---+

+---+-+
|age| name|
+---+-+
| 25|sapan|
| 26| yati|
+---+-+

+---+-+
|   name|  age|
+---+-+
|kaushal|   25|
|   aman|   26|
| 25|sapan|
| 26| yati|
+---+-+
{code}







[jira] [Commented] (SPARK-19908) Direct buffer memory OOM should not cause stage retries.

2017-06-29 Thread Kaushal Prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068106#comment-16068106
 ] 

Kaushal Prajapati commented on SPARK-19908:
---

[~zhanzhang] Can you please share some example code for which you are getting 
this error?

> Direct buffer memory OOM should not cause stage retries.
> 
>
> Key: SPARK-19908
> URL: https://issues.apache.org/jira/browse/SPARK-19908
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Priority: Minor
>
> Currently if there is  java.lang.OutOfMemoryError: Direct buffer memory, the 
> exception will be changed to FetchFailedException, causing stage retries.
> org.apache.spark.shuffle.FetchFailedException: Direct buffer memory
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:40)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:731)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:692)
>   at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:854)
>   at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:887)
>   at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:658)
>   at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645)
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:212)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
>   at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
>   at 
> io.netty.buffer.AbstractByteBufAll

[jira] [Commented] (SPARK-19289) UnCache Dataset using Name

2017-01-22 Thread Kaushal Prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833983#comment-15833983
 ] 

Kaushal Prajapati commented on SPARK-19289:
---

[~srowen], yes, you are right that names are not necessarily unique. Still, if 
we were able to list all Datasets, the same way *sc.getPersistentRDDs* lists 
all persisted RDDs, it would be useful: if I maintain unique names for the 
Datasets in an application I can easily uncache one by name, and if a group of 
Datasets shares the same name I can uncache all of them with this feature.
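
In the meantime, a minimal application-side sketch of the idea (DatasetRegistry 
is a hypothetical helper, not an existing Spark API): keep a name -> Dataset 
map so cached Datasets can be unpersisted by name.

{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.Dataset;

// Hypothetical helper, not a Spark API: an application-side registry that maps
// a name to a cached Dataset so it can later be unpersisted by that name.
public class DatasetRegistry {
  private final Map<String, Dataset<?>> cached = new HashMap<>();

  public <T> Dataset<T> cacheAs(String name, Dataset<T> ds) {
    cached.put(name, ds.cache()); // cache() marks the Dataset for caching and returns it
    return ds;
  }

  public void uncache(String name) {
    Dataset<?> ds = cached.remove(name);
    if (ds != null) {
      ds.unpersist(); // drops the cached data for this Dataset
    }
  }
}
{code}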

> UnCache Dataset using Name
> --
>
> Key: SPARK-19289
> URL: https://issues.apache.org/jira/browse/SPARK-19289
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Kaushal Prajapati
>Priority: Minor
>  Labels: features
>
> We can cache and uncache any table using its name in Spark SQL.
> {code}
> df.createTempView("myTable")
> sqlContext.cacheTable("myTable")
> sqlContext.uncacheTable("myTable")
> {code}
> Likewise, if it were possible to have some kind of uniqueness for names in 
> Datasets and an abstraction like the one we have for tables, it would be very 
> useful:
> {code}
> scala> val df = sc.range(1,1000).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> df.setName("MyDataset")
> res0: df.type = MyDataset
> scala> df.cache
> res1: df.type = MyDataset
> sqlContext.getDataSet("MyDataset")
> sqlContext.uncacheDataSet("MyDataset")
> {code}






[jira] [Updated] (SPARK-19289) UnCache Dataset using Name

2017-01-19 Thread Kaushal Prajapati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaushal Prajapati updated SPARK-19289:
--
External issue ID: Related to -  
https://issues.apache.org/jira/browse/SPARK-8480  (was: 
https://issues.apache.org/jira/browse/SPARK-8480)

> UnCache Dataset using Name
> --
>
> Key: SPARK-19289
> URL: https://issues.apache.org/jira/browse/SPARK-19289
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Kaushal Prajapati
>Priority: Minor
>  Labels: features
>
> We can cache and uncache any table using its name in Spark SQL.
> {code}
> df.createTempView("myTable")
> sqlContext.cacheTable("myTable")
> sqlContext.uncacheTable("myTable")
> {code}
> Likewise, if it were possible to have some kind of uniqueness for names in 
> Datasets and an abstraction like the one we have for tables, it would be very 
> useful:
> {code}
> scala> val df = sc.range(1,1000).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> df.setName("MyDataset")
> res0: df.type = MyDataset
> scala> df.cache
> res1: df.type = MyDataset
> sqlContext.getDataSet("MyDataset")
> sqlContext.uncacheDataSet("MyDataset")
> {code}






[jira] [Created] (SPARK-19289) UnCache Dataset using Name

2017-01-19 Thread Kaushal Prajapati (JIRA)
Kaushal Prajapati created SPARK-19289:
-

 Summary: UnCache Dataset using Name
 Key: SPARK-19289
 URL: https://issues.apache.org/jira/browse/SPARK-19289
 Project: Spark
  Issue Type: Wish
  Components: Spark Core, SQL
Affects Versions: 2.1.0
Reporter: Kaushal Prajapati
Priority: Minor


We can cache and uncache any table using its name in Spark SQL.

{code}
df.createTempView("myTable")
sqlContext.cacheTable("myTable")
sqlContext.uncacheTable("myTable")
{code}

Likewise, if it were possible to have some kind of uniqueness for names in 
Datasets and an abstraction like the one we have for tables, it would be very 
useful:

{code}
scala> val df = sc.range(1,1000).toDF
df: org.apache.spark.sql.DataFrame = [value: bigint]

scala> df.setName("MyDataset")
res0: df.type = MyDataset

scala> df.cache
res1: df.type = MyDataset

sqlContext.getDataSet("MyDataset")
sqlContext.uncacheDataSet("MyDataset")
{code}






[jira] [Commented] (SPARK-8480) Add setName for Dataframe

2017-01-18 Thread Kaushal Prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827980#comment-15827980
 ] 

Kaushal Prajapati commented on SPARK-8480:
--

Yes [~emlyn], it's correct that the name is not a unique identifier, but it 
would help if we could have some kind of uniqueness for names, or an 
abstraction like the one we have for tables.

{code}
df.createTempView("myTable")
sqlContext.cacheTable("myTable")
sqlContext.uncacheTable("myTable")
{code} 

The same for Datasets would be very useful:
{code}
scala> val df = sc.range(1,1000).toDF
df: org.apache.spark.sql.DataFrame = [value: bigint]

scala> df.setName("MyDataset")
res0: df.type = MyDataset

scala> df.cache
res1: df.type = MyDataset

sqlContext.getDataSet("MyDataset")
sqlContext.uncacheDataSet("MyDataset")
{code} 

> Add setName for Dataframe
> -
>
> Key: SPARK-8480
> URL: https://issues.apache.org/jira/browse/SPARK-8480
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Peter Rudenko
>Priority: Minor
>
> Rdd has a method setName, so in spark UI, it's more easily to understand 
> what's this cache for. E.g. ("data for LogisticRegression model", etc.). 
> Would be nice to have the same method for Dataframe, since it displays a 
> logical schema, in cache page, which could be quite big.






[jira] [Commented] (SPARK-8480) Add setName for Dataframe

2017-01-17 Thread Kaushal Prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827515#comment-15827515
 ] 

Kaushal Prajapati commented on SPARK-8480:
--

For example, using SparkContext we can get all cached RDDs:

{code:title=Code|borderStyle=solid}
scala> val rdd = sc.range(1,1000)
scala> rdd.setName("myRdd")
scala> rdd.cache
scala> rdd.count

scala> sc.getPersistentRDDs.foreach(println)
(9,kaushal MapPartitionsRDD[9] at rdd at <console>:27)
(11,myRdd MapPartitionsRDD[11] at range at <console>:27)

sc.getPersistentRDDs.filter(_._2.name == "myRdd").foreach(_._2.unpersist())

scala> sc.getPersistentRDDs.foreach(println)
(9,kaushal MapPartitionsRDD[9] at rdd at <console>:27)
{code}

And we can unpersist any RDD that has a valid name.
Likewise for Datasets: if we were able to list all cached Datasets with their 
corresponding names, it would be a good option to unpersist any Dataset by a 
particular name.
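
For tables and temp views there is already something close in the Spark 2.x 
Catalog API; a rough sketch is below. It does not see Datasets cached via 
ds.cache() without a temp view, which is exactly the gap this wish is about.

{code:java}
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalog.Table;

// Sketch using the Spark 2.x Catalog API: list registered tables/views, report
// which ones are cached, and uncache them by name.
public class ListCachedTables {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("list-cached-tables").getOrCreate();
    for (Table t : spark.catalog().listTables().collectAsList()) {
      if (spark.catalog().isCached(t.name())) {
        System.out.println("cached: " + t.name());
        spark.catalog().uncacheTable(t.name());
      }
    }
    spark.stop();
  }
}
{code}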

> Add setName for Dataframe
> -
>
> Key: SPARK-8480
> URL: https://issues.apache.org/jira/browse/SPARK-8480
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Peter Rudenko
>Priority: Minor
>
> Rdd has a method setName, so in spark UI, it's more easily to understand 
> what's this cache for. E.g. ("data for LogisticRegression model", etc.). 
> Would be nice to have the same method for Dataframe, since it displays a 
> logical schema, in cache page, which could be quite big.






[jira] [Commented] (SPARK-8480) Add setName for Dataframe

2017-01-17 Thread Kaushal Prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15825745#comment-15825745
 ] 

Kaushal Prajapati commented on SPARK-8480:
--

Nice feature, but it would be better to also have an option to unpersist my 
DataFrame or Dataset using the same name.

> Add setName for Dataframe
> -
>
> Key: SPARK-8480
> URL: https://issues.apache.org/jira/browse/SPARK-8480
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Peter Rudenko
>Priority: Minor
>
> Rdd has a method setName, so in spark UI, it's more easily to understand 
> what's this cache for. E.g. ("data for LogisticRegression model", etc.). 
> Would be nice to have the same method for Dataframe, since it displays a 
> logical schema, in cache page, which could be quite big.






[jira] [Updated] (SPARK-13111) Spark UI is showing negative number of sessions

2016-02-01 Thread Kaushal Prajapati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaushal Prajapati updated SPARK-13111:
--
Attachment: spark_ui_session.jpg

> Spark UI is showing negative number of sessions
> ---
>
> Key: SPARK-13111
> URL: https://issues.apache.org/jira/browse/SPARK-13111
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Kaushal Prajapati
> Attachments: spark_ui_session.jpg
>
>
> The number of sessions is showing as negative in Spark SQL -> JDBC/ODBC Server.






[jira] [Created] (SPARK-13111) Spark UI is showing negative number of sessions

2016-02-01 Thread Kaushal Prajapati (JIRA)
Kaushal Prajapati created SPARK-13111:
-

 Summary: Spark UI is showing negative number of sessions
 Key: SPARK-13111
 URL: https://issues.apache.org/jira/browse/SPARK-13111
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.5.0
Reporter: Kaushal Prajapati


The number of sessions is showing as negative in Spark SQL -> JDBC/ODBC Server.






[jira] [Commented] (SPARK-9345) Failure to cleanup on exceptions causes persistent I/O problems later on

2015-07-27 Thread Kaushal Prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642648#comment-14642648
 ] 

Kaushal Prajapati commented on SPARK-9345:
--

I think this problem is related to the hive-site.xml file. The application 
context is not able to find that file or its properties.
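
A small diagnostic sketch for this hypothesis (assuming Spark 1.3/1.4 with a 
HiveContext): print the warehouse directory the context actually resolves. If 
hive-site.xml is not picked up, it stays at the local default, which would 
match the Mkdirs failure on file:/user/hive/warehouse quoted below.

{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

// Diagnostic sketch only (the root cause here is not confirmed): check which
// warehouse directory the HiveContext resolves when hive-site.xml may or may
// not be on the classpath.
public class WarehouseDirCheck {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("warehouse-dir-check");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    HiveContext hive = new HiveContext(jsc.sc());
    System.out.println(hive.getConf("hive.metastore.warehouse.dir", "<not set>"));
    jsc.stop();
  }
}
{code}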

> Failure to cleanup on exceptions causes persistent I/O problems later on
> 
>
> Key: SPARK-9345
> URL: https://issues.apache.org/jira/browse/SPARK-9345
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.3.1, 1.4.0, 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>Priority: Minor
>
> When using spark-shell in local mode, I've observed the following behavior on 
> a number of nodes:
> # Some operation generates an exception related to Spark SQL processing via 
> {{HiveContext}}.
> # From that point on, nothing could be written to Hive with {{saveAsTable}}.
> # Another identically-configured version of Spark on the same machine may not 
> exhibit the problem initially but, with enough exceptions, it will start 
> exhibiting the problem also.
> # A new identically-configured installation of the same version on the same 
> machine would exhibit the problem.
> The error is always related to inability to create a temporary folder on HDFS:
> {code}
> 15/07/25 16:03:35 ERROR InsertIntoHadoopFsRelation: Aborting task.
> java.io.IOException: Mkdirs failed to create 
> file:/user/hive/warehouse/test/_temporary/0/_temporary/attempt_201507251603_0001_m_01_0
>  (exists=false, cwd=file:/home/ubuntu)
>   at 
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
>   at 
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
>   at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:154)
>   at 
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:279)
>   at 
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
>   at 
> org.apache.spark.sql.parquet.ParquetOutputWriter.<init>(newParquet.scala:83)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$$anon$4.newInstance(newParquet.scala:229)
>   at 
> org.apache.spark.sql.sources.DefaultWriterContainer.initWriters(commands.scala:470)
>   at 
> org.apache.spark.sql.sources.BaseWriterContainer.executorSideSetup(commands.scala:360)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:172)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> ...
> {code}
> The behavior does not seem related to HDFS as it persists even if the HDFS 
> volume is reformatted. 
> The behavior is difficult to reproduce reliably but consistently observable 
> with sufficient Spark SQL experimentation (dozens of exceptions arising from 
> Spark SQL processing). 
> The likelihood of this happening goes up substantially if some Spark SQL 
> operation runs out of memory, which suggests
> that the problem is related to cleanup.
> In this gist ([https://gist.github.com/ssimeonov/72a64947bc33628d2d11]) you 
> can see how on the same machine, identically configured 1.3.1 and 1.4.1 
> versions sharing the same HDFS system and Hive metastore, behave differently. 
> 1.3.1 can write to Hive. 1.4.1 cannot. The behavior started happening on 
> 1.4.1 after an out of memory exception on a large job. 


