[jira] [Assigned] (SPARK-5972) Cache residuals for GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5972: --- Assignee: (was: Apache Spark) > Cache residuals for GradientBoostedTrees during training > > > Key: SPARK-5972 > URL: https://issues.apache.org/jira/browse/SPARK-5972 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor > > In gradient boosting, the current model's prediction is re-computed for each > training instance on every iteration. The current residual (cumulative > prediction of previously trained trees in the ensemble) should be cached. > That could reduce both computation (only computing the prediction of the most > recently trained tree) and communication (only sending the most recently > trained tree to the workers). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5972) Cache residuals for GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5972: --- Assignee: Apache Spark > Cache residuals for GradientBoostedTrees during training > > > Key: SPARK-5972 > URL: https://issues.apache.org/jira/browse/SPARK-5972 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > In gradient boosting, the current model's prediction is re-computed for each > training instance on every iteration. The current residual (cumulative > prediction of previously trained trees in the ensemble) should be cached. > That could reduce both computation (only computing the prediction of the most > recently trained tree) and communication (only sending the most recently > trained tree to the workers). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
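A minimal sketch of the caching idea described above, with hypothetical helper names rather than the actual MLlib internals: the cumulative prediction of the trees trained so far is kept as a cached RDD, so each boosting iteration only evaluates the most recently trained tree and only that tree needs to be shipped to the workers.
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.model.DecisionTreeModel

// Hypothetical helper: update the cached cumulative predictions with only the
// newest tree, then cache the result and drop the previous iteration's RDD.
def updateCachedPredictions(
    data: RDD[LabeledPoint],
    cachedPredictions: RDD[Double],   // cumulative prediction of trees 1..m-1
    newTree: DecisionTreeModel,
    treeWeight: Double): RDD[Double] = {
  val updated = cachedPredictions.zip(data).map { case (pred, point) =>
    pred + treeWeight * newTree.predict(point.features)
  }
  updated.cache()
  cachedPredictions.unpersist(blocking = false)
  updated
}
{code}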
[jira] [Created] (SPARK-6676) Add hadoop 2.4+ for profiles in POM.xml
Zhang, Liye created SPARK-6676: -- Summary: Add hadoop 2.4+ for profiles in POM.xml Key: SPARK-6676 URL: https://issues.apache.org/jira/browse/SPARK-6676 Project: Spark Issue Type: Improvement Components: Build, Tests Affects Versions: 1.3.0 Reporter: Zhang, Liye Priority: Minor support *-Phadoop-2.5* and *-Phadoop-2.6* when building and testing Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6676) Add hadoop 2.4+ for profiles in POM.xml
[ https://issues.apache.org/jira/browse/SPARK-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392466#comment-14392466 ] Apache Spark commented on SPARK-6676: - User 'liyezhang556520' has created a pull request for this issue: https://github.com/apache/spark/pull/5331 > Add hadoop 2.4+ for profiles in POM.xml > --- > > Key: SPARK-6676 > URL: https://issues.apache.org/jira/browse/SPARK-6676 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 1.3.0 >Reporter: Zhang, Liye >Priority: Minor > > support *-Phadoop-2.5* and *-Phadoop-2.6* when building and testing Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6676) Add hadoop 2.4+ for profiles in POM.xml
[ https://issues.apache.org/jira/browse/SPARK-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6676: --- Assignee: Apache Spark > Add hadoop 2.4+ for profiles in POM.xml > --- > > Key: SPARK-6676 > URL: https://issues.apache.org/jira/browse/SPARK-6676 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 1.3.0 >Reporter: Zhang, Liye >Assignee: Apache Spark >Priority: Minor > > support *-Phadoop-2.5* and *-Phadoop-2.6* when building and testing Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6676) Add hadoop 2.4+ for profiles in POM.xml
[ https://issues.apache.org/jira/browse/SPARK-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6676: --- Assignee: (was: Apache Spark) > Add hadoop 2.4+ for profiles in POM.xml > --- > > Key: SPARK-6676 > URL: https://issues.apache.org/jira/browse/SPARK-6676 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 1.3.0 >Reporter: Zhang, Liye >Priority: Minor > > support *-Phadoop-2.5* and *-Phadoop-2.6* when building and testing Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392474#comment-14392474 ] Sudharma Puranik commented on SPARK-2243: - [~jahubba] : Running on the seperate JVMs is not a workaround but I its about sharing the SparkContext understanding your processor and memory. Nonetheless Running multiple sparkcontexts on JVM is total different aspect which they are still skeptical or whatsoever the reason. But to set the things straight, its not workaround. > Support multiple SparkContexts in the same JVM > -- > > Key: SPARK-2243 > URL: https://issues.apache.org/jira/browse/SPARK-2243 > Project: Spark > Issue Type: New Feature > Components: Block Manager, Spark Core >Affects Versions: 0.7.0, 1.0.0, 1.1.0 >Reporter: Miguel Angel Fernandez Diaz > > We're developing a platform where we create several Spark contexts for > carrying out different calculations. Is there any restriction when using > several Spark contexts? We have two contexts, one for Spark calculations and > another one for Spark Streaming jobs. The next error arises when we first > execute a Spark calculation and, once the execution is finished, a Spark > Streaming job is launched: > {code} > 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) > at > org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) > at > java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) > at > 
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to > java.io.FileNotFoundException > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(Delega
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392475#comment-14392475 ] Sudharma Puranik commented on SPARK-2243: - [~jahubba] : Running on the seperate JVMs is not a workaround but I its about sharing the SparkContext understanding your processor and memory. Nonetheless Running multiple sparkcontexts on JVM is total different aspect which they are still skeptical or whatsoever the reason. But to set the things straight, its not workaround. > Support multiple SparkContexts in the same JVM > -- > > Key: SPARK-2243 > URL: https://issues.apache.org/jira/browse/SPARK-2243 > Project: Spark > Issue Type: New Feature > Components: Block Manager, Spark Core >Affects Versions: 0.7.0, 1.0.0, 1.1.0 >Reporter: Miguel Angel Fernandez Diaz > > We're developing a platform where we create several Spark contexts for > carrying out different calculations. Is there any restriction when using > several Spark contexts? We have two contexts, one for Spark calculations and > another one for Spark Streaming jobs. The next error arises when we first > execute a Spark calculation and, once the execution is finished, a Spark > Streaming job is launched: > {code} > 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) > at > org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) > at > java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) > at > 
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to > java.io.FileNotFoundException > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(Delega
[jira] [Issue Comment Deleted] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sudharma Puranik updated SPARK-2243: Comment: was deleted (was: [~jahubba] : Running on the seperate JVMs is not a workaround but I its about sharing the SparkContext understanding your processor and memory. Nonetheless Running multiple sparkcontexts on JVM is total different aspect which they are still skeptical or whatsoever the reason. But to set the things straight, its not workaround.) > Support multiple SparkContexts in the same JVM > -- > > Key: SPARK-2243 > URL: https://issues.apache.org/jira/browse/SPARK-2243 > Project: Spark > Issue Type: New Feature > Components: Block Manager, Spark Core >Affects Versions: 0.7.0, 1.0.0, 1.1.0 >Reporter: Miguel Angel Fernandez Diaz > > We're developing a platform where we create several Spark contexts for > carrying out different calculations. Is there any restriction when using > several Spark contexts? We have two contexts, one for Spark calculations and > another one for Spark Streaming jobs. The next error arises when we first > execute a Spark calculation and, once the execution is finished, a Spark > Streaming job is launched: > {code} > 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) > at > org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) > at > java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) > at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to > java.io.FileNotFoundException > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.ja
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392519#comment-14392519 ] Sudharma Puranik commented on SPARK-2243: - [~sowen] : My reply was for Jason where he mentioned about workaround. And yes I mean running logically distinct apps with logically distinct configurations only in terms of cores. Yes you are right in regard to running seperate process. Well , again, Spark per se, process is nothing but another SparkContext, which again has boundaries with cores and memory. :) I guess we are in unison in understanding the implications of multiple SparkContexts on same JVM vs sharing multiple sparkContexts across JVMs. > Support multiple SparkContexts in the same JVM > -- > > Key: SPARK-2243 > URL: https://issues.apache.org/jira/browse/SPARK-2243 > Project: Spark > Issue Type: New Feature > Components: Block Manager, Spark Core >Affects Versions: 0.7.0, 1.0.0, 1.1.0 >Reporter: Miguel Angel Fernandez Diaz > > We're developing a platform where we create several Spark contexts for > carrying out different calculations. Is there any restriction when using > several Spark contexts? We have two contexts, one for Spark calculations and > another one for Spark Streaming jobs. The next error arises when we first > execute a Spark calculation and, once the execution is finished, a Spark > Streaming job is launched: > {code} > 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) > at > org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) > at > java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to > java.io.FileNotFoundException > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorIm
[jira] [Resolved] (SPARK-6676) Add hadoop 2.4+ for profiles in POM.xml
[ https://issues.apache.org/jira/browse/SPARK-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6676. -- Resolution: Won't Fix This is already supported by the hadoop-2.4 profile, which means "2.4+". > Add hadoop 2.4+ for profiles in POM.xml > --- > > Key: SPARK-6676 > URL: https://issues.apache.org/jira/browse/SPARK-6676 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 1.3.0 >Reporter: Zhang, Liye >Priority: Minor > > support *-Phadoop-2.5* and *-Phadoop-2.6* when building and testing Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6677) pyspark.sql nondeterministic issue with row fields
Stefano Parmesan created SPARK-6677: --- Summary: pyspark.sql nondeterministic issue with row fields Key: SPARK-6677 URL: https://issues.apache.org/jira/browse/SPARK-6677 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0 Environment: spark version: spark-1.3.0-bin-hadoop2.4 python version: Python 2.7.6 operating system: MacOS, x86_64 x86_64 x86_64 GNU/Linux Reporter: Stefano Parmesan The following issue happens only when running pyspark in the Python interpreter; it works correctly with spark-submit. Reading two JSON files containing objects with different structures sometimes produces wrong Rows, where the fields of one file are used for the other. I was able to write sample code that reproduces this issue one out of three times; the code snippet is available at the following link, together with some (very simple) data samples: https://gist.github.com/armisael/e08bb4567d0a11efe2db -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392499#comment-14392499 ] Sean Owen commented on SPARK-2243: -- The best thing to do is state your use case. Not a workaround for what purpose? Is this just because you think it would save some memory? is it about sharing some non-Spark resource in memory? separating configuration? If you want to run logically distinct apps with logically distinct configuration, IMHO you should be running separate processes as a matter of good design. Yep, you'll end up spending more memory. Sharing non-Spark state makes more sense. I think you can and even should design to not assume these separate jobs share non-persistent state outside of Spark, but that is a legitimate question. But certainly you _can_ do this. I don't think this makes whole classes of usage impossible or even difficult. Which is why I wonder what has "no workaround". Note that sharing resources in the sense of sharing RDDs actually *requires* sharing a {{SparkContext}}. I think cooperating Spark tasks are much more likely to want to cooperate this way? > Support multiple SparkContexts in the same JVM > -- > > Key: SPARK-2243 > URL: https://issues.apache.org/jira/browse/SPARK-2243 > Project: Spark > Issue Type: New Feature > Components: Block Manager, Spark Core >Affects Versions: 0.7.0, 1.0.0, 1.1.0 >Reporter: Miguel Angel Fernandez Diaz > > We're developing a platform where we create several Spark contexts for > carrying out different calculations. Is there any restriction when using > several Spark contexts? We have two contexts, one for Spark calculations and > another one for Spark Streaming jobs. The next error arises when we first > execute a Spark calculation and, once the execution is finished, a Spark > Streaming job is launched: > {code} > 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) > at > org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) > at > 
java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss
[jira] [Resolved] (SPARK-6672) createDataFrame from RDD[Row] with UDTs cannot be saved
[ https://issues.apache.org/jira/browse/SPARK-6672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6672. --- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5329 [https://github.com/apache/spark/pull/5329] > createDataFrame from RDD[Row] with UDTs cannot be saved > --- > > Key: SPARK-6672 > URL: https://issues.apache.org/jira/browse/SPARK-6672 > Project: Spark > Issue Type: Bug > Components: MLlib, SQL >Affects Versions: 1.3.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.4.0 > > > Reported by Jaonary > (https://www.mail-archive.com/user@spark.apache.org/msg25218.html): > {code} > import org.apache.spark.mllib.linalg._ > import org.apache.spark.mllib.regression._ > val df0 = sqlContext.createDataFrame(Seq(LabeledPoint(1.0, Vectors.dense(2.0, > 3.0 > df0.save("/tmp/df0") // works > val df1 = sqlContext.createDataFrame(df0.rdd, df0.schema) > df1.save("/tmp/df1") // error > {code} > throws > {code} > 15/04/01 23:24:16 INFO DAGScheduler: Job 3 failed: runJob at > newParquet.scala:686, took 0.288304 s > org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in > stage 3.0 failed 1 times, most recent failure: Lost task 3.0 in stage 3.0 > (TID 15, localhost): java.lang.ClassCastException: > org.apache.spark.mllib.linalg.DenseVector cannot be cast to > org.apache.spark.sql.Row > at > org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:191) > at > org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:182) > at > org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171) > at > org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134) > at > parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120) > at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81) > at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37) > at > org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:668) > at > org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:686) > at > org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:686) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:64) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:212) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392519#comment-14392519 ] Sudharma Puranik edited comment on SPARK-2243 at 4/2/15 10:36 AM: -- [~sowen] : My reply was for Jason where he mentioned about workaround. And yes I mean running logically distinct apps with logically distinct configurations only in terms of cores. Yes you are right in regard to running seperate process. Well , again, Spark per se, process is nothing but another {{SparkContext}}, which again has boundaries with cores and memory. :) I guess we are in unison in understanding the implications of multiple {{SparkContext}} on same JVM vs sharing multiple {{SparkContext}} across JVMs. was (Author: sudharma.pura...@gmail.com): [~sowen] : My reply was for Jason where he mentioned about workaround. And yes I mean running logically distinct apps with logically distinct configurations only in terms of cores. Yes you are right in regard to running seperate process. Well , again, Spark per se, process is nothing but another SparkContext, which again has boundaries with cores and memory. :) I guess we are in unison in understanding the implications of multiple SparkContexts on same JVM vs sharing multiple sparkContexts across JVMs. > Support multiple SparkContexts in the same JVM > -- > > Key: SPARK-2243 > URL: https://issues.apache.org/jira/browse/SPARK-2243 > Project: Spark > Issue Type: New Feature > Components: Block Manager, Spark Core >Affects Versions: 0.7.0, 1.0.0, 1.1.0 >Reporter: Miguel Angel Fernandez Diaz > > We're developing a platform where we create several Spark contexts for > carrying out different calculations. Is there any restriction when using > several Spark contexts? We have two contexts, one for Spark calculations and > another one for Spark Streaming jobs. 
The next error arises when we first > execute a Spark calculation and, once the execution is finished, a Spark > Streaming job is launched: > {code} > 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) > at > org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) > at > java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6
[jira] [Created] (SPARK-6678) select count(DISTINCT C_UID) from parquetdir may be can optimize
Littlestar created SPARK-6678: - Summary: select count(DISTINCT C_UID) from parquetdir may be can optimize Key: SPARK-6678 URL: https://issues.apache.org/jira/browse/SPARK-6678 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Littlestar Priority: Minor 2.2T of parquet files (5000 files total, 100 billion records, 2 billion unique C_UID). I ran the following SQL; maybe the RDD.collect step is very slow: select count(DISTINCT C_UID) from parquetdir Stack trace for the collect at SparkPlan.scala:83: org.apache.spark.rdd.RDD.collect(RDD.scala:813) org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83) org.apache.spark.sql.DataFrame.collect(DataFrame.scala:815) org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:178) org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) java.lang.reflect.Method.invoke(Method.java:606) org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) java.security.AccessController.doPrivileged(Native Method) javax.security.auth.Subject.doAs(Subject.java:415) org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) com.sun.proxy.$Proxy23.executeStatementAsync(Unknown Source) org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
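If an estimate of the distinct count is acceptable, one possible mitigation with the Spark 1.3 DataFrame API is an approximate distinct count instead of the exact one; a hedged sketch (e.g. in spark-shell, where {{sqlContext}} already exists; the path is a placeholder):
{code}
import org.apache.spark.sql.functions.{countDistinct, approxCountDistinct}

val df = sqlContext.parquetFile("/path/to/parquetdir")

// Exact count, equivalent to: SELECT COUNT(DISTINCT C_UID) FROM parquetdir
df.select(countDistinct(df("C_UID"))).show()

// Approximate count (HyperLogLog-based) with ~1% relative error; usually far
// cheaper than the exact distinct count over ~100 billion rows.
df.select(approxCountDistinct(df("C_UID"), 0.01)).show()
{code}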
[jira] [Updated] (SPARK-6679) java.lang.ClassNotFoundException on Mesos fine grained mode and input replication
[ https://issues.apache.org/jira/browse/SPARK-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ondřej Smola updated SPARK-6679: Description: Spark Streaming 1.3.0, Mesos 0.21.1 - Only when using fine grained mode and receiver input replication (StorageLevel.MEMORY_ONLY_2, StorageLevel.MEMORY_AND_DISK_2). When using coarse grained mode it works. When not using replication (StorageLevel.MEMORY_ONLY ...) it works. Error: 15/03/26 14:50:00 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() on RPC id 7178767328921933569 java.lang.ClassNotFoundException: org/apache/spark/storage/StorageLevel at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:344) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:65) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:88) at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:65) at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:124) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:97) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:91) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:44) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:745) was: Spark Streaming 1.3.0, Mesos 0.21.1 - When using fine grained mode and receiver input replication 
( StorageLevel.MEMORY_ONLY_2, StorageLevel.MEMORY_AND_DISK_2) following error is thrown. When using coarse grained mode it works. When not using replication (StorageLevel.MEMORY_ONLY ...) it works. Error: 15/03/26 14:50:00 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() on RPC id 7178767328921933569 java.lang.ClassNotFoundException: org/apache/spark/storage/StorageLevel at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:344) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:65) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerial
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392616#comment-14392616 ] sam commented on SPARK-2243: [~srowen] The real issue here and use case, is to be able to change configuration during execution without serialization concerns. It's common for different steps of a job to require different amounts of various caches, e.g. part one caches an RDD into memory to reduce cost of iterating over it, part two takes the result and does some kind of big shuffle. The second part needs a big shuffle memory fraction while the first part might need a large cache memory fraction. In fact I have jobs where I need 0.8 for a big shuffle, 0.8 for some caching, then set both to 0 for a part that requires a ton of heap. Currently the only work around is to write out the results of each part to disk (causing serialization faff), and wrap the application in a bash script to run it repeatedly with different configuration and parameters. Creating new SparkContexts would not necessarily solve this since as pointed out one would need to be able to share RDDs across contexts. > Support multiple SparkContexts in the same JVM > -- > > Key: SPARK-2243 > URL: https://issues.apache.org/jira/browse/SPARK-2243 > Project: Spark > Issue Type: New Feature > Components: Block Manager, Spark Core >Affects Versions: 0.7.0, 1.0.0, 1.1.0 >Reporter: Miguel Angel Fernandez Diaz > > We're developing a platform where we create several Spark contexts for > carrying out different calculations. Is there any restriction when using > several Spark contexts? We have two contexts, one for Spark calculations and > another one for Spark Streaming jobs. The next error arises when we first > execute a Spark calculation and, once the execution is finished, a Spark > Streaming job is launched: > {code} > 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) > at > org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) > at > 
java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss
[jira] [Assigned] (SPARK-4449) specify port range in spark
[ https://issues.apache.org/jira/browse/SPARK-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4449: --- Assignee: Apache Spark > specify port range in spark > --- > > Key: SPARK-4449 > URL: https://issues.apache.org/jira/browse/SPARK-4449 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Fei Wang >Assignee: Apache Spark >Priority: Minor > > In some cases, we need to specify the port range used in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
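For context, individual service ports can already be pinned through configuration in Spark 1.x, while this issue asks for a proper port range; an illustrative sketch with arbitrary values:
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("pinned-ports")
  .set("spark.driver.port", "40000")        // driver RPC port
  .set("spark.blockManager.port", "40010")  // block manager port
  .set("spark.port.maxRetries", "32")       // successive ports tried on bind failure
{code}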
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392646#comment-14392646 ] Chris Fregly commented on SPARK-6407: - from [~mengxr] "The online update should be implemented with GraphX or indexedrdd, which may take some time. There is no open-source solution. Try doing a survey on existing algorithms for online matrix factorization updates." > Streaming ALS for Collaborative Filtering > - > > Key: SPARK-6407 > URL: https://issues.apache.org/jira/browse/SPARK-6407 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Felix Cheung >Priority: Minor > > Like MLLib's ALS implementation for recommendation, and applying to streaming. > Similar to streaming linear regression, logistic regression, could we apply > gradient updates to batches of data and reuse existing MLLib implementation? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6679) java.lang.ClassNotFoundException on Mesos fine grained mode and input replication
Ondřej Smola created SPARK-6679: --- Summary: java.lang.ClassNotFoundException on Mesos fine grained mode and input replication Key: SPARK-6679 URL: https://issues.apache.org/jira/browse/SPARK-6679 Project: Spark Issue Type: Bug Components: Mesos, Streaming Affects Versions: 1.3.0 Reporter: Ondřej Smola Spark Streaming 1.3.0, Mesos 0.21.1 - When using fine grained mode and receiver input replication ( StorageLevel.MEMORY_ONLY_2, StorageLevel.MEMORY_AND_DISK_2) following error is thrown. When using coarse grained mode it works. When not using replication (StorageLevel.MEMORY_ONLY ...) it works. Error: 15/03/26 14:50:00 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() on RPC id 7178767328921933569 java.lang.ClassNotFoundException: org/apache/spark/storage/StorageLevel at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:344) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:65) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:88) at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:65) at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:124) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:97) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:91) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:44) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
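A minimal Spark Streaming setup matching the reported combination (master URL, host and port are placeholders; fine-grained mode is Mesos' default unless coarse-grained mode is enabled):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReplicatedReceiverRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("replicated-receiver-repro")
      .setMaster("mesos://zk://zk-host:2181/mesos")
    val ssc = new StreamingContext(conf, Seconds(10))

    // The replicated storage level (*_2) is what triggers the reported error;
    // StorageLevel.MEMORY_ONLY reportedly works.
    val lines = ssc.socketTextStream("some-host", 9999, StorageLevel.MEMORY_AND_DISK_2)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}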
[jira] [Assigned] (SPARK-4449) specify port range in spark
[ https://issues.apache.org/jira/browse/SPARK-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4449: --- Assignee: (was: Apache Spark) > specify port range in spark > --- > > Key: SPARK-4449 > URL: https://issues.apache.org/jira/browse/SPARK-4449 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Fei Wang >Priority: Minor > > In some cases, we need to specify the port range used in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392721#comment-14392721 ] Sean Owen commented on SPARK-2243: -- [~sams] in this particular case, can you simply set both of these limits to the maximum that you need them to be, like 0.8 in both cases? It doesn't mean you can suddenly use 160% of memory of course. But these are just upper limits on what the cache/shuffle will consume, and not _also_ inversely limits on how much other stuff can use. That is, 80% cache doesn't stop tasks or shuffle from using 100% of all memory. Yes, that means you have to manage the steps of your job carefully so that the memory hungry steps don't overlap. But you're already doing that. Of course you could also throw more resource / memory at this too, but that costs £, but then again so does your time. If you're able to use the YARN dynamic executor scaling, maybe that's mitigated to some degree (although I think it triggers based on active tasks, not memory usage; it would save resource when little of anything is running). Back to the general point: it seems like one motivation is reconfiguration, which is not quite the same as making multiple contexts; it should be easier. > Support multiple SparkContexts in the same JVM > -- > > Key: SPARK-2243 > URL: https://issues.apache.org/jira/browse/SPARK-2243 > Project: Spark > Issue Type: New Feature > Components: Block Manager, Spark Core >Affects Versions: 0.7.0, 1.0.0, 1.1.0 >Reporter: Miguel Angel Fernandez Diaz > > We're developing a platform where we create several Spark contexts for > carrying out different calculations. Is there any restriction when using > several Spark contexts? We have two contexts, one for Spark calculations and > another one for Spark Streaming jobs. 
The next error arises when we first > execute a Spark calculation and, once the execution is finished, a Spark > Streaming job is launched: > {code} > 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) > at > org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) > at > java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 14/06/23 16:40:08 WARN s
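A small sketch of the configuration being discussed above, assuming the limits in question are {{spark.storage.memoryFraction}} and {{spark.shuffle.memoryFraction}}: the two fractions are upper bounds on cache and shuffle usage, not reservations, so they may legitimately sum to more than 1.0. The values are illustrative rather than recommendations.
{code}
import org.apache.spark.SparkConf

// Caps, not reservations: storage may use up to 80% of the heap and
// shuffle may use up to 80%, but neither amount is set aside up front.
val conf = new SparkConf()
  .set("spark.storage.memoryFraction", "0.8")
  .set("spark.shuffle.memoryFraction", "0.8")
{code}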
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392742#comment-14392742 ] Sean Owen commented on SPARK-6407: -- ALS doesn't use gradient descent, at least not in the sense that these streaming linear models do, so you couldn't simply reuse that implementation. I am accustomed to fold-in for approximate streaming updates to an ALS model, but yes, it does need to mutate some RDD-based data structure efficiently, like an IndexedRDD. Although the idea is simple, I also don't know of good theoretical approaches and have just made up reasonable heuristics in the past. > Streaming ALS for Collaborative Filtering > - > > Key: SPARK-6407 > URL: https://issues.apache.org/jira/browse/SPARK-6407 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Felix Cheung >Priority: Minor > > Like MLlib's ALS implementation for recommendation, but applied to streaming. > Similar to streaming linear regression and logistic regression, could we apply > gradient updates to batches of data and reuse the existing MLlib implementation? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
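For readers unfamiliar with the fold-in idea mentioned in the comment, a rough sketch under simple assumptions: given fixed item factors, a new or updated user's factor vector can be approximated with one regularized least-squares solve instead of re-running ALS. This uses Breeze for illustration and is not part of MLlib; the function name and regularization constant are made up.
{code}
import breeze.linalg.{DenseMatrix, DenseVector}

// Fold-in: solve (V'V + lambda*I) u = V'r for one user's factor vector u,
// where V holds the factors of the items that user rated and r holds the
// corresponding ratings. Item factors are left untouched.
def foldInUser(itemFactors: DenseMatrix[Double],  // numRatedItems x rank
               ratings: DenseVector[Double],      // numRatedItems
               lambda: Double = 0.1): DenseVector[Double] = {
  val rank = itemFactors.cols
  val gram = itemFactors.t * itemFactors
  for (i <- 0 until rank) gram(i, i) += lambda    // ridge regularization
  gram \ (itemFactors.t * ratings)                // least-squares solve
}
{code}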
[jira] [Created] (SPARK-6680) Be able to specify IP for spark-shell (spark driver); blocker for Docker integration
Egor Pakhomov created SPARK-6680: Summary: Be able to specify IP for spark-shell (spark driver); blocker for Docker integration Key: SPARK-6680 URL: https://issues.apache.org/jira/browse/SPARK-6680 Project: Spark Issue Type: New Feature Components: Deploy Affects Versions: 1.3.0 Environment: Docker. Reporter: Egor Pakhomov Priority: Blocker Suppose I have 3 Docker containers - spark_master, spark_worker and spark_shell. In Docker, the public IP of a container is represented inside it by an alias like "fgsdfg454534", which is only visible within that container. When Spark uses this alias for communication, the other containers receive it and don't know what to do with it. That's why I used SPARK_LOCAL_IP for the master and worker. But it doesn't work for the Spark driver (tried with spark-shell; I haven't tried other types of drivers). The Spark driver advertises the "fgsdfg454534" alias about itself to everyone, and then nobody can address it. I've worked around it in https://github.com/epahomov/docker-spark, but it would be better if this were solved at the Spark code level. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
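Until something like this is handled in Spark itself, the manual version of the request is to tell the driver which address to advertise. A sketch, where the IP is a placeholder for whatever address is actually routable from the other containers:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Force the driver to advertise a routable address instead of the
// container-internal hostname alias. 172.17.0.10 is a placeholder.
val conf = new SparkConf()
  .setAppName("docker-driver")
  .set("spark.driver.host", "172.17.0.10")
val sc = new SparkContext(conf)
{code}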
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392780#comment-14392780 ] sam commented on SPARK-2243: // sam in this particular case, can you simply set both of these limits to the maximum that you need them to be, like 0.8 in both cases? It doesn't mean you can suddenly use 160% of memory of course. But these are just upper limits on what the cache/shuffle will consume, and not also inversely limits on how much other stuff can use. That is, 80% cache doesn't stop tasks or shuffle from using 100% of all memory. // Wat?! Man that *really* needs to be made clear in the documentation, it currently says "Fraction of Java heap to use" not "Fraction of Java heap to limit use by". Plus an extra sentence somewhere saying "these need not add up to a number less than 1.0" would clarify. In that case I cannot think of a use case for multiple contexts and I'll withdraw my vote since I do not currently have a use case for changing the other configuration settings on the fly (they are mainly timeouts and such and such). > Support multiple SparkContexts in the same JVM > -- > > Key: SPARK-2243 > URL: https://issues.apache.org/jira/browse/SPARK-2243 > Project: Spark > Issue Type: New Feature > Components: Block Manager, Spark Core >Affects Versions: 0.7.0, 1.0.0, 1.1.0 >Reporter: Miguel Angel Fernandez Diaz > > We're developing a platform where we create several Spark contexts for > carrying out different calculations. Is there any restriction when using > several Spark contexts? We have two contexts, one for Spark calculations and > another one for Spark Streaming jobs. The next error arises when we first > execute a Spark calculation and, once the execution is finished, a Spark > Streaming job is launched: > {code} > 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) > at > org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) > at > java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) > at > 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to > java.io.FileNotFoundException >
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392798#comment-14392798 ] sam commented on SPARK-2243: What I would suggest though, is putting an `assert`/`requires` somewhere to throw a meaningful exception when two SparkContexts are created. The current ST is rather baffling. IME it seems FNF exceptions are the catch all exceptions of Spark - pretty much every exception can cause a FNF exception. > Support multiple SparkContexts in the same JVM > -- > > Key: SPARK-2243 > URL: https://issues.apache.org/jira/browse/SPARK-2243 > Project: Spark > Issue Type: New Feature > Components: Block Manager, Spark Core >Affects Versions: 0.7.0, 1.0.0, 1.1.0 >Reporter: Miguel Angel Fernandez Diaz > > We're developing a platform where we create several Spark contexts for > carrying out different calculations. Is there any restriction when using > several Spark contexts? We have two contexts, one for Spark calculations and > another one for Spark Streaming jobs. The next error arises when we first > execute a Spark calculation and, once the execution is finished, a Spark > Streaming job is launched: > {code} > 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) > at > org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) > at > java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) > at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to > java.io.FileNotFoundException > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.ref
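A minimal sketch of the guard being suggested, written as application-side code rather than a patch to Spark; the object name and message are invented for illustration.
{code}
import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.{SparkConf, SparkContext}

// Fail fast with a clear message instead of the baffling downstream
// FileNotFoundException when a second context is created by mistake.
object SingleSparkContext {
  private val active = new AtomicReference[SparkContext]()

  def create(conf: SparkConf): SparkContext = {
    require(active.get() == null,
      "A SparkContext is already running in this JVM; stop() it before creating another.")
    val sc = new SparkContext(conf)
    active.set(sc)
    sc
  }
}
{code}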
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392804#comment-14392804 ] Jason Hubbard commented on SPARK-2243: -- I don't think programmatically spinning up JVMs is particularly easy to get correct either, nor do I think it is an elegant solution though I'm sure others might disagree. In this case you will have to pay attention to may different things, including resource over utilization, handling environment variables, trying to message or share resources between JVMs, and JVM forking has had many issues in the past. I guess I get a bit confused on the use of SparkContext anyway, if the intent isn't to have multiple, then why not force it to be a singleton and possibly a different name that would imply only a single instance should exist. > Support multiple SparkContexts in the same JVM > -- > > Key: SPARK-2243 > URL: https://issues.apache.org/jira/browse/SPARK-2243 > Project: Spark > Issue Type: New Feature > Components: Block Manager, Spark Core >Affects Versions: 0.7.0, 1.0.0, 1.1.0 >Reporter: Miguel Angel Fernandez Diaz > > We're developing a platform where we create several Spark contexts for > carrying out different calculations. Is there any restriction when using > several Spark contexts? We have two contexts, one for Spark calculations and > another one for Spark Streaming jobs. The next error arises when we first > execute a Spark calculation and, once the execution is finished, a Spark > Streaming job is launched: > {code} > 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) > at > org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) > at > java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to > java.io.FileNotFoundException > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at >
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392848#comment-14392848 ] sam commented on SPARK-2243: Yup, a singleton would make sense, it's creation is side effecting, so one might as well have a method "setSparkContextConf(conf: SparkConf)" for setting the conf rather than creating a new SparkContext. There is no point in using an pure FP pattern when the domain isn't pure. > Support multiple SparkContexts in the same JVM > -- > > Key: SPARK-2243 > URL: https://issues.apache.org/jira/browse/SPARK-2243 > Project: Spark > Issue Type: New Feature > Components: Block Manager, Spark Core >Affects Versions: 0.7.0, 1.0.0, 1.1.0 >Reporter: Miguel Angel Fernandez Diaz > > We're developing a platform where we create several Spark contexts for > carrying out different calculations. Is there any restriction when using > several Spark contexts? We have two contexts, one for Spark calculations and > another one for Spark Streaming jobs. The next error arises when we first > execute a Spark calculation and, once the execution is finished, a Spark > Streaming job is launched: > {code} > 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) > at > org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) > at > java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) > at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to > java.io.FileNotFoundException > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Me
[jira] [Comment Edited] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392848#comment-14392848 ] sam edited comment on SPARK-2243 at 4/2/15 3:37 PM: Yup, a singleton would make sense, it's creation is side effecting, so one might as well have a method "initSparkConf(conf: SparkConf)" for initialization and setting the conf rather than creating a new SparkContext. There is no point in using an pure FP pattern when the domain isn't pure. was (Author: sams): Yup, a singleton would make sense, it's creation is side effecting, so one might as well have a method "setSparkContextConf(conf: SparkConf)" for setting the conf rather than creating a new SparkContext. There is no point in using an pure FP pattern when the domain isn't pure. > Support multiple SparkContexts in the same JVM > -- > > Key: SPARK-2243 > URL: https://issues.apache.org/jira/browse/SPARK-2243 > Project: Spark > Issue Type: New Feature > Components: Block Manager, Spark Core >Affects Versions: 0.7.0, 1.0.0, 1.1.0 >Reporter: Miguel Angel Fernandez Diaz > > We're developing a platform where we create several Spark contexts for > carrying out different calculations. Is there any restriction when using > several Spark contexts? We have two contexts, one for Spark calculations and > another one for Spark Streaming jobs. The next error arises when we first > execute a Spark calculation and, once the execution is finished, a Spark > Streaming job is launched: > {code} > 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) > at > org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) > at > java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > 
at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to > java.io.FileNotFoundException > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.bro
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392860#comment-14392860 ] Sean Owen commented on SPARK-2243: -- [~sams] the {{SparkContext}} constructor will throw an exception in 1.3 if you try to instantiate a second one. You can turn it off with {{spark.driver.allowMultipleContexts=true}}. Which doesn't make it 100% work but doesn't forbid it outright. I wouldn't recommend disabling it and it's off by default. I think you'd be welcome to suggest a minor doc change. I'm 90% sure I'm right on this. Or else, we'd never see JVMs running out of memory unless the cache was quite full? [~jahubba] I bet that in retrospect it would have been better to make the {{SparkContext}} uninstantiable and access it only via a factory method. Too late for that now. Some of what you're doing sounds like it would be the same even with simultaneous contexts -- you're still setting two sets of config, you're still thinking about resources. I think the forced decoupling makes things harder. I suppose decoupling has its upsides but costs in runtime complexity. That said there's also some classes of use cases that might do well to share both a JVM and {{SparkContext}}. These aren't 100% of use cases but I hope there's good news for some who find this issue and previously thought it was a blocker. > Support multiple SparkContexts in the same JVM > -- > > Key: SPARK-2243 > URL: https://issues.apache.org/jira/browse/SPARK-2243 > Project: Spark > Issue Type: New Feature > Components: Block Manager, Spark Core >Affects Versions: 0.7.0, 1.0.0, 1.1.0 >Reporter: Miguel Angel Fernandez Diaz > > We're developing a platform where we create several Spark contexts for > carrying out different calculations. Is there any restriction when using > several Spark contexts? We have two contexts, one for Spark calculations and > another one for Spark Streaming jobs. 
The next error arises when we first > execute a Spark calculation and, once the execution is finished, a Spark > Streaming job is launched: > {code} > 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) > at > org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) > at > java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolE
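To make the 1.3 behaviour described above concrete, a sketch of the flag in question; this only disables the constructor check and is not a recommendation to run multiple contexts.
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc1 = new SparkContext(new SparkConf().setAppName("first").setMaster("local[2]"))

// By default the second construction below throws in Spark 1.3.
// The flag suppresses the check; it does not make the configuration supported.
val conf2 = new SparkConf()
  .setAppName("second")
  .setMaster("local[2]")
  .set("spark.driver.allowMultipleContexts", "true")
val sc2 = new SparkContext(conf2)
{code}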
[jira] [Commented] (SPARK-5452) We are migrating Tera Data SQL to Spark SQL. Query is taking long time. Please have a look on this issue
[ https://issues.apache.org/jira/browse/SPARK-5452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392901#comment-14392901 ] Nicholas Chammas commented on SPARK-5452: - Yeah, as Sean said, this post as it stands is not suitable for JIRA. You open a JIRA when you've identified a specific problem that you can demonstrate is specific to Spark. Posting a large amount of code and configuration information like this doesn't tell anyone that there is a known problem, and no-one is going to dig for you to tell whether the performance issue is specific to Spark or caused by something else in your environment. In other words, this is not a "Spark support" site. JIRA is for tracking specific bugs or new features, and this post describes neither. > We are migrating Tera Data SQL to Spark SQL. Query is taking long time. > Please have a look on this issue > > > Key: SPARK-5452 > URL: https://issues.apache.org/jira/browse/SPARK-5452 > Project: Spark > Issue Type: Test > Components: Spark Shell >Affects Versions: 1.2.0 >Reporter: irfan > Labels: SparkSql > > Hi Team, > we are migrating TeraData SQL to Spark SQL because of complexity we have > spilted into below 4 sub-quries > and we are running through hive context > > val HIVETMP1 = hc.sql("SELECT PARTY_ACCOUNT_ID AS > PARTY_ACCOUNT_ID,LMS_ACCOUNT_ID AS LMS_ACCOUNT_ID FROM VW_PARTY_ACCOUNT WHERE > PARTY_ACCOUNT_TYPE_CODE IN('04') AND LMS_ACCOUNT_ID IS NOT NULL") > HIVETMP1.registerTempTable("VW_HIVETMP1") > val HIVETMP2 = hc.sql("SELECT PACCNT.LMS_ACCOUNT_ID AS LMS_ACCOUNT_ID, > 'NULL' AS RANDOM_PARTY_ACCOUNT_ID ,'NULL' AS MOST_RECENT_SPEND_LA > ,STXN.PARTY_ACCOUNT_ID AS MAX_SPEND_12WKS_LA ,STXN.MAX_SPEND_12WKS_LADATE > AS MAX_SPEND_12WKS_LADATE FROM VW_HIVETMP1 AS PACCNT INNER JOIN (SELECT > STXTMP.PARTY_ACCOUNT_ID AS PARTY_ACCOUNT_ID, SUM(CASE WHEN > (CAST(STXTMP.TRANSACTION_DATE AS DATE ) > > DATE_SUB(CAST(CONCAT(SUBSTRING(SYSTMP.OPTION_VAL,1,4),'-',SUBSTRING(SYSTMP.OPTION_VAL,5,2),'-',SUBSTRING(SYSTMP.OPTION_VAL,7,2)) > AS DATE),84)) THEN STXTMP.TRANSACTION_VALUE ELSE 0.00 END) AS > MAX_SPEND_12WKS_LADATE FROM VW_SHOPPING_TRANSACTION_TABLE AS STXTMP INNER > JOIN SYSTEM_OPTION_TABLE AS SYSTMP ON STXTMP.FLAG == SYSTMP.FLAG AND > SYSTMP.OPTION_NAME = 'RID' AND STXTMP.PARTY_ACCOUNT_TYPE_CODE IN('04') GROUP > BY STXTMP.PARTY_ACCOUNT_ID) AS STXN ON PACCNT.PARTY_ACCOUNT_ID = > STXN.PARTY_ACCOUNT_ID WHERE STXN.MAX_SPEND_12WKS_LADATE IS NOT NULL") > HIVETMP2.registerTempTable("VW_HIVETMP2") > val HIVETMP3 = hc.sql("SELECT LMS_ACCOUNT_ID,MAX(MAX_SPEND_12WKS_LA) AS > MAX_SPEND_12WKS_LA, 1 AS RANK FROM VW_HIVETMP2 GROUP BY LMS_ACCOUNT_ID") > HIVETMP3.registerTempTable("VW_HIVETMP3") > val HIVETMP4 = hc.sql(" SELECT PACCNT.LMS_ACCOUNT_ID,'NULL' AS > RANDOM_PARTY_ACCOUNT_ID ,'NULL' AS > MOST_RECENT_SPEND_LA,STXN.MAX_SPEND_12WKS_LA AS MAX_SPEND_12WKS_LA,1 AS RANK1 > FROM VW_HIVETMP2 AS PACCNT INNER JOIN VW_HIVETMP3 AS STXN ON > PACCNT.LMS_ACCOUNT_ID = STXN.LMS_ACCOUNT_ID AND PACCNT.MAX_SPEND_12WKS_LA = > STXN.MAX_SPEND_12WKS_LA") > HIVETMP4.registerTempTable("WT03_ACCOUNT_BHVR3") > HIVETMP4.saveAsTextFile("hdfs:/file/") > == > This query has two Group By clauses which are running on huge files(19.5GB). > And the query took 40min to get the final result. Is there any changes > required in run time environment or Configuration Setting in Spark which can > improve the query performance. > below are our Environment and configuration details: > Environment details: > No of nodes:4 > capacity on each node:62 GB RAM on each node. 
> Storage capacity :9TB on each node > total cores :48 > Spark Configuration: > > .set("spark.default.parallelism","64") > .set("spark.driver.maxResultSize","2G") > .set("spark.driver.memory","10g") > .set("spark.rdd.compress","true") > .set("spark.shuffle.spill.compress","true") > .set("spark.shuffle.compress","true") > .set("spark.shuffle.consolidateFiles","true/false") > .set("spark.shuffle.spill","true/false") > > Data file size : > SHOPPING_TRANSACTION 19.5GB > PARTY_ACCOUNT1.4GB > SYSTEM_OPTIONS 11.6K > please help us to resolve above issue. > Thanks, -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings
[ https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392934#comment-14392934 ] Sean Owen commented on SPARK-6569: -- [~minisaw] are you going to submit a PR to reduce the log level? Cody seems OK with it. > Kafka directInputStream logs what appear to be incorrect warnings > - > > Key: SPARK-6569 > URL: https://issues.apache.org/jira/browse/SPARK-6569 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.0 > Environment: Spark 1.3.0 >Reporter: Platon Potapov >Priority: Minor > > During what appears to be normal operation of streaming from a Kafka topic, > the following log records are observed, logged periodically: > {code} > [Stage 391:==> (3 + 0) / > 4] > 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the > same as ending offset skipping raw 0 > 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the > same as ending offset skipping raw 0 > 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the > same as ending offset skipping raw 0 > {code} > * the part.fromOffset placeholder is not correctly substituted to a value > * is the condition really mandates a warning being logged? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
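The literal ${part.fromOffset} in the log output suggests a plain string literal where an interpolated string was intended. A generic illustration of that class of bug, not the actual KafkaRDD source:
{code}
val fromOffset = 42L

// Missing `s` prefix: the placeholder is printed literally.
println("Beginning offset ${fromOffset} is the same as ending offset")

// With the interpolator the value is substituted as intended.
println(s"Beginning offset ${fromOffset} is the same as ending offset")
{code}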
[jira] [Assigned] (SPARK-6209) ExecutorClassLoader can leak connections after failing to load classes from the REPL class server
[ https://issues.apache.org/jira/browse/SPARK-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6209: --- Assignee: Apache Spark (was: Josh Rosen) > ExecutorClassLoader can leak connections after failing to load classes from > the REPL class server > - > > Key: SPARK-6209 > URL: https://issues.apache.org/jira/browse/SPARK-6209 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0, 1.0.3, 1.1.2, 1.2.1, 1.3.0, 1.4.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Critical > Fix For: 1.3.1, 1.4.0 > > > ExecutorClassLoader does not ensure proper cleanup of network connections > that it opens. If it fails to load a class, it may leak partially-consumed > InputStreams that are connected to the REPL's HTTP class server, causing that > server to exhaust its thread pool, which can cause the entire job to hang. > Here is a simple reproduction: > With > {code} > ./bin/spark-shell --master local-cluster[8,8,512] > {code} > run the following command: > {code} > sc.parallelize(1 to 1000, 1000).map { x => > try { > Class.forName("some.class.that.does.not.Exist") > } catch { > case e: Exception => // do nothing > } > x > }.count() > {code} > This job will run 253 tasks, then will completely freeze without any errors > or failed tasks. > It looks like the driver has 253 threads blocked in socketRead0() calls: > {code} > [joshrosen ~]$ jstack 16765 | grep socketRead0 | wc > 253 759 14674 > {code} > e.g. > {code} > "qtp1287429402-13" daemon prio=5 tid=0x7f868a1c nid=0x5b03 runnable > [0x0001159bd000] >java.lang.Thread.State: RUNNABLE > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.read(SocketInputStream.java:152) > at java.net.SocketInputStream.read(SocketInputStream.java:122) > at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391) > at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141) > at > org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227) > at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1044) > at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280) > at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) > at > org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) > at > org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) > at java.lang.Thread.run(Thread.java:745) > {code} > Jstack on the executors shows blocking in loadClass / findClass, where a > single thread is RUNNABLE and waiting to hear back from the driver and other > executor threads are BLOCKED on object monitor synchronization at > Class.forName0(). > Remotely triggering a GC on a hanging executor allows the job to progress and > complete more tasks before hanging again. 
If I repeatedly trigger GC on all > of the executors, then the job runs to completion: > {code} > jps | grep CoarseGra | cut -d ' ' -f 1 | xargs -I {} -n 1 -P100 jcmd {} GC.run > {code} > The culprit is a {{catch}} block that ignores all exceptions and performs no > cleanup: > https://github.com/apache/spark/blob/v1.2.0/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L94 > This bug has been present since Spark 1.0.0, but I suspect that we haven't > seen it before because it's pretty hard to reproduce. Triggering this error > requires a job with tasks that trigger ClassNotFoundExceptions yet are still > able to run to completion. It also requires that executors are able to leak > enough open connections to exhaust the class server's Jetty thread pool > limit, which requires that there are a large number of tasks (253+) and > either a large number of executors or a very low amount of GC pressure on > those executors (since GC will cause the leaked connections to be closed). > The fix here is pretty simple: add proper resource cleanup to this class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
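A sketch of the kind of cleanup the last paragraph calls for, assuming the failing path holds an open, connection-backed InputStream; this is an illustration of the pattern, not the actual patch.
{code}
import java.io.InputStream

// Always close the stream, even when reading the class bytes fails,
// so the class server's worker threads are released promptly.
def readAndClose(open: () => InputStream)(read: InputStream => Array[Byte]): Option[Array[Byte]] = {
  val in = open()
  try {
    Some(read(in))
  } catch {
    case _: Exception => None  // treat as class-not-found; do not leak the connection
  } finally {
    in.close()
  }
}
{code}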
[jira] [Assigned] (SPARK-6209) ExecutorClassLoader can leak connections after failing to load classes from the REPL class server
[ https://issues.apache.org/jira/browse/SPARK-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6209: --- Assignee: Josh Rosen (was: Apache Spark) > ExecutorClassLoader can leak connections after failing to load classes from > the REPL class server > - > > Key: SPARK-6209 > URL: https://issues.apache.org/jira/browse/SPARK-6209 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0, 1.0.3, 1.1.2, 1.2.1, 1.3.0, 1.4.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.3.1, 1.4.0 > > > ExecutorClassLoader does not ensure proper cleanup of network connections > that it opens. If it fails to load a class, it may leak partially-consumed > InputStreams that are connected to the REPL's HTTP class server, causing that > server to exhaust its thread pool, which can cause the entire job to hang. > Here is a simple reproduction: > With > {code} > ./bin/spark-shell --master local-cluster[8,8,512] > {code} > run the following command: > {code} > sc.parallelize(1 to 1000, 1000).map { x => > try { > Class.forName("some.class.that.does.not.Exist") > } catch { > case e: Exception => // do nothing > } > x > }.count() > {code} > This job will run 253 tasks, then will completely freeze without any errors > or failed tasks. > It looks like the driver has 253 threads blocked in socketRead0() calls: > {code} > [joshrosen ~]$ jstack 16765 | grep socketRead0 | wc > 253 759 14674 > {code} > e.g. > {code} > "qtp1287429402-13" daemon prio=5 tid=0x7f868a1c nid=0x5b03 runnable > [0x0001159bd000] >java.lang.Thread.State: RUNNABLE > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.read(SocketInputStream.java:152) > at java.net.SocketInputStream.read(SocketInputStream.java:122) > at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391) > at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141) > at > org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227) > at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1044) > at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280) > at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) > at > org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) > at > org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) > at java.lang.Thread.run(Thread.java:745) > {code} > Jstack on the executors shows blocking in loadClass / findClass, where a > single thread is RUNNABLE and waiting to hear back from the driver and other > executor threads are BLOCKED on object monitor synchronization at > Class.forName0(). > Remotely triggering a GC on a hanging executor allows the job to progress and > complete more tasks before hanging again. 
If I repeatedly trigger GC on all > of the executors, then the job runs to completion: > {code} > jps | grep CoarseGra | cut -d ' ' -f 1 | xargs -I {} -n 1 -P100 jcmd {} GC.run > {code} > The culprit is a {{catch}} block that ignores all exceptions and performs no > cleanup: > https://github.com/apache/spark/blob/v1.2.0/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L94 > This bug has been present since Spark 1.0.0, but I suspect that we haven't > seen it before because it's pretty hard to reproduce. Triggering this error > requires a job with tasks that trigger ClassNotFoundExceptions yet are still > able to run to completion. It also requires that executors are able to leak > enough open connections to exhaust the class server's Jetty thread pool > limit, which requires that there are a large number of tasks (253+) and > either a large number of executors or a very low amount of GC pressure on > those executors (since GC will cause the leaked connections to be closed). > The fix here is pretty simple: add proper resource cleanup to this class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6662) Allow variable substitution in spark.yarn.historyServer.address
[ https://issues.apache.org/jira/browse/SPARK-6662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392963#comment-14392963 ] Cheolsoo Park commented on SPARK-6662: -- [~srowen], thank you for your comment. {quote} Wouldn't you be able to query for the YARN RM address somewhere and include it in the config? {quote} In a typical cloud deployment, there is usually a shared gateway from which users connect to various clusters, and a few Spark configs are shared by all the clusters. Furthermore, clusters are usually transient in the cloud, so I'd like to avoid adding any cluster-specific information to Spark configs. My current workaround is grep'ing {{yarn.resourcemanager.hostname}} from yarn-site.xml in my custom job launch script on the gateway and passing it via the {{--conf}} option in every job launch. The intention was to get rid of this hacky bit in my launch script. {quote} I am somewhat concerned about adding a narrow bit of support for one particular substitution, which in turn is to support a specific assumption in one type of deployment. {quote} Yes, I understand your concern. Even though I have a specific problem to solve at hand, I filed this JIRA hoping that general variable substitution will be added to Spark config. In fact, I made an attempt in that direction but quickly ran into the following problems: # Adding general vars sub to Spark conf doesn't solve my problem. Since Spark config and Yarn config are separate entities in Spark, I cannot cross-refer to properties from one to the other. # Alternatively, I could introduce special logic for {{spark.yarn.historyServer.address}} assuming the RM and HS are on the same node. Since the Spark AM already knows the RM address, it is trivial to implement. But this makes an even more specific assumption about the deployment. It looks to me like it involves quite a bit of refactoring to implement general vars sub that allows cross-referring. So I compromised. That is, I introduced vars sub only for the {{spark.yarn.}} properties. In fact, vars sub already works for {{spark.hadoop.}} properties. If you look at the code, all the {{spark.hadoop.}} properties are already copied over to the Yarn config and read via the Yarn config. As a side effect, they support vars sub. I am just expanding the scope of this *secret* feature to the {{spark.yarn.}} properties. For now, I can live with my current workaround. But I wanted to point out that it is not user-friendly to ask users to pass an explicit hostname and port number to make use of the HS. In fact, I'm not aware of any other property that causes the same pain in YARN mode. For example, the RM address for {{spark.master}} is dynamically picked up from yarn-site.xml. The HS address should be handled in a similar manner IMO. Hope this explains my thought process well enough. > Allow variable substitution in spark.yarn.historyServer.address > --- > > Key: SPARK-6662 > URL: https://issues.apache.org/jira/browse/SPARK-6662 > Project: Spark > Issue Type: Wish > Components: YARN >Affects Versions: 1.3.0 >Reporter: Cheolsoo Park >Priority: Minor > Labels: yarn > > In Spark on YARN, an explicit hostname and port number need to be set for > "spark.yarn.historyServer.address" in SparkConf to make the HISTORY link work. If > the history server address is known and static, this is usually not a problem. > But in the cloud, that is usually not true. Particularly in EMR, the history > server always runs on the same node as the RM.
So I could simply set it to > {{$\{yarn.resourcemanager.hostname\}:18080}} if variable substitution is > allowed. > In fact, Hadoop configuration already implements variable substitution, so if > this property is read via YarnConf, this can easily be achieved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
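For anyone needing the described workaround before a proper fix exists, a sketch of deriving the address at launch time from the YARN configuration; it assumes yarn-site.xml is on the launcher's classpath and that the history server shares the RM host and listens on 18080, as in the EMR layout discussed above.
{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.spark.SparkConf

// Look up the RM host from the Hadoop/YARN configuration and point the
// history server address at it (port 18080 assumed).
val rmHost = new YarnConfiguration().get("yarn.resourcemanager.hostname", "localhost")
val conf = new SparkConf()
  .set("spark.yarn.historyServer.address", s"$rmHost:18080")
{code}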
[jira] [Commented] (SPARK-6506) python support yarn cluster mode requires SPARK_HOME to be set
[ https://issues.apache.org/jira/browse/SPARK-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392971#comment-14392971 ] Marcelo Vanzin commented on SPARK-6506: --- Maybe you're running into SPARK-5808? > python support yarn cluster mode requires SPARK_HOME to be set > -- > > Key: SPARK-6506 > URL: https://issues.apache.org/jira/browse/SPARK-6506 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.3.0 >Reporter: Thomas Graves > > We added support for python running in yarn cluster mode in > https://issues.apache.org/jira/browse/SPARK-5173, but it requires that > SPARK_HOME be set in the environment variables for application master and > executor. It doesn't have to be set to anything real but it fails if its not > set. See the command at the end of: https://github.com/apache/spark/pull/3976 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-765) Test suite should run Spark example programs
[ https://issues.apache.org/jira/browse/SPARK-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392975#comment-14392975 ] Josh Rosen commented on SPARK-765: -- Hi [~yuu.ishik...@gmail.com], It should be fine to add a {{test}}-scoped ScalaTest dependency to the examples subproject. > Test suite should run Spark example programs > > > Key: SPARK-765 > URL: https://issues.apache.org/jira/browse/SPARK-765 > Project: Spark > Issue Type: New Feature > Components: Examples >Reporter: Josh Rosen > > The Spark test suite should also run each of the Spark example programs (the > PySpark suite should do the same). This should be done through a shell > script or other mechanism to simulate the environment setup used by end users > that run those scripts. > This would prevent problems like SPARK-764 from making it into releases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
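For illustration, this is how a test-scoped ScalaTest dependency would look in the examples subproject's sbt definition; the version number is only an example, and the Maven POM equivalent simply uses a test scope.
{code}
// Visible to src/test only, so it is not pulled into the packaged examples.
libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.4" % "test"
{code}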
[jira] [Commented] (SPARK-6506) python support yarn cluster mode requires SPARK_HOME to be set
[ https://issues.apache.org/jira/browse/SPARK-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392978#comment-14392978 ] Thomas Graves commented on SPARK-6506: -- No, it was built with Maven and the PySpark artifacts are there. You just have to set SPARK_HOME to something, even though it isn't used for anything. Something is looking at SPARK_HOME and blows up if it's not set. > python support yarn cluster mode requires SPARK_HOME to be set > -- > > Key: SPARK-6506 > URL: https://issues.apache.org/jira/browse/SPARK-6506 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.3.0 >Reporter: Thomas Graves > > We added support for python running in yarn cluster mode in > https://issues.apache.org/jira/browse/SPARK-5173, but it requires that > SPARK_HOME be set in the environment variables for application master and > executor. It doesn't have to be set to anything real but it fails if its not > set. See the command at the end of: https://github.com/apache/spark/pull/3976 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
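Until the lookup is removed, a sketch of the workaround implied above: inject a dummy SPARK_HOME into the application master and executor environments through the documented per-variable environment properties. The value is deliberately meaningless, since nothing real is read from it.
{code}
import org.apache.spark.SparkConf

// Placeholder value: the path is never used, it just has to be present.
val conf = new SparkConf()
  .set("spark.yarn.appMasterEnv.SPARK_HOME", "/dev/null")
  .set("spark.executorEnv.SPARK_HOME", "/dev/null")
{code}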
[jira] [Commented] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings
[ https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392992#comment-14392992 ] Platon Potapov commented on SPARK-6569: --- If we talk about log level reduction only, then no, I don't feel strongly about the change. But I would suggest not logging anything at all. Cody's rationale behind the warning is for empty input batches to not go unnoticed. I feel that is a responsibility of a monitoring agent external to the spark job - an agent knowing the specifics of the use case the job is being run in, and hence having the capacity to decide whether an empty batch is an abnormal situation (in my case, it isn't). If that's not an option - well, please close the ticket then. > Kafka directInputStream logs what appear to be incorrect warnings > - > > Key: SPARK-6569 > URL: https://issues.apache.org/jira/browse/SPARK-6569 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.0 > Environment: Spark 1.3.0 >Reporter: Platon Potapov >Priority: Minor > > During what appears to be normal operation of streaming from a Kafka topic, > the following log records are observed, logged periodically: > {code} > [Stage 391:==> (3 + 0) / > 4] > 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the > same as ending offset skipping raw 0 > 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the > same as ending offset skipping raw 0 > 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the > same as ending offset skipping raw 0 > {code} > * the part.fromOffset placeholder is not correctly substituted to a value > * is the condition really mandates a warning being logged? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6618) HiveMetastoreCatalog.lookupRelation should use fine-grained lock
[ https://issues.apache.org/jira/browse/SPARK-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392991#comment-14392991 ] Apache Spark commented on SPARK-6618: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/5333 > HiveMetastoreCatalog.lookupRelation should use fine-grained lock > > > Key: SPARK-6618 > URL: https://issues.apache.org/jira/browse/SPARK-6618 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > Fix For: 1.3.1, 1.4.0 > > > Right now the entire method of HiveMetastoreCatalog.lookupRelation has a lock > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L173) > and the scope of lock will cover resolving data source tables > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L93). > So, lookupRelation can be extremely expensive when we are doing expensive > operations like parquet schema discovery. So, we should use fine-grained lock > for lookupRelation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
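A rough sketch of the direction (all types below are simplified stand-ins, not the real HiveMetastoreCatalog code): hold the lock only while talking to the metastore, and do the expensive data source resolution outside of it.
{code}
// Sketch only: narrow the critical section to the metastore lookup itself.
// All types here are simplified stand-ins for the real HiveMetastoreCatalog code.
case class RawTable(name: String, properties: Map[String, String])
sealed trait Relation
case class MetastoreRelation(table: RawTable) extends Relation
case class DataSourceRelation(table: RawTable) extends Relation

class Catalog(getTableFromMetastore: (String, String) => RawTable) {
  private val lock = new Object

  def lookupRelation(db: String, table: String): Relation = {
    // Only the metastore call itself is guarded by the lock ...
    val raw = lock.synchronized { getTableFromMetastore(db, table) }
    // ... expensive work (e.g. Parquet schema discovery) stays outside the critical section.
    if (raw.properties.contains("spark.sql.sources.provider")) DataSourceRelation(raw)
    else MetastoreRelation(raw)
  }
}
{code}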
[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393017#comment-14393017 ] Joseph K. Bradley commented on SPARK-5992: -- That sounds good; I'll try to take a look at that paper soon. Will you be able to write a short design doc? > Locality Sensitive Hashing (LSH) for MLlib > -- > > Key: SPARK-5992 > URL: https://issues.apache.org/jira/browse/SPARK-5992 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley > > Locality Sensitive Hashing (LSH) would be very useful for ML. It would be > great to discuss some possible algorithms here, choose an API, and make a PR > for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5972) Cache residuals for GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5972: - Assignee: Manoj Kumar > Cache residuals for GradientBoostedTrees during training > > > Key: SPARK-5972 > URL: https://issues.apache.org/jira/browse/SPARK-5972 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar >Priority: Minor > > In gradient boosting, the current model's prediction is re-computed for each > training instance on every iteration. The current residual (cumulative > prediction of previously trained trees in the ensemble) should be cached. > That could reduce both computation (only computing the prediction of the most > recently trained tree) and communication (only sending the most recently > trained tree to the workers). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
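A rough sketch of the caching idea (simplified types; the real code would live in GradientBoostedTrees and use DecisionTreeModel): keep an RDD of per-instance cumulative predictions and update it with only the newest tree.
{code}
// Sketch of incremental prediction caching for boosting.
// `Tree` is a stand-in for DecisionTreeModel; `predict` is assumed to score one instance.
import org.apache.spark.rdd.RDD

case class Instance(label: Double, features: Array[Double])
trait Tree { def predict(features: Array[Double]): Double }

def updateCachedPredictions(
    data: RDD[Instance],
    cached: RDD[Double],        // cumulative prediction of the trees trained so far
    newTree: Tree,
    learningRate: Double): RDD[Double] = {
  // Only the newest tree is evaluated; earlier trees' contributions come from the cache.
  data.zip(cached).map { case (instance, oldPred) =>
    oldPred + learningRate * newTree.predict(instance.features)
  }.cache()
}
{code}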
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393021#comment-14393021 ] Joseph K. Bradley commented on SPARK-6407: -- I'm not too familiar with the area, but it seems similar to randomized linear algebra work if you can assume the incoming data is i.i.d. But [~mengxr] may be more familiar with this literature than me... > Streaming ALS for Collaborative Filtering > - > > Key: SPARK-6407 > URL: https://issues.apache.org/jira/browse/SPARK-6407 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Felix Cheung >Priority: Minor > > Like MLLib's ALS implementation for recommendation, and applying to streaming. > Similar to streaming linear regression, logistic regression, could we apply > gradient updates to batches of data and reuse existing MLLib implementation? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6194) collect() in PySpark will cause memory leak in JVM
[ https://issues.apache.org/jira/browse/SPARK-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6194: --- Assignee: Apache Spark (was: Davies Liu) > collect() in PySpark will cause memory leak in JVM > -- > > Key: SPARK-6194 > URL: https://issues.apache.org/jira/browse/SPARK-6194 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 >Reporter: Davies Liu >Assignee: Apache Spark >Priority: Critical > Fix For: 1.2.2, 1.3.1, 1.4.0 > > > It could be reproduced by: > {code} > for i in range(40): > sc.parallelize(range(5000), 10).flatMap(lambda i: range(1)).collect() > {code} > It will fail after 2 or 3 jobs, and run totally successfully if I add > `gc.collect()` after each job. > We could call _detach() for the JavaList returned by collect > in Java, will send out a PR for this. > Reported by Michael and commented by Josh: > On Thu, Mar 5, 2015 at 2:39 PM, Josh Rosen wrote: > > Based on Py4J's Memory Model page > > (http://py4j.sourceforge.net/advanced_topics.html#py4j-memory-model): > > > >> Because Java objects on the Python side are involved in a circular > >> reference (JavaObject and JavaMember reference each other), these objects > >> are not immediately garbage collected once the last reference to the object > >> is removed (but they are guaranteed to be eventually collected if the > >> Python > >> garbage collector runs before the Python program exits). > > > > > >> > >> In doubt, users can always call the detach function on the Python gateway > >> to explicitly delete a reference on the Java side. A call to gc.collect() > >> also usually works. > > > > > > Maybe we should be manually calling detach() when the Python-side has > > finished consuming temporary objects from the JVM. Do you have a small > > workload / configuration that reproduces the OOM which we can use to test a > > fix? I don't think that I've seen this issue in the past, but this might be > > because we mistook Java OOMs as being caused by collecting too much data > > rather than due to memory leaks. > > > > On Thu, Mar 5, 2015 at 10:41 AM, Michael Nazario > > wrote: > >> > >> Hi Josh, > >> > >> I have a question about how PySpark does memory management in the Py4J > >> bridge between the Java driver and the Python driver. I was wondering if > >> there have been any memory problems in this system because the Python > >> garbage collector does not collect circular references immediately and Py4J > >> has circular references in each object it receives from Java. > >> > >> When I dug through the PySpark code, I seemed to find that most RDD > >> actions return by calling collect. In collect, you end up calling the Java > >> RDD collect and getting an iterator from that. Would this be a possible > >> cause for a Java driver OutOfMemoryException because there are resources in > >> Java which do not get freed up immediately? > >> > >> I have also seen that trying to take a lot of values from a dataset twice > >> in a row can cause the Java driver to OOM (while just once works). Are > >> there > >> some other memory considerations that are relevant in the driver? > >> > >> Thanks, > >> Michael -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6194) collect() in PySpark will cause memory leak in JVM
[ https://issues.apache.org/jira/browse/SPARK-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6194: --- Assignee: Davies Liu (was: Apache Spark) > collect() in PySpark will cause memory leak in JVM > -- > > Key: SPARK-6194 > URL: https://issues.apache.org/jira/browse/SPARK-6194 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 1.2.2, 1.3.1, 1.4.0 > > > It could be reproduced by: > {code} > for i in range(40): > sc.parallelize(range(5000), 10).flatMap(lambda i: range(1)).collect() > {code} > It will fail after 2 or 3 jobs, and run totally successfully if I add > `gc.collect()` after each job. > We could call _detach() for the JavaList returned by collect > in Java, will send out a PR for this. > Reported by Michael and commented by Josh: > On Thu, Mar 5, 2015 at 2:39 PM, Josh Rosen wrote: > > Based on Py4J's Memory Model page > > (http://py4j.sourceforge.net/advanced_topics.html#py4j-memory-model): > > > >> Because Java objects on the Python side are involved in a circular > >> reference (JavaObject and JavaMember reference each other), these objects > >> are not immediately garbage collected once the last reference to the object > >> is removed (but they are guaranteed to be eventually collected if the > >> Python > >> garbage collector runs before the Python program exits). > > > > > >> > >> In doubt, users can always call the detach function on the Python gateway > >> to explicitly delete a reference on the Java side. A call to gc.collect() > >> also usually works. > > > > > > Maybe we should be manually calling detach() when the Python-side has > > finished consuming temporary objects from the JVM. Do you have a small > > workload / configuration that reproduces the OOM which we can use to test a > > fix? I don't think that I've seen this issue in the past, but this might be > > because we mistook Java OOMs as being caused by collecting too much data > > rather than due to memory leaks. > > > > On Thu, Mar 5, 2015 at 10:41 AM, Michael Nazario > > wrote: > >> > >> Hi Josh, > >> > >> I have a question about how PySpark does memory management in the Py4J > >> bridge between the Java driver and the Python driver. I was wondering if > >> there have been any memory problems in this system because the Python > >> garbage collector does not collect circular references immediately and Py4J > >> has circular references in each object it receives from Java. > >> > >> When I dug through the PySpark code, I seemed to find that most RDD > >> actions return by calling collect. In collect, you end up calling the Java > >> RDD collect and getting an iterator from that. Would this be a possible > >> cause for a Java driver OutOfMemoryException because there are resources in > >> Java which do not get freed up immediately? > >> > >> I have also seen that trying to take a lot of values from a dataset twice > >> in a row can cause the Java driver to OOM (while just once works). Are > >> there > >> some other memory considerations that are relevant in the driver? > >> > >> Thanks, > >> Michael -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-6667) hang while collect in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-6667: -- Comment: was deleted (was: This patch introduced a rare bug that can cause a hang while calling {{collect()}} (SPARK-6194); I'm commenting here so we remember to put the fix into the correct branches.) > hang while collect in PySpark > - > > Key: SPARK-6667 > URL: https://issues.apache.org/jira/browse/SPARK-6667 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.1, 1.4.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > > PySpark tests hang while collecting: -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-677) PySpark should not collect results through local filesystem
[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-677: -- Assignee: Davies Liu (was: Apache Spark) > PySpark should not collect results through local filesystem > --- > > Key: SPARK-677 > URL: https://issues.apache.org/jira/browse/SPARK-677 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0, 1.4.0 >Reporter: Josh Rosen >Assignee: Davies Liu > > Py4J is slow when transferring large arrays, so PySpark currently dumps data > to the disk and reads it back in order to collect() RDDs. On large enough > datasets, this data will spill from the buffer cache and write to the > physical disk, resulting in terrible performance. > Instead, we should stream the data from Java to Python over a local socket or > a FIFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6667) hang while collect in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393023#comment-14393023 ] Josh Rosen commented on SPARK-6667: --- This patch introduced a rare bug that can cause a hang while calling {{collect()}} (SPARK-6194); I'm commenting here so we remember to put the fix into the correct branches. > hang while collect in PySpark > - > > Key: SPARK-6667 > URL: https://issues.apache.org/jira/browse/SPARK-6667 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.1, 1.4.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > > PySpark tests hang while collecting: -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-677) PySpark should not collect results through local filesystem
[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-677: -- Assignee: Apache Spark (was: Davies Liu) > PySpark should not collect results through local filesystem > --- > > Key: SPARK-677 > URL: https://issues.apache.org/jira/browse/SPARK-677 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0, 1.4.0 >Reporter: Josh Rosen >Assignee: Apache Spark > > Py4J is slow when transferring large arrays, so PySpark currently dumps data > to the disk and reads it back in order to collect() RDDs. On large enough > datasets, this data will spill from the buffer cache and write to the > physical disk, resulting in terrible performance. > Instead, we should stream the data from Java to Python over a local socket or > a FIFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
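A very small sketch of the proposed direction (JVM side only, plain JDK sockets, all names hypothetical): serve the collected bytes over a loopback socket instead of a temporary file.
{code}
// Sketch: stream collected bytes to a local client over a loopback socket
// instead of writing them to a temp file. All names are hypothetical.
import java.io.DataOutputStream
import java.net.{InetAddress, ServerSocket}

def serveCollectedBytes(partitions: Iterator[Array[Byte]]): Int = {
  // Bind to an ephemeral port on localhost only.
  val server = new ServerSocket(0, 1, InetAddress.getByName("localhost"))
  new Thread("serve-collect-result") {
    setDaemon(true)
    override def run(): Unit = {
      val socket = server.accept()
      val out = new DataOutputStream(socket.getOutputStream)
      try {
        partitions.foreach { bytes =>
          out.writeInt(bytes.length)   // simple length-prefixed framing
          out.write(bytes)
        }
        out.writeInt(-1)               // end-of-stream marker
        out.flush()
      } finally {
        socket.close()
        server.close()
      }
    }
  }.start()
  server.getLocalPort  // the Python side would connect to this port and read the frames
}
{code}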
[jira] [Commented] (SPARK-6194) collect() in PySpark will cause memory leak in JVM
[ https://issues.apache.org/jira/browse/SPARK-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393026#comment-14393026 ] Josh Rosen commented on SPARK-6194: --- This patch introduced a rare bug that can cause a hang while calling {{collect()}} (SPARK-6194); I'm commenting here so that the issues are linked to ensure that the fixes land in the right branches. > collect() in PySpark will cause memory leak in JVM > -- > > Key: SPARK-6194 > URL: https://issues.apache.org/jira/browse/SPARK-6194 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 1.2.2, 1.3.1, 1.4.0 > > > It could be reproduced by: > {code} > for i in range(40): > sc.parallelize(range(5000), 10).flatMap(lambda i: range(1)).collect() > {code} > It will fail after 2 or 3 jobs, and run totally successfully if I add > `gc.collect()` after each job. > We could call _detach() for the JavaList returned by collect > in Java, will send out a PR for this. > Reported by Michael and commented by Josh: > On Thu, Mar 5, 2015 at 2:39 PM, Josh Rosen wrote: > > Based on Py4J's Memory Model page > > (http://py4j.sourceforge.net/advanced_topics.html#py4j-memory-model): > > > >> Because Java objects on the Python side are involved in a circular > >> reference (JavaObject and JavaMember reference each other), these objects > >> are not immediately garbage collected once the last reference to the object > >> is removed (but they are guaranteed to be eventually collected if the > >> Python > >> garbage collector runs before the Python program exits). > > > > > >> > >> In doubt, users can always call the detach function on the Python gateway > >> to explicitly delete a reference on the Java side. A call to gc.collect() > >> also usually works. > > > > > > Maybe we should be manually calling detach() when the Python-side has > > finished consuming temporary objects from the JVM. Do you have a small > > workload / configuration that reproduces the OOM which we can use to test a > > fix? I don't think that I've seen this issue in the past, but this might be > > because we mistook Java OOMs as being caused by collecting too much data > > rather than due to memory leaks. > > > > On Thu, Mar 5, 2015 at 10:41 AM, Michael Nazario > > wrote: > >> > >> Hi Josh, > >> > >> I have a question about how PySpark does memory management in the Py4J > >> bridge between the Java driver and the Python driver. I was wondering if > >> there have been any memory problems in this system because the Python > >> garbage collector does not collect circular references immediately and Py4J > >> has circular references in each object it receives from Java. > >> > >> When I dug through the PySpark code, I seemed to find that most RDD > >> actions return by calling collect. In collect, you end up calling the Java > >> RDD collect and getting an iterator from that. Would this be a possible > >> cause for a Java driver OutOfMemoryException because there are resources in > >> Java which do not get freed up immediately? > >> > >> I have also seen that trying to take a lot of values from a dataset twice > >> in a row can cause the Java driver to OOM (while just once works). Are > >> there > >> some other memory considerations that are relevant in the driver? 
> >> > >> Thanks, > >> Michael -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6194) collect() in PySpark will cause memory leak in JVM
[ https://issues.apache.org/jira/browse/SPARK-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-6194. --- Resolution: Fixed Target Version/s: 1.3.0, 1.2.2 (was: 1.0.3, 1.1.2, 1.2.2, 1.3.0) Going to resolve this as "Fixed" for now, since I don't think we're going to backport prior to 1.2.x. > collect() in PySpark will cause memory leak in JVM > -- > > Key: SPARK-6194 > URL: https://issues.apache.org/jira/browse/SPARK-6194 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 1.2.2, 1.3.1, 1.4.0 > > > It could be reproduced by: > {code} > for i in range(40): > sc.parallelize(range(5000), 10).flatMap(lambda i: range(1)).collect() > {code} > It will fail after 2 or 3 jobs, and run totally successfully if I add > `gc.collect()` after each job. > We could call _detach() for the JavaList returned by collect > in Java, will send out a PR for this. > Reported by Michael and commented by Josh: > On Thu, Mar 5, 2015 at 2:39 PM, Josh Rosen wrote: > > Based on Py4J's Memory Model page > > (http://py4j.sourceforge.net/advanced_topics.html#py4j-memory-model): > > > >> Because Java objects on the Python side are involved in a circular > >> reference (JavaObject and JavaMember reference each other), these objects > >> are not immediately garbage collected once the last reference to the object > >> is removed (but they are guaranteed to be eventually collected if the > >> Python > >> garbage collector runs before the Python program exits). > > > > > >> > >> In doubt, users can always call the detach function on the Python gateway > >> to explicitly delete a reference on the Java side. A call to gc.collect() > >> also usually works. > > > > > > Maybe we should be manually calling detach() when the Python-side has > > finished consuming temporary objects from the JVM. Do you have a small > > workload / configuration that reproduces the OOM which we can use to test a > > fix? I don't think that I've seen this issue in the past, but this might be > > because we mistook Java OOMs as being caused by collecting too much data > > rather than due to memory leaks. > > > > On Thu, Mar 5, 2015 at 10:41 AM, Michael Nazario > > wrote: > >> > >> Hi Josh, > >> > >> I have a question about how PySpark does memory management in the Py4J > >> bridge between the Java driver and the Python driver. I was wondering if > >> there have been any memory problems in this system because the Python > >> garbage collector does not collect circular references immediately and Py4J > >> has circular references in each object it receives from Java. > >> > >> When I dug through the PySpark code, I seemed to find that most RDD > >> actions return by calling collect. In collect, you end up calling the Java > >> RDD collect and getting an iterator from that. Would this be a possible > >> cause for a Java driver OutOfMemoryException because there are resources in > >> Java which do not get freed up immediately? > >> > >> I have also seen that trying to take a lot of values from a dataset twice > >> in a row can cause the Java driver to OOM (while just once works). Are > >> there > >> some other memory considerations that are relevant in the driver? > >> > >> Thanks, > >> Michael -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6681) JAVA_HOME error with upgrade to Spark 1.3.0
Ken Williams created SPARK-6681: --- Summary: JAVA_HOME error with upgrade to Spark 1.3.0 Key: SPARK-6681 URL: https://issues.apache.org/jira/browse/SPARK-6681 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 1.3.0 Environment: Client is Mac OS X version 10.10.2, cluster is running HDP 2.1 stack. Reporter: Ken Williams I’m trying to upgrade a Spark project, written in Scala, from Spark 1.2.1 to 1.3.0, so I changed my `build.sbt` like so: {code} -libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided" +libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided" {code} then make an `assembly` jar, and submit it: {code} HADOOP_CONF_DIR=/etc/hadoop/conf \ spark-submit \ --driver-class-path=/etc/hbase/conf \ --conf spark.hadoop.validateOutputSpecs=false \ --conf spark.yarn.jar=hdfs:/apps/local/spark-assembly-1.3.0-hadoop2.4.0.jar \ --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ --deploy-mode=cluster \ --master=yarn \ --class=TestObject \ --num-executors=54 \ target/scala-2.11/myapp-assembly-1.2.jar {code} The job fails to submit, with the following exception in the terminal: {code} 15/03/19 10:30:07 INFO yarn.Client: 15/03/19 10:20:03 INFO yarn.Client: client token: N/A diagnostics: Application application_1420225286501_4698 failed 2 times due to AM Container for appattempt_1420225286501_4698_02 exited with exitCode: 127 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:464) at org.apache.hadoop.util.Shell.run(Shell.java:379) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {code} Finally, I go and check the YARN app master’s web interface (since the job is there, I know it at least made it that far), and the only logs it shows are these: {code} Log Type: stderr Log Length: 61 /bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory Log Type: stdout Log Length: 0 {code} I’m not sure how to interpret that - is {{ {{JAVA_HOME}} }} a literal (including the brackets) that’s somehow making it into a script? Is this coming from the worker nodes or the driver? Anything I can do to experiment & troubleshoot? I do have {{JAVA_HOME}} set in the hadoop config files on all the nodes of the cluster: {code} % grep JAVA_HOME /etc/hadoop/conf/*.sh /etc/hadoop/conf/hadoop-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31 /etc/hadoop/conf/yarn-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31 {code} Has this behavior changed in 1.3.0 since 1.2.1? Using 1.2.1 and making no other changes, the job completes fine. (Note: I originally posted this on the Spark mailing list and also on Stack Overflow, I'll update both places if/when I find a solution.) 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6667) hang while collect in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-6667: -- Priority: Blocker (was: Critical) Target Version/s: 1.2.2, 1.3.1, 1.4.0 (was: 1.3.1, 1.4.0) Affects Version/s: 1.2.2 This was introduced by SPARK-6194, which was merged for 1.2.2, so I'm adding 1.2.2 as an affected / target version. > hang while collect in PySpark > - > > Key: SPARK-6667 > URL: https://issues.apache.org/jira/browse/SPARK-6667 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.2, 1.3.1, 1.4.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > > PySpark tests hang while collecting: -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
Joseph K. Bradley created SPARK-6682: Summary: Deprecate static train and use builder instead for Scala/Java Key: SPARK-6682 URL: https://issues.apache.org/jira/browse/SPARK-6682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. "Old static train()" API: {code} val myModel = NaiveBayes.train(myData, ...) {code} "New builder pattern" API: {code} val nb = new NaiveBayes().setLambda(0.1) val myModel = nb.train(myData) {code} Pros of the builder pattern: * Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). * Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons: * In Python APIs, static train methods are more "Pythonic." Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. * Python: Keep static train methods. CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
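As a concrete, purely hypothetical illustration of the proposal (class, parameter, and method names below are illustrative, not the real MLlib API), a deprecated static train() could simply delegate to the builder so both styles keep working during the transition:
{code}
// Hypothetical sketch of deprecating a static train() in favor of a builder.
class MyNaiveBayes {
  private var lambda: Double = 1.0
  def setLambda(value: Double): this.type = { lambda = value; this }
  def run(data: Seq[(Double, Array[Double])]): String = s"model(lambda=$lambda, n=${data.size})"
}

object MyNaiveBayes {
  @deprecated("Use new MyNaiveBayes().setLambda(...).run(data) instead.", "1.4.0")
  def train(data: Seq[(Double, Array[Double])], lambda: Double): String =
    new MyNaiveBayes().setLambda(lambda).run(data)  // old entry point just delegates to the builder
}
{code}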
[jira] [Created] (SPARK-6683) GLMs with GradientDescent could scale step size instead of features
Joseph K. Bradley created SPARK-6683: Summary: GLMs with GradientDescent could scale step size instead of features Key: SPARK-6683 URL: https://issues.apache.org/jira/browse/SPARK-6683 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor GeneralizedLinearAlgorithm can scale features. This improves optimization behavior (and also affects the optimal solution, as is being discussed and hopefully fixed by [https://github.com/apache/spark/pull/5055]). This is a bit inefficient since it requires making a rescaled copy of the data. GradientDescent could instead scale the step size separately for each feature (and adjust regularization as needed; see the PR linked above). This would require storing a vector of length numFeatures, rather than making a full copy of the data. I haven't thought this through for LBFGS, so I'm not sure if it's generally usable or would require a specialization for GLMs with GradientDescent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
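A toy sketch of the per-feature idea (names are hypothetical and regularization is ignored; this is not the GradientDescent code): for a linear model, running plain SGD on features divided by their standard deviation is equivalent to keeping the data unscaled and using a per-coordinate step of stepSize / sigma_j^2.
{code}
// Toy illustration: per-feature step sizes instead of materializing rescaled data.
// `featureStd` plays the role of the column scaling factors; all names are hypothetical.
def perFeatureStep(
    weights: Array[Double],
    gradient: Array[Double],      // gradient w.r.t. the *unscaled* features
    featureStd: Array[Double],    // sigma_j, computed once per dataset
    stepSize: Double,
    iter: Int): Array[Double] = {
  val base = stepSize / math.sqrt(iter)   // usual 1/sqrt(t) decay
  weights.indices.map { j =>
    val sigma = math.max(featureStd(j), 1e-12)
    weights(j) - base * gradient(j) / (sigma * sigma)
  }.toArray
}
{code}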
[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393141#comment-14393141 ] Zhan Zhang commented on SPARK-2883: --- The following code demonstrates the usage of the ORC support:
{code}
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._

//saveAsOrcFile
case class AllDataTypes(
  stringField: String,
  intField: Int,
  longField: Long,
  floatField: Float,
  doubleField: Double,
  shortField: Short,
  byteField: Byte,
  booleanField: Boolean)

val range = (0 to 255)
val data = sc.parallelize(range).map(x => AllDataTypes(s"$x", x, x.toLong, x.toFloat, x.toDouble, x.toShort, x.toByte, x % 2 == 0))
data.toDF().saveAsOrcFile("orcTest")

//read orcFile
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val orcTest = hiveContext.orcFile("orcTest")
orcTest.registerTempTable("orcTest")
hiveContext.sql("SELECT * from orcTest where intfield>185").collect.foreach(println)

hiveContext.sql("create temporary table orc using org.apache.spark.sql.hive.orc OPTIONS (path \"orcTest\")")
hiveContext.sql("select * from orc").collect.foreach(println)
val table = hiveContext.sql("select * from orc")
table.saveAsTable("table", "org.apache.spark.sql.hive.orc")
val hiveOrc = hiveContext.orcFile("/user/hive/warehouse/table")
hiveOrc.registerTempTable("hiveOrc")
hiveContext.sql("select * from hiveOrc").collect.foreach(println)
table.saveAsOrcFile("/user/ambari-qa/table")
hiveContext.sql("create temporary table normal_orc_as_source USING org.apache.spark.sql.hive.orc OPTIONS (path 'saveTable') as select * from table")
{code}
> Spark Support for ORCFile format > > > Key: SPARK-2883 > URL: https://issues.apache.org/jira/browse/SPARK-2883 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Reporter: Zhan Zhang >Priority: Blocker > Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 > pm jobtracker.png, orc.diff > > > Verify the support of OrcInputFormat in spark, fix issues if exists and add > documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3720) support ORC in spark sql
[ https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393146#comment-14393146 ] Zhan Zhang commented on SPARK-3720: --- [~iward] I have updated the patch with new API support. You can refer to https://issues.apache.org/jira/browse/SPARK-2883 > support ORC in spark sql > > > Key: SPARK-3720 > URL: https://issues.apache.org/jira/browse/SPARK-3720 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.1.0 >Reporter: Fei Wang > Attachments: orc.diff > > > The Optimized Row Columnar (ORC) file format provides a highly efficient way > to store data on hdfs.ORC file format has many advantages such as: > 1 a single file as the output of each task, which reduces the NameNode's load > 2 Hive type support including datetime, decimal, and the complex types > (struct, list, map, and union) > 3 light-weight indexes stored within the file > skip row groups that don't pass predicate filtering > seek to a given row > 4 block-mode compression based on data type > run-length encoding for integer columns > dictionary encoding for string columns > 5 concurrent reads of the same file using separate RecordReaders > 6 ability to split files without scanning for markers > 7 bound the amount of memory needed for reading or writing > 8 metadata stored using Protocol Buffers, which allows addition and removal > of fields > Now spark sql support Parquet, support ORC provide people more opts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6684) Add checkpointing to GradientBoostedTrees
Joseph K. Bradley created SPARK-6684: Summary: Add checkpointing to GradientBoostedTrees Key: SPARK-6684 URL: https://issues.apache.org/jira/browse/SPARK-6684 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor We should add checkpointing to GradientBoostedTrees since it maintains RDDs with long lineages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
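A minimal sketch of what periodic checkpointing inside the boosting loop could look like (the interval handling and names are illustrative; the real change would live in GradientBoostedTrees):
{code}
// Sketch: truncate the lineage of the cached prediction/error RDD every few iterations.
// `checkpointInterval` and the RDD involved are illustrative.
import org.apache.spark.rdd.RDD

def maybeCheckpoint[T](rdd: RDD[T], iter: Int, checkpointInterval: Int): RDD[T] = {
  rdd.cache()
  if (checkpointInterval > 0 && iter % checkpointInterval == 0) {
    // Assumes SparkContext.setCheckpointDir(...) has been called by the driver program.
    rdd.checkpoint()
  }
  rdd
}
{code}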
[jira] [Assigned] (SPARK-6671) Add status command for spark daemons
[ https://issues.apache.org/jira/browse/SPARK-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6671: --- Assignee: (was: Apache Spark) > Add status command for spark daemons > > > Key: SPARK-6671 > URL: https://issues.apache.org/jira/browse/SPARK-6671 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: PRADEEP CHANUMOLU > Labels: easyfix > Fix For: 1.4.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Currently using the spark-daemon.sh script we can start and stop the spark > daemons. But we cannot get the status of the daemons. It will be nice to > include the status command in the spark-daemon.sh script, through which we > can know if the spark daemon is alive or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6671) Add status command for spark daemons
[ https://issues.apache.org/jira/browse/SPARK-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393182#comment-14393182 ] Apache Spark commented on SPARK-6671: - User 'pchanumolu' has created a pull request for this issue: https://github.com/apache/spark/pull/5327 > Add status command for spark daemons > > > Key: SPARK-6671 > URL: https://issues.apache.org/jira/browse/SPARK-6671 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: PRADEEP CHANUMOLU > Labels: easyfix > Fix For: 1.4.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Currently using the spark-daemon.sh script we can start and stop the spark > daemons. But we cannot get the status of the daemons. It will be nice to > include the status command in the spark-daemon.sh script, through which we > can know if the spark daemon is alive or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6671) Add status command for spark daemons
[ https://issues.apache.org/jira/browse/SPARK-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6671: --- Assignee: Apache Spark > Add status command for spark daemons > > > Key: SPARK-6671 > URL: https://issues.apache.org/jira/browse/SPARK-6671 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: PRADEEP CHANUMOLU >Assignee: Apache Spark > Labels: easyfix > Fix For: 1.4.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Currently using the spark-daemon.sh script we can start and stop the spark > daemons. But we cannot get the status of the daemons. It will be nice to > include the status command in the spark-daemon.sh script, through which we > can know if the spark daemon is alive or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6667) hang while collect in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-6667. --- Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 1.2.2 > hang while collect in PySpark > - > > Key: SPARK-6667 > URL: https://issues.apache.org/jira/browse/SPARK-6667 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.2, 1.3.1, 1.4.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.2.2, 1.3.1, 1.4.0 > > > PySpark tests hang while collecting: -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4194) Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state
[ https://issues.apache.org/jira/browse/SPARK-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4194: --- Assignee: Apache Spark > Exceptions thrown during SparkContext or SparkEnv construction might lead to > resource leaks or corrupted global state > - > > Key: SPARK-4194 > URL: https://issues.apache.org/jira/browse/SPARK-4194 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Critical > > The SparkContext and SparkEnv constructors instantiate a bunch of objects > that may need to be cleaned up after they're no longer needed. If an > exception is thrown during SparkContext or SparkEnv construction (e.g. due to > a bad configuration setting), then objects created earlier in the constructor > may not be properly cleaned up. > This is unlikely to cause problems for batch jobs submitted through > {{spark-submit}}, since failure to construct SparkContext will probably cause > the JVM to exit, but it is a potentially serious issue in interactive > environments where a user might attempt to create SparkContext with some > configuration, fail due to an error, and re-attempt the creation with new > settings. In this case, resources from the previous creation attempt might > not have been cleaned up and could lead to confusing errors (especially if > the old, leaked resources share global state with the new SparkContext). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4194) Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state
[ https://issues.apache.org/jira/browse/SPARK-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393249#comment-14393249 ] Apache Spark commented on SPARK-4194: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/5335 > Exceptions thrown during SparkContext or SparkEnv construction might lead to > resource leaks or corrupted global state > - > > Key: SPARK-4194 > URL: https://issues.apache.org/jira/browse/SPARK-4194 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Josh Rosen >Priority: Critical > > The SparkContext and SparkEnv constructors instantiate a bunch of objects > that may need to be cleaned up after they're no longer needed. If an > exception is thrown during SparkContext or SparkEnv construction (e.g. due to > a bad configuration setting), then objects created earlier in the constructor > may not be properly cleaned up. > This is unlikely to cause problems for batch jobs submitted through > {{spark-submit}}, since failure to construct SparkContext will probably cause > the JVM to exit, but it is a potentially serious issue in interactive > environments where a user might attempt to create SparkContext with some > configuration, fail due to an error, and re-attempt the creation with new > settings. In this case, resources from the previous creation attempt might > not have been cleaned up and could lead to confusing errors (especially if > the old, leaked resources share global state with the new SparkContext). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4194) Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state
[ https://issues.apache.org/jira/browse/SPARK-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4194: --- Assignee: (was: Apache Spark) > Exceptions thrown during SparkContext or SparkEnv construction might lead to > resource leaks or corrupted global state > - > > Key: SPARK-4194 > URL: https://issues.apache.org/jira/browse/SPARK-4194 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Josh Rosen >Priority: Critical > > The SparkContext and SparkEnv constructors instantiate a bunch of objects > that may need to be cleaned up after they're no longer needed. If an > exception is thrown during SparkContext or SparkEnv construction (e.g. due to > a bad configuration setting), then objects created earlier in the constructor > may not be properly cleaned up. > This is unlikely to cause problems for batch jobs submitted through > {{spark-submit}}, since failure to construct SparkContext will probably cause > the JVM to exit, but it is a potentially serious issue in interactive > environments where a user might attempt to create SparkContext with some > configuration, fail due to an error, and re-attempt the creation with new > settings. In this case, resources from the previous creation attempt might > not have been cleaned up and could lead to confusing errors (especially if > the old, leaked resources share global state with the new SparkContext). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
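A small sketch of the usual defensive-construction pattern for this kind of issue (everything here is hypothetical; the real fix would need to track exactly which SparkContext/SparkEnv members were created before the failure):
{code}
// Sketch of defensive construction: clean up partially-created resources if a later step throws.
class Resource(val name: String) { def close(): Unit = println(s"closed $name") }

class Context(badSetting: Boolean) {
  private val created = scala.collection.mutable.ArrayBuffer.empty[Resource]

  try {
    created += new Resource("listener-bus")
    created += new Resource("block-manager")
    if (badSetting) throw new IllegalArgumentException("bad configuration")
    created += new Resource("scheduler")
  } catch {
    case e: Throwable =>
      // Undo whatever was created before the failure so global state is not corrupted.
      created.reverseIterator.foreach(r => try r.close() catch { case _: Throwable => () })
      throw e
  }
}
{code}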
[jira] [Comment Edited] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393141#comment-14393141 ] Zhan Zhang edited comment on SPARK-2883 at 4/2/15 7:54 PM: --- Following code demonstrate the usage of the orc support. @climberus following examples demonstrate how to use it: import org.apache.spark.sql.hive.orc._ import org.apache.spark.sql._ //schema case class AllDataTypes( stringField: String, intField: Int, longField: Long, floatField: Float, doubleField: Double, shortField: Short, byteField: Byte, booleanField: Boolean) //saveAsOrcFile val range = (0 to 255) val data = sc.parallelize(range).map(x => AllDataTypes(s"$x", x, x.toLong, x.toFloat, x.toDouble, x.toShort, x.toByte, x % 2 == 0)) data.toDF().saveAsOrcFile("orcTest") //read orcFile val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) //orcFile val orcTest = hiveContext.orcFile("orcTest") orcTest.registerTempTable("orcTest") hiveContext.sql("SELECT * from orcTest where intfield>185").collect.foreach(println) //new data source API, read hiveContext.sql("create temporary table orc using org.apache.spark.sql.hive.orc OPTIONS (path \"orcTest\")") hiveContext.sql("select * from orc").collect.foreach(println) val table = hiveContext.sql("select * from orc") // new data source API write table.saveAsTable("table", "org.apache.spark.sql.hive.orc") val hiveOrc = hiveContext.orcFile("/user/hive/warehouse/table") hiveOrc.registerTempTable("hiveOrc") hiveContext.sql("select * from hiveOrc").collect.foreach(println) table.saveAsOrcFile("/user/ambari-qa/table") hiveContext.sql("create temporary table normal_orc_as_source USING org.apache.spark.sql.hive.orc OPTIONS (path 'saveTable') as select * from table") was (Author: zzhan): Following code demonstrate the usage of the orc support. 
import org.apache.spark.sql.hive.orc._ import org.apache.spark.sql._ //saveAsOrcFile case class AllDataTypes( stringField: String, intField: Int, longField: Long, floatField: Float, doubleField: Double, shortField: Short, byteField: Byte, booleanField: Boolean) val range = (0 to 255) val data = sc.parallelize(range).map(x => AllDataTypes(s"$x", x, x.toLong, x.toFloat, x.toDouble, x.toShort, x.toByte, x % 2 == 0)) data.toDF().saveAsOrcFile("orcTest") //read orcFile val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) val orcTest = hiveContext.orcFile("orcTest") orcTest.registerTempTable("orcTest") hiveContext.sql("SELECT * from orcTest where intfield>185").collect.foreach(println) hiveContext.sql("create temporary table orc using org.apache.spark.sql.hive.orc OPTIONS (path \"orcTest\")") hiveContext.sql("select * from orc").collect.foreach(println) val table = hiveContext.sql("select * from orc") table.saveAsTable("table", "org.apache.spark.sql.hive.orc") val hiveOrc = hiveContext.orcFile("/user/hive/warehouse/table") hiveOrc.registerTempTable("hiveOrc") hiveContext.sql("select * from hiveOrc").collect.foreach(println) table.saveAsOrcFile("/user/ambari-qa/table") hiveContext.sql("create temporary table normal_orc_as_source USING org.apache.spark.sql.hive.orc OPTIONS (path 'saveTable') as select * from table") > Spark Support for ORCFile format > > > Key: SPARK-2883 > URL: https://issues.apache.org/jira/browse/SPARK-2883 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Reporter: Zhan Zhang >Priority: Blocker > Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 > pm jobtracker.png, orc.diff > > > Verify the support of OrcInputFormat in spark, fix issues if exists and add > documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393141#comment-14393141 ] Zhan Zhang edited comment on SPARK-2883 at 4/2/15 7:54 PM: --- Following code demonstrate the usage of the orc support. import org.apache.spark.sql.hive.orc._ import org.apache.spark.sql._ //schema case class AllDataTypes( stringField: String, intField: Int, longField: Long, floatField: Float, doubleField: Double, shortField: Short, byteField: Byte, booleanField: Boolean) //saveAsOrcFile val range = (0 to 255) val data = sc.parallelize(range).map(x => AllDataTypes(s"$x", x, x.toLong, x.toFloat, x.toDouble, x.toShort, x.toByte, x % 2 == 0)) data.toDF().saveAsOrcFile("orcTest") //read orcFile val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) //orcFile val orcTest = hiveContext.orcFile("orcTest") orcTest.registerTempTable("orcTest") hiveContext.sql("SELECT * from orcTest where intfield>185").collect.foreach(println) //new data source API, read hiveContext.sql("create temporary table orc using org.apache.spark.sql.hive.orc OPTIONS (path \"orcTest\")") hiveContext.sql("select * from orc").collect.foreach(println) val table = hiveContext.sql("select * from orc") // new data source API write table.saveAsTable("table", "org.apache.spark.sql.hive.orc") val hiveOrc = hiveContext.orcFile("/user/hive/warehouse/table") hiveOrc.registerTempTable("hiveOrc") hiveContext.sql("select * from hiveOrc").collect.foreach(println) table.saveAsOrcFile("/user/ambari-qa/table") hiveContext.sql("create temporary table normal_orc_as_source USING org.apache.spark.sql.hive.orc OPTIONS (path 'saveTable') as select * from table") was (Author: zzhan): Following code demonstrate the usage of the orc support. 
@climberus following examples demonstrate how to use it: import org.apache.spark.sql.hive.orc._ import org.apache.spark.sql._ //schema case class AllDataTypes( stringField: String, intField: Int, longField: Long, floatField: Float, doubleField: Double, shortField: Short, byteField: Byte, booleanField: Boolean) //saveAsOrcFile val range = (0 to 255) val data = sc.parallelize(range).map(x => AllDataTypes(s"$x", x, x.toLong, x.toFloat, x.toDouble, x.toShort, x.toByte, x % 2 == 0)) data.toDF().saveAsOrcFile("orcTest") //read orcFile val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) //orcFile val orcTest = hiveContext.orcFile("orcTest") orcTest.registerTempTable("orcTest") hiveContext.sql("SELECT * from orcTest where intfield>185").collect.foreach(println) //new data source API, read hiveContext.sql("create temporary table orc using org.apache.spark.sql.hive.orc OPTIONS (path \"orcTest\")") hiveContext.sql("select * from orc").collect.foreach(println) val table = hiveContext.sql("select * from orc") // new data source API write table.saveAsTable("table", "org.apache.spark.sql.hive.orc") val hiveOrc = hiveContext.orcFile("/user/hive/warehouse/table") hiveOrc.registerTempTable("hiveOrc") hiveContext.sql("select * from hiveOrc").collect.foreach(println) table.saveAsOrcFile("/user/ambari-qa/table") hiveContext.sql("create temporary table normal_orc_as_source USING org.apache.spark.sql.hive.orc OPTIONS (path 'saveTable') as select * from table") > Spark Support for ORCFile format > > > Key: SPARK-2883 > URL: https://issues.apache.org/jira/browse/SPARK-2883 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Reporter: Zhan Zhang >Priority: Blocker > Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 > pm jobtracker.png, orc.diff > > > Verify the support of OrcInputFormat in spark, fix issues if exists and add > documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6479) Create off-heap block storage API (internal)
[ https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-6479: -- Attachment: SPARK-6479.pdf This is the updated version of the off-heap store internal API design. > Create off-heap block storage API (internal) > > > Key: SPARK-6479 > URL: https://issues.apache.org/jira/browse/SPARK-6479 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Reporter: Reynold Xin > Attachments: SPARK-6479.pdf, SparkOffheapsupportbyHDFS.pdf > > > Would be great to create APIs for off-heap block stores, rather than doing a > bunch of if statements everywhere. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
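For discussion purposes, a tiny sketch of what such an internal API might look like (the trait and method names are made up and do not reflect the attached design document):
{code}
// Hypothetical sketch of an internal off-heap block store API; names are illustrative only.
import java.nio.ByteBuffer

trait OffHeapBlockStore {
  def init(executorId: String): Unit
  def putBytes(blockId: String, bytes: ByteBuffer): Boolean
  def getBytes(blockId: String): Option[ByteBuffer]
  def remove(blockId: String): Boolean
  def contains(blockId: String): Boolean
  def shutdown(): Unit
}
{code}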
[jira] [Commented] (SPARK-6578) Outbound channel in network library is not thread-safe, can lead to fetch failures
[ https://issues.apache.org/jira/browse/SPARK-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393284#comment-14393284 ] Apache Spark commented on SPARK-6578: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/5336 > Outbound channel in network library is not thread-safe, can lead to fetch > failures > -- > > Key: SPARK-6578 > URL: https://issues.apache.org/jira/browse/SPARK-6578 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 1.3.1, 1.4.0 > > > There is a very narrow race in the outbound channel of the network library. > While netty guarantees that the inbound channel is thread-safe, the same is > not true for the outbound channel: multiple threads can be writing and > running the pipeline at the same time. > This leads to an issue with MessageEncoder and the optimization it performs > for zero-copy of file data: since a single RPC can be broken into multiple > buffers (for example, when replying to a chunk request), if you have > multiple threads writing these RPCs then they can be mixed up in the final > socket. That breaks framing and will cause the receiving side to not > understand the messages. > Patch coming up shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
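One way to picture the framing problem and its remedy (a sketch of the idea, not the actual patch, which also has to handle zero-copy file regions): make sure all buffers belonging to one encoded message reach the socket as a single unit so concurrent writers cannot interleave frames.
{code}
// Sketch only: combine an RPC's header and body into a single buffer before writing,
// so two threads writing concurrently cannot interleave the pieces of different messages.
import io.netty.buffer.{ByteBuf, Unpooled}
import io.netty.channel.Channel

def writeMessageAtomically(channel: Channel, header: ByteBuf, body: ByteBuf): Unit = {
  val whole = Unpooled.wrappedBuffer(header, body)  // zero-copy composite of the two buffers
  channel.writeAndFlush(whole)                      // enqueued as one unit on the event loop
}
{code}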
[jira] [Commented] (SPARK-6112) Provide OffHeap support through HDFS RAM_DISK
[ https://issues.apache.org/jira/browse/SPARK-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393287#comment-14393287 ] Zhan Zhang commented on SPARK-6112: --- Design spec for API attached to SPARK-6479 and wait for the community feedback. > Provide OffHeap support through HDFS RAM_DISK > - > > Key: SPARK-6112 > URL: https://issues.apache.org/jira/browse/SPARK-6112 > Project: Spark > Issue Type: New Feature > Components: Block Manager >Reporter: Zhan Zhang > Attachments: SparkOffheapsupportbyHDFS.pdf > > > HDFS Lazy_Persist policy provide possibility to cache the RDD off_heap in > hdfs. We may want to provide similar capacity to Tachyon by leveraging hdfs > RAM_DISK feature, if the user environment does not have tachyon deployed. > With this feature, it potentially provides possibility to share RDD in memory > across different jobs and even share with jobs other than spark, and avoid > the RDD recomputation if executors crash. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6079) Use index to speed up StatusTracker.getJobIdsForGroup()
[ https://issues.apache.org/jira/browse/SPARK-6079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-6079: -- Fix Version/s: 1.3.1 > Use index to speed up StatusTracker.getJobIdsForGroup() > --- > > Key: SPARK-6079 > URL: https://issues.apache.org/jira/browse/SPARK-6079 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > Fix For: 1.3.1, 1.4.0 > > > {{StatusTracker.getJobIdsForGroup()}} is implemented via a linear scan over a > HashMap rather than using an index. This might be an expensive operation if > there are many (e.g. thousands) of retained jobs. We can add a new index to > speed this up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
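The shape of the fix is roughly an inverted index from job group to job IDs, maintained as jobs start and are cleaned up (a sketch with simplified types, not the actual listener code):
{code}
// Sketch of an inverted index from job group to job ids (simplified, not the real listener code).
import scala.collection.mutable

class JobIndex {
  private val jobGroupToJobIds = mutable.HashMap.empty[String, mutable.HashSet[Int]]

  def onJobStart(jobId: Int, jobGroup: String): Unit =
    jobGroupToJobIds.getOrElseUpdate(jobGroup, mutable.HashSet.empty) += jobId

  def onJobCleaned(jobId: Int, jobGroup: String): Unit =
    jobGroupToJobIds.get(jobGroup).foreach { ids =>
      ids -= jobId
      if (ids.isEmpty) jobGroupToJobIds -= jobGroup   // keep the index from growing without bound
    }

  // O(group size) instead of a linear scan over every retained job.
  def getJobIdsForGroup(jobGroup: String): Seq[Int] =
    jobGroupToJobIds.get(jobGroup).map(_.toSeq).getOrElse(Seq.empty)
}
{code}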
[jira] [Updated] (SPARK-6685) Use DSYRK to compute AtA in ALS
[ https://issues.apache.org/jira/browse/SPARK-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6685: - Description: Now we use DSPR to compute AtA in ALS, which is a Level 2 BLAS routine. We should switch to DSYRK to use native BLAS to accelerate the computation. The factors should remain dense vectors. And we can pre-allocate a buffer to stack vectors and do Level 3. This requires some benchmark to demonstrate the improvement. (was: Now we use DSPR to compute AtA in ALS, which is a Level 2 BLAS routine. We should switch to DSYRK to use native BLAS to accelerate the computation. The factors should remain dense vectors. And we can pre-allocate a buffer to stack vectors and do Level 3.) > Use DSYRK to compute AtA in ALS > --- > > Key: SPARK-6685 > URL: https://issues.apache.org/jira/browse/SPARK-6685 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Xiangrui Meng >Priority: Minor > > Now we use DSPR to compute AtA in ALS, which is a Level 2 BLAS routine. We > should switch to DSYRK to use native BLAS to accelerate the computation. The > factors should remain dense vectors. And we can pre-allocate a buffer to > stack vectors and do Level 3. This requires some benchmark to demonstrate the > improvement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6685) Use DSYRK to compute AtA in ALS
Xiangrui Meng created SPARK-6685: Summary: Use DSYRK to compute AtA in ALS Key: SPARK-6685 URL: https://issues.apache.org/jira/browse/SPARK-6685 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Priority: Minor Now we use DSPR to compute AtA in ALS, which is a Level 2 BLAS routine. We should switch to DSYRK to use native BLAS to accelerate the computation. The factors should remain dense vectors. And we can pre-allocate a buffer to stack vectors and do Level 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
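To make the proposal concrete, here is a hedged plain-Scala sketch of the buffering idea: stack factor vectors into a pre-allocated buffer and fold the whole block into AtA in one pass, which is the shape of work a native Level 3 dsyrk call would perform. A real implementation would hand the buffer to native BLAS instead of running the inner loops; all names here are illustrative.
{code}
// Hedged sketch: accumulate AtA from blocks of stacked factor vectors.
class AtAAccumulator(rank: Int, blockSize: Int) {
  private val buf = new Array[Double](rank * blockSize)  // pre-allocated stacking buffer
  private var count = 0
  val ata = new Array[Double](rank * rank)               // dense AtA, row-major

  def add(v: Array[Double]): Unit = {
    require(v.length == rank, "factor vectors must be dense and of length `rank`")
    System.arraycopy(v, 0, buf, count * rank, rank)
    count += 1
    if (count == blockSize) flush()
  }

  // Block-level update: ata += A * A^T for the stacked block A (rank x count).
  // This is the spot where a native BLAS dsyrk call on `buf` would replace the loops.
  def flush(): Unit = {
    var c = 0
    while (c < count) {
      val offset = c * rank
      var i = 0
      while (i < rank) {
        val vi = buf(offset + i)
        var j = 0
        while (j < rank) {
          ata(i * rank + j) += vi * buf(offset + j)
          j += 1
        }
        i += 1
      }
      c += 1
    }
    count = 0
  }
}
{code}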
[jira] [Assigned] (SPARK-6671) Add status command for spark daemons
[ https://issues.apache.org/jira/browse/SPARK-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6671: --- Assignee: Apache Spark > Add status command for spark daemons > > > Key: SPARK-6671 > URL: https://issues.apache.org/jira/browse/SPARK-6671 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: PRADEEP CHANUMOLU >Assignee: Apache Spark > Labels: easyfix > Fix For: 1.4.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Currently, using the spark-daemon.sh script we can start and stop the Spark > daemons. But we cannot get the status of the daemons. It would be nice to > include a status command in the spark-daemon.sh script, through which we > can know whether a Spark daemon is alive or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6671) Add status command for spark daemons
[ https://issues.apache.org/jira/browse/SPARK-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6671: --- Assignee: (was: Apache Spark) > Add status command for spark daemons > > > Key: SPARK-6671 > URL: https://issues.apache.org/jira/browse/SPARK-6671 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: PRADEEP CHANUMOLU > Labels: easyfix > Fix For: 1.4.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Currently, using the spark-daemon.sh script we can start and stop the Spark > daemons. But we cannot get the status of the daemons. It would be nice to > include a status command in the spark-daemon.sh script, through which we > can know whether a Spark daemon is alive or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6479) Create off-heap block storage API (internal)
[ https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393331#comment-14393331 ] Reynold Xin commented on SPARK-6479: Looks good overall. I have some comments but those might be best in the form of code reviews. The high-level feedback is that, while you are at it, it would be great to document the failure/exception semantics. > Create off-heap block storage API (internal) > > > Key: SPARK-6479 > URL: https://issues.apache.org/jira/browse/SPARK-6479 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Reporter: Reynold Xin > Attachments: SPARK-6479.pdf, SparkOffheapsupportbyHDFS.pdf > > > Would be great to create APIs for off-heap block stores, rather than doing a > bunch of if statements everywhere. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6414) Spark driver failed with NPE on job cancelation
[ https://issues.apache.org/jira/browse/SPARK-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-6414: -- Assignee: Hung Lin > Spark driver failed with NPE on job cancelation > --- > > Key: SPARK-6414 > URL: https://issues.apache.org/jira/browse/SPARK-6414 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.2.1, 1.3.0 >Reporter: Yuri Makhno >Assignee: Hung Lin >Priority: Critical > Fix For: 1.2.2, 1.3.1, 1.4.0 > > > When a job group is cancelled, we scan through all jobs to determine which > are members of the group. This scan assumes that the job group property is > always set. If 'properties' is null in an active job, you get an NPE. > We just need to make sure we ignore ones where activeJob.properties is null. > We should also make sure it works if the particular property is missing. > https://github.com/apache/spark/blob/branch-1.3/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L678 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6414) Spark driver failed with NPE on job cancelation
[ https://issues.apache.org/jira/browse/SPARK-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-6414. --- Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 1.2.2 Fixed in 1.2.2, 1.3.1, and 1.4.0. > Spark driver failed with NPE on job cancelation > --- > > Key: SPARK-6414 > URL: https://issues.apache.org/jira/browse/SPARK-6414 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.2.1, 1.3.0 >Reporter: Yuri Makhno >Assignee: Hung Lin >Priority: Critical > Fix For: 1.2.2, 1.3.1, 1.4.0 > > > When a job group is cancelled, we scan through all jobs to determine which > are members of the group. This scan assumes that the job group property is > always set. If 'properties' is null in an active job, you get an NPE. > We just need to make sure we ignore ones where activeJob.properties is null. > We should also make sure it works if the particular property is missing. > https://github.com/apache/spark/blob/branch-1.3/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L678 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
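For the null handling the SPARK-6414 description calls for, here is a hedged sketch of a null-safe membership test. The property key "spark.jobGroup.id" and the helper name are assumptions made for illustration, not necessarily the exact code in the fix.
{code}
import java.util.Properties

// Hedged sketch: returns false when `properties` is null or the group key is missing,
// instead of throwing an NPE during job-group cancellation.
def jobBelongsToGroup(properties: Properties, groupId: String): Boolean =
  Option(properties)                                           // properties may be null
    .flatMap(p => Option(p.getProperty("spark.jobGroup.id")))  // the key may be absent
    .contains(groupId)
{code}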
[jira] [Updated] (SPARK-4346) YarnClientSchedulerBack.asyncMonitorApplication should be common with Client.monitorApplication
[ https://issues.apache.org/jira/browse/SPARK-4346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4346: - Affects Version/s: 1.0.0 > YarnClientSchedulerBack.asyncMonitorApplication should be common with > Client.monitorApplication > --- > > Key: SPARK-4346 > URL: https://issues.apache.org/jira/browse/SPARK-4346 > Project: Spark > Issue Type: Improvement > Components: Scheduler, YARN >Affects Versions: 1.0.0 >Reporter: Thomas Graves > > The YarnClientSchedulerBackend.asyncMonitorApplication routine should move > into ClientBase and be made common with monitorApplication. Make sure stop > is handled properly. > See discussion on https://github.com/apache/spark/pull/3143 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6578) Outbound channel in network library is not thread-safe, can lead to fetch failures
[ https://issues.apache.org/jira/browse/SPARK-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6578: --- Fix Version/s: 1.2.2 > Outbound channel in network library is not thread-safe, can lead to fetch > failures > -- > > Key: SPARK-6578 > URL: https://issues.apache.org/jira/browse/SPARK-6578 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 1.2.2, 1.3.1, 1.4.0 > > > There is a very narrow race in the outbound channel of the network library. > While netty guarantees that the inbound channel is thread-safe, the same is > not true for the outbound channel: multiple threads can be writing and > running the pipeline at the same time. > This leads to an issue with MessageEncoder and the optimization it performs > for zero-copy of file data: since a single RPC can be broken into multiple > buffers (for example, when replying to a chunk request), if you have > multiple threads writing these RPCs then they can be mixed up in the final > socket. That breaks framing and will cause the receiving side to not > understand the messages. > Patch coming up shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6578) Outbound channel in network library is not thread-safe, can lead to fetch failures
[ https://issues.apache.org/jira/browse/SPARK-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6578: --- Target Version/s: 1.2.2, 1.3.1, 1.4.0 (was: 1.3.1, 1.4.0) > Outbound channel in network library is not thread-safe, can lead to fetch > failures > -- > > Key: SPARK-6578 > URL: https://issues.apache.org/jira/browse/SPARK-6578 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 1.2.2, 1.3.1, 1.4.0 > > > There is a very narrow race in the outbound channel of the network library. > While netty guarantees that the inbound channel is thread-safe, the same is > not true for the outbound channel: multiple threads can be writing and > running the pipeline at the same time. > This leads to an issue with MessageEncoder and the optimization it performs > for zero-copy of file data: since a single RPC can be broken into multiple > buffers (for example, when replying to a chunk request), if you have > multiple threads writing these RPCs then they can be mixed up in the final > socket. That breaks framing and will cause the receiving side to not > understand the messages. > Patch coming up shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6443) Could not submit app in standalone cluster mode when HA is enabled
[ https://issues.apache.org/jira/browse/SPARK-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6443: - Affects Version/s: 1.0.0 > Could not submit app in standalone cluster mode when HA is enabled > -- > > Key: SPARK-6443 > URL: https://issues.apache.org/jira/browse/SPARK-6443 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.0.0 >Reporter: Tao Wang >Priority: Critical > > After digging some codes, I found user could not submit app in standalone > cluster mode when HA is enabled. But in client mode it can work. > Haven't try yet. But I will verify this and file a PR to resolve it if the > problem exists. > 3/23 update: > I started a HA cluster with zk, and tried to submit SparkPi example with > command: > ./spark-submit --class org.apache.spark.examples.SparkPi --master > spark://doggie153:7077,doggie159:7077 --deploy-mode cluster > ../lib/spark-examples-1.2.0-hadoop2.4.0.jar > and it failed with error message: > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: > spark://doggie153:7077,doggie159:7077 > akka.actor.ActorInitializationException: exception during creation > at akka.actor.ActorInitializationException$.apply(Actor.scala:164) > at akka.actor.ActorCell.create(ActorCell.scala:596) > at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) > at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) > at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: org.apache.spark.SparkException: Invalid master URL: > spark://doggie153:7077,doggie159:7077 > at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830) > at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42) > at akka.actor.Actor$class.aroundPreStart(Actor.scala:470) > at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35) > at akka.actor.ActorCell.create(ActorCell.scala:580) > ... 9 more > But in client mode it ended with correct result. So my guess is right. I will > fix it in the related PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6675) HiveContext setConf is not stable
[ https://issues.apache.org/jira/browse/SPARK-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6675: Labels: (was: patch) > HiveContext setConf is not stable > - > > Key: SPARK-6675 > URL: https://issues.apache.org/jira/browse/SPARK-6675 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: AWS ec2 xlarge2 cluster launched by spark's script >Reporter: Julien > > I find HiveContext.setConf does not work correctly. Here are some code > snippets showing the problem: > snippet 1: > {code} > import org.apache.spark.sql.hive.HiveContext > import org.apache.spark.{SparkConf, SparkContext} > object Main extends App { > val conf = new SparkConf() > .setAppName("context-test") > .setMaster("local[8]") > val sc = new SparkContext(conf) > val hc = new HiveContext(sc) > hc.setConf("spark.sql.shuffle.partitions", "10") > hc.setConf("hive.metastore.warehouse.dir", > "/home/spark/hive/warehouse_test") > hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println > hc.getAllConfs filter(_._1.contains("shuffle.partitions")) foreach println > } > {code} > Results: > (hive.metastore.warehouse.dir,/home/spark/hive/warehouse_test) > (spark.sql.shuffle.partitions,10) > snippet 2: > {code} > ... > hc.setConf("hive.metastore.warehouse.dir", > "/home/spark/hive/warehouse_test") > hc.setConf("spark.sql.shuffle.partitions", "10") > hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println > hc.getAllConfs filter(_._1.contains("shuffle.partitions")) foreach println > ... > {code} > Results: > (hive.metastore.warehouse.dir,/user/hive/warehouse) > (spark.sql.shuffle.partitions,10) > You can see that I just permuted the two setConf call, then that leads to two > different Hive configuration. > It seems that HiveContext can not set a new value on > "hive.metastore.warehouse.dir" key in one or the first "setConf" call. > You need another "setConf" call before changing > "hive.metastore.warehouse.dir". For example, set > "hive.metastore.warehouse.dir" twice and the snippet 1 > snippet 3: > {code} > ... > hc.setConf("hive.metastore.warehouse.dir", > "/home/spark/hive/warehouse_test") > hc.setConf("hive.metastore.warehouse.dir", > "/home/spark/hive/warehouse_test") > hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println > ... > {code} > Results: > (hive.metastore.warehouse.dir,/home/spark/hive/warehouse_test) > You can reproduce this if you move to the latest branch-1.3 (1.3.1-snapshot, > htag = 7d029cb1eb6f1df1bce1a3f5784fb7ce2f981a33) > I have also tested the released 1.3.0 (htag = > 4aaf48d46d13129f0f9bdafd771dd80fe568a7dc). It has the same problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6675) HiveContext setConf is not stable
[ https://issues.apache.org/jira/browse/SPARK-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6675: Target Version/s: 1.4.0 (was: 1.3.0) > HiveContext setConf is not stable > - > > Key: SPARK-6675 > URL: https://issues.apache.org/jira/browse/SPARK-6675 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: AWS ec2 xlarge2 cluster launched by spark's script >Reporter: Julien > > I find HiveContext.setConf does not work correctly. Here are some code > snippets showing the problem: > snippet 1: > {code} > import org.apache.spark.sql.hive.HiveContext > import org.apache.spark.{SparkConf, SparkContext} > object Main extends App { > val conf = new SparkConf() > .setAppName("context-test") > .setMaster("local[8]") > val sc = new SparkContext(conf) > val hc = new HiveContext(sc) > hc.setConf("spark.sql.shuffle.partitions", "10") > hc.setConf("hive.metastore.warehouse.dir", > "/home/spark/hive/warehouse_test") > hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println > hc.getAllConfs filter(_._1.contains("shuffle.partitions")) foreach println > } > {code} > Results: > (hive.metastore.warehouse.dir,/home/spark/hive/warehouse_test) > (spark.sql.shuffle.partitions,10) > snippet 2: > {code} > ... > hc.setConf("hive.metastore.warehouse.dir", > "/home/spark/hive/warehouse_test") > hc.setConf("spark.sql.shuffle.partitions", "10") > hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println > hc.getAllConfs filter(_._1.contains("shuffle.partitions")) foreach println > ... > {code} > Results: > (hive.metastore.warehouse.dir,/user/hive/warehouse) > (spark.sql.shuffle.partitions,10) > You can see that I just permuted the two setConf call, then that leads to two > different Hive configuration. > It seems that HiveContext can not set a new value on > "hive.metastore.warehouse.dir" key in one or the first "setConf" call. > You need another "setConf" call before changing > "hive.metastore.warehouse.dir". For example, set > "hive.metastore.warehouse.dir" twice and the snippet 1 > snippet 3: > {code} > ... > hc.setConf("hive.metastore.warehouse.dir", > "/home/spark/hive/warehouse_test") > hc.setConf("hive.metastore.warehouse.dir", > "/home/spark/hive/warehouse_test") > hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println > ... > {code} > Results: > (hive.metastore.warehouse.dir,/home/spark/hive/warehouse_test) > You can reproduce this if you move to the latest branch-1.3 (1.3.1-snapshot, > htag = 7d029cb1eb6f1df1bce1a3f5784fb7ce2f981a33) > I have also tested the released 1.3.0 (htag = > 4aaf48d46d13129f0f9bdafd771dd80fe568a7dc). It has the same problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6443) Support HA in standalone cluster modehen HA is enabled
[ https://issues.apache.org/jira/browse/SPARK-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6443: - Summary: Support HA in standalone cluster modehen HA is enabled (was: Could not submit app in standalone cluster mode when HA is enabled) > Support HA in standalone cluster modehen HA is enabled > -- > > Key: SPARK-6443 > URL: https://issues.apache.org/jira/browse/SPARK-6443 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.0.0 >Reporter: Tao Wang >Priority: Critical > > After digging some codes, I found user could not submit app in standalone > cluster mode when HA is enabled. But in client mode it can work. > Haven't try yet. But I will verify this and file a PR to resolve it if the > problem exists. > 3/23 update: > I started a HA cluster with zk, and tried to submit SparkPi example with > command: > ./spark-submit --class org.apache.spark.examples.SparkPi --master > spark://doggie153:7077,doggie159:7077 --deploy-mode cluster > ../lib/spark-examples-1.2.0-hadoop2.4.0.jar > and it failed with error message: > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: > spark://doggie153:7077,doggie159:7077 > akka.actor.ActorInitializationException: exception during creation > at akka.actor.ActorInitializationException$.apply(Actor.scala:164) > at akka.actor.ActorCell.create(ActorCell.scala:596) > at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) > at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) > at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: org.apache.spark.SparkException: Invalid master URL: > spark://doggie153:7077,doggie159:7077 > at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830) > at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42) > at akka.actor.Actor$class.aroundPreStart(Actor.scala:470) > at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35) > at akka.actor.ActorCell.create(ActorCell.scala:580) > ... 9 more > But in client mode it ended with correct result. So my guess is right. I will > fix it in the related PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6443) Support HA in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6443: - Priority: Major (was: Critical) > Support HA in standalone cluster mode > - > > Key: SPARK-6443 > URL: https://issues.apache.org/jira/browse/SPARK-6443 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.0.0 >Reporter: Tao Wang > > After digging some codes, I found user could not submit app in standalone > cluster mode when HA is enabled. But in client mode it can work. > Haven't try yet. But I will verify this and file a PR to resolve it if the > problem exists. > 3/23 update: > I started a HA cluster with zk, and tried to submit SparkPi example with > command: > ./spark-submit --class org.apache.spark.examples.SparkPi --master > spark://doggie153:7077,doggie159:7077 --deploy-mode cluster > ../lib/spark-examples-1.2.0-hadoop2.4.0.jar > and it failed with error message: > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: > spark://doggie153:7077,doggie159:7077 > akka.actor.ActorInitializationException: exception during creation > at akka.actor.ActorInitializationException$.apply(Actor.scala:164) > at akka.actor.ActorCell.create(ActorCell.scala:596) > at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) > at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) > at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: org.apache.spark.SparkException: Invalid master URL: > spark://doggie153:7077,doggie159:7077 > at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830) > at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42) > at akka.actor.Actor$class.aroundPreStart(Actor.scala:470) > at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35) > at akka.actor.ActorCell.create(ActorCell.scala:580) > ... 9 more > But in client mode it ended with correct result. So my guess is right. I will > fix it in the related PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6443) Support HA in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6443: - Summary: Support HA in standalone cluster mode (was: Support HA in standalone cluster modehen HA is enabled) > Support HA in standalone cluster mode > - > > Key: SPARK-6443 > URL: https://issues.apache.org/jira/browse/SPARK-6443 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.0.0 >Reporter: Tao Wang >Priority: Critical > > After digging some codes, I found user could not submit app in standalone > cluster mode when HA is enabled. But in client mode it can work. > Haven't try yet. But I will verify this and file a PR to resolve it if the > problem exists. > 3/23 update: > I started a HA cluster with zk, and tried to submit SparkPi example with > command: > ./spark-submit --class org.apache.spark.examples.SparkPi --master > spark://doggie153:7077,doggie159:7077 --deploy-mode cluster > ../lib/spark-examples-1.2.0-hadoop2.4.0.jar > and it failed with error message: > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: > spark://doggie153:7077,doggie159:7077 > akka.actor.ActorInitializationException: exception during creation > at akka.actor.ActorInitializationException$.apply(Actor.scala:164) > at akka.actor.ActorCell.create(ActorCell.scala:596) > at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) > at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) > at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: org.apache.spark.SparkException: Invalid master URL: > spark://doggie153:7077,doggie159:7077 > at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830) > at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42) > at akka.actor.Actor$class.aroundPreStart(Actor.scala:470) > at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35) > at akka.actor.ActorCell.create(ActorCell.scala:580) > ... 9 more > But in client mode it ended with correct result. So my guess is right. I will > fix it in the related PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6669) Lock metastore client in analyzeTable
[ https://issues.apache.org/jira/browse/SPARK-6669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6669: --- Assignee: Apache Spark (was: Michael Armbrust) > Lock metastore client in analyzeTable > - > > Key: SPARK-6669 > URL: https://issues.apache.org/jira/browse/SPARK-6669 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.3.0 >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6669) Lock metastore client in analyzeTable
[ https://issues.apache.org/jira/browse/SPARK-6669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393582#comment-14393582 ] Apache Spark commented on SPARK-6669: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/5333 > Lock metastore client in analyzeTable > - > > Key: SPARK-6669 > URL: https://issues.apache.org/jira/browse/SPARK-6669 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.3.0 >Reporter: Yin Huai >Assignee: Michael Armbrust >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6669) Lock metastore client in analyzeTable
[ https://issues.apache.org/jira/browse/SPARK-6669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6669: --- Assignee: Michael Armbrust (was: Apache Spark) > Lock metastore client in analyzeTable > - > > Key: SPARK-6669 > URL: https://issues.apache.org/jira/browse/SPARK-6669 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.3.0 >Reporter: Yin Huai >Assignee: Michael Armbrust >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
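The SPARK-6669 ticket carries no description, so the following is only a hedged sketch of the pattern its title names: funnel every call to the shared Hive metastore client through one lock so that concurrent ANALYZE TABLE commands cannot interleave on it. All names are hypothetical stand-ins, not HiveContext internals.
{code}
// Hedged sketch of serializing access to a shared metastore client.
object MetastoreLocking {
  private val metastoreLock = new Object

  def withMetastoreLock[A](body: => A): A = metastoreLock.synchronized(body)

  // Usage (illustrative only): withMetastoreLock { client.getTable("db", "tbl") }
}
{code}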
[jira] [Updated] (SPARK-6479) Create off-heap block storage API (internal)
[ https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-6479: -- Attachment: SPARK-6479OffheapAPIdesign.pdf Added the overall design and an example of failure-case handling. > Create off-heap block storage API (internal) > > > Key: SPARK-6479 > URL: https://issues.apache.org/jira/browse/SPARK-6479 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Reporter: Reynold Xin > Attachments: SPARK-6479.pdf, SPARK-6479OffheapAPIdesign.pdf, > SparkOffheapsupportbyHDFS.pdf > > > Would be great to create APIs for off-heap block stores, rather than doing a > bunch of if statements everywhere. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6443) Support HA in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6443: - Description: After digging some codes, I found user could not submit app in standalone cluster mode when HA is enabled. But in client mode it can work. Haven't try yet. But I will verify this and file a PR to resolve it if the problem exists. 3/23 update: I started a HA cluster with zk, and tried to submit SparkPi example with command: ./spark-submit --class org.apache.spark.examples.SparkPi --master spark://doggie153:7077,doggie159:7077 --deploy-mode cluster ../lib/spark-examples-1.2.0-hadoop2.4.0.jar and it failed with error message: Spark assembly has been built with Hive, including Datanucleus jars on classpath 15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: spark://doggie153:7077,doggie159:7077 akka.actor.ActorInitializationException: exception during creation at akka.actor.ActorInitializationException$.apply(Actor.scala:164) at akka.actor.ActorCell.create(ActorCell.scala:596) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.spark.SparkException: Invalid master URL: spark://doggie153:7077,doggie159:7077 at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830) at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42) at akka.actor.Actor$class.aroundPreStart(Actor.scala:470) at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35) at akka.actor.ActorCell.create(ActorCell.scala:580) ... 9 more But in client mode it ended with correct result. So my guess is right. I will fix it in the related PR. === EDIT by Andrew === >From a quick survey in the code I can confirm that client mode does support >this. [This >line|https://github.com/apache/spark/blob/e3202aa2e9bd140effbcf2a7a02b90cb077e760b/core/src/main/scala/org/apache/spark/SparkContext.scala#L2162] > splits the master URLs by comma and passes these URLs into the AppClient. In >standalone cluster mode, there is not equivalent logic to even split the >master URLs, whether in the old submission gateway (o.a.s.deploy.Client) or in >the new one (o.a.s.deploy.rest.StandaloneRestClient). Thus, this is an unsupported feature, not a bug! was: After digging some codes, I found user could not submit app in standalone cluster mode when HA is enabled. But in client mode it can work. Haven't try yet. But I will verify this and file a PR to resolve it if the problem exists. 
3/23 update: I started a HA cluster with zk, and tried to submit SparkPi example with command: ./spark-submit --class org.apache.spark.examples.SparkPi --master spark://doggie153:7077,doggie159:7077 --deploy-mode cluster ../lib/spark-examples-1.2.0-hadoop2.4.0.jar and it failed with error message: Spark assembly has been built with Hive, including Datanucleus jars on classpath 15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: spark://doggie153:7077,doggie159:7077 akka.actor.ActorInitializationException: exception during creation at akka.actor.ActorInitializationException$.apply(Actor.scala:164) at akka.actor.ActorCell.create(ActorCell.scala:596) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.spark.SparkException: Invalid master URL: spark://doggie153:7077,doggie159:7077 at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830) at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42) at akka.actor.Actor$class.ar
[jira] [Comment Edited] (SPARK-6479) Create off-heap block storage API (internal)
[ https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393590#comment-14393590 ] Zhan Zhang edited comment on SPARK-6479 at 4/2/15 10:23 PM: [~rxin] Thanks for the feedback. I updated the document with the basic design and example on handling failure case thrown by underlying implementation. Basically, the shim layer handles all the exceptions/failure cases and return expected failure value to upper layer, similar to below example. override def getBytes(blockId: BlockId): Option[ByteBuffer] = { try { offHeapManager.flatMap(_.getBytes(blockId)) } catch { case _ =>logError(s"error in getBytes from $blockId") None } } was (Author: zzhan): [~rxin] Thanks for the feedback. I updated the document with the basic design and example on handling failure case thrown by underlying implementation. Basically, the shim layer handles all the exceptions/failure cases and return expected failure value to upper layer, similar to below example. override def getBytes(blockId: BlockId): Option[ByteBuffer] = { try { offHeapManager.flatMap(_.getBytes(blockId)) } catch { case _ =>logError(s"error in getBytes from $blockId") None } } > Create off-heap block storage API (internal) > > > Key: SPARK-6479 > URL: https://issues.apache.org/jira/browse/SPARK-6479 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Reporter: Reynold Xin > Attachments: SPARK-6479.pdf, SPARK-6479OffheapAPIdesign.pdf, > SparkOffheapsupportbyHDFS.pdf > > > Would be great to create APIs for off-heap block stores, rather than doing a > bunch of if statements everywhere. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6479) Create off-heap block storage API (internal)
[ https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393590#comment-14393590 ] Zhan Zhang commented on SPARK-6479: --- [~rxin] Thanks for the feedback. I updated the document with the basic design and example on handling failure case thrown by underlying implementation. Basically, the shim layer handles all the exceptions/failure cases and return expected failure value to upper layer, similar to below example. override def getBytes(blockId: BlockId): Option[ByteBuffer] = { try { offHeapManager.flatMap(_.getBytes(blockId)) } catch { case _ =>logError(s"error in getBytes from $blockId") None } } > Create off-heap block storage API (internal) > > > Key: SPARK-6479 > URL: https://issues.apache.org/jira/browse/SPARK-6479 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Reporter: Reynold Xin > Attachments: SPARK-6479.pdf, SPARK-6479OffheapAPIdesign.pdf, > SparkOffheapsupportbyHDFS.pdf > > > Would be great to create APIs for off-heap block stores, rather than doing a > bunch of if statements everywhere. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6479) Create off-heap block storage API (internal)
[ https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393590#comment-14393590 ] Zhan Zhang edited comment on SPARK-6479 at 4/2/15 10:24 PM: [~rxin] Thanks for the feedback. I updated the document with the basic design and example on handling failure case thrown by underlying implementation. Basically, the shim layer handles all the exceptions/failure cases and return expected failure value to upper layer, similar to below example. override def getBytes(blockId: BlockId): Option[ByteBuffer] = { try { offHeapManager.flatMap(_.getBytes(blockId)) } catch { case _ =>logError(s"error in getBytes from $blockId") None } } was (Author: zzhan): [~rxin] Thanks for the feedback. I updated the document with the basic design and example on handling failure case thrown by underlying implementation. Basically, the shim layer handles all the exceptions/failure cases and return expected failure value to upper layer, similar to below example. override def getBytes(blockId: BlockId): Option[ByteBuffer] = { try { offHeapManager.flatMap(_.getBytes(blockId)) } catch { case _ =>logError(s"error in getBytes from $blockId") None } } > Create off-heap block storage API (internal) > > > Key: SPARK-6479 > URL: https://issues.apache.org/jira/browse/SPARK-6479 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Reporter: Reynold Xin > Attachments: SPARK-6479.pdf, SPARK-6479OffheapAPIdesign.pdf, > SparkOffheapsupportbyHDFS.pdf > > > Would be great to create APIs for off-heap block stores, rather than doing a > bunch of if statements everywhere. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
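Reformatted as a self-contained sketch, the shim-layer pattern quoted in the comments above looks like the following. `OffHeapManager` is a hypothetical stand-in, blockId is typed as String instead of Spark's BlockId, println stands in for logError, and the catch-all from the comment is narrowed to NonFatal; this is an illustration of the comment, not the code under review.
{code}
import java.nio.ByteBuffer
import scala.util.control.NonFatal

// Hypothetical stand-in for the underlying off-heap store named in the comment.
trait OffHeapManager { def getBytes(blockId: String): Option[ByteBuffer] }

class OffHeapShim(offHeapManager: Option[OffHeapManager]) {
  // The shim maps any failure in the underlying store to the "not found" value
  // expected by the caller, as the attached design document describes.
  def getBytes(blockId: String): Option[ByteBuffer] =
    try {
      offHeapManager.flatMap(_.getBytes(blockId))
    } catch {
      case NonFatal(e) =>
        println(s"error in getBytes from $blockId: ${e.getMessage}")  // stands in for logError
        None
    }
}
{code}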
[jira] [Updated] (SPARK-6443) Support HA in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6443: - Description: == EDIT by Andrew == >From a quick survey in the code I can confirm that client mode does support >this. [This >line|https://github.com/apache/spark/blob/e3202aa2e9bd140effbcf2a7a02b90cb077e760b/core/src/main/scala/org/apache/spark/SparkContext.scala#L2162] > splits the master URLs by comma and passes these URLs into the AppClient. In >standalone cluster mode, there is simply no equivalent logic to even split the >master URLs, whether in the old submission gateway (o.a.s.deploy.Client) or in >the new one (o.a.s.deploy.rest.StandaloneRestClient). Thus, this is an unsupported feature, not a bug! == Original description from Tao Wang == After digging some codes, I found user could not submit app in standalone cluster mode when HA is enabled. But in client mode it can work. Haven't try yet. But I will verify this and file a PR to resolve it if the problem exists. 3/23 update: I started a HA cluster with zk, and tried to submit SparkPi example with command: ./spark-submit --class org.apache.spark.examples.SparkPi --master spark://doggie153:7077,doggie159:7077 --deploy-mode cluster ../lib/spark-examples-1.2.0-hadoop2.4.0.jar and it failed with error message: Spark assembly has been built with Hive, including Datanucleus jars on classpath 15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: spark://doggie153:7077,doggie159:7077 akka.actor.ActorInitializationException: exception during creation at akka.actor.ActorInitializationException$.apply(Actor.scala:164) at akka.actor.ActorCell.create(ActorCell.scala:596) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.spark.SparkException: Invalid master URL: spark://doggie153:7077,doggie159:7077 at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830) at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42) at akka.actor.Actor$class.aroundPreStart(Actor.scala:470) at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35) at akka.actor.ActorCell.create(ActorCell.scala:580) ... 9 more But in client mode it ended with correct result. So my guess is right. I will fix it in the related PR. was: == EDIT by Andrew == >From a quick survey in the code I can confirm that client mode does support >this. [This >line|https://github.com/apache/spark/blob/e3202aa2e9bd140effbcf2a7a02b90cb077e760b/core/src/main/scala/org/apache/spark/SparkContext.scala#L2162] > splits the master URLs by comma and passes these URLs into the AppClient. In >standalone cluster mode, there is not equivalent logic to even split the >master URLs, whether in the old submission gateway (o.a.s.deploy.Client) or in >the new one (o.a.s.deploy.rest.StandaloneRestClient). Thus, this is an unsupported feature, not a bug! 
== Original description from Tao Wang == After digging some codes, I found user could not submit app in standalone cluster mode when HA is enabled. But in client mode it can work. Haven't try yet. But I will verify this and file a PR to resolve it if the problem exists. 3/23 update: I started a HA cluster with zk, and tried to submit SparkPi example with command: ./spark-submit --class org.apache.spark.examples.SparkPi --master spark://doggie153:7077,doggie159:7077 --deploy-mode cluster ../lib/spark-examples-1.2.0-hadoop2.4.0.jar and it failed with error message: Spark assembly has been built with Hive, including Datanucleus jars on classpath 15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: spark://doggie153:7077,doggie159:7077 akka.actor.ActorInitializationException: exception during creation at akka.actor.ActorInitializationException$.apply(Actor.scala:164) at akka.actor.ActorCell.create(ActorCell.scala:596) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala
[jira] [Updated] (SPARK-6443) Support HA in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6443: - Description: == EDIT by Andrew == >From a quick survey in the code I can confirm that client mode does support >this. [This >line|https://github.com/apache/spark/blob/e3202aa2e9bd140effbcf2a7a02b90cb077e760b/core/src/main/scala/org/apache/spark/SparkContext.scala#L2162] > splits the master URLs by comma and passes these URLs into the AppClient. In >standalone cluster mode, there is not equivalent logic to even split the >master URLs, whether in the old submission gateway (o.a.s.deploy.Client) or in >the new one (o.a.s.deploy.rest.StandaloneRestClient). Thus, this is an unsupported feature, not a bug! == Original description from Tao Wang == After digging some codes, I found user could not submit app in standalone cluster mode when HA is enabled. But in client mode it can work. Haven't try yet. But I will verify this and file a PR to resolve it if the problem exists. 3/23 update: I started a HA cluster with zk, and tried to submit SparkPi example with command: ./spark-submit --class org.apache.spark.examples.SparkPi --master spark://doggie153:7077,doggie159:7077 --deploy-mode cluster ../lib/spark-examples-1.2.0-hadoop2.4.0.jar and it failed with error message: Spark assembly has been built with Hive, including Datanucleus jars on classpath 15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: spark://doggie153:7077,doggie159:7077 akka.actor.ActorInitializationException: exception during creation at akka.actor.ActorInitializationException$.apply(Actor.scala:164) at akka.actor.ActorCell.create(ActorCell.scala:596) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.spark.SparkException: Invalid master URL: spark://doggie153:7077,doggie159:7077 at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830) at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42) at akka.actor.Actor$class.aroundPreStart(Actor.scala:470) at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35) at akka.actor.ActorCell.create(ActorCell.scala:580) ... 9 more But in client mode it ended with correct result. So my guess is right. I will fix it in the related PR. was: After digging some codes, I found user could not submit app in standalone cluster mode when HA is enabled. But in client mode it can work. Haven't try yet. But I will verify this and file a PR to resolve it if the problem exists. 
3/23 update: I started a HA cluster with zk, and tried to submit SparkPi example with command: ./spark-submit --class org.apache.spark.examples.SparkPi --master spark://doggie153:7077,doggie159:7077 --deploy-mode cluster ../lib/spark-examples-1.2.0-hadoop2.4.0.jar and it failed with error message: Spark assembly has been built with Hive, including Datanucleus jars on classpath 15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: spark://doggie153:7077,doggie159:7077 akka.actor.ActorInitializationException: exception during creation at akka.actor.ActorInitializationException$.apply(Actor.scala:164) at akka.actor.ActorCell.create(ActorCell.scala:596) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.spark.SparkException: Invalid master URL: spark://doggie153:7077,doggie159:7077 at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830) at org.apache.spark.deploy.Cl
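As a concrete illustration of the comma splitting described in the edited SPARK-6443 description (hedged: this is not the code behind the linked SparkContext line, and the function name is made up), client mode expands an HA master URL into one URL per master, which is exactly the step missing from the standalone cluster mode gateways:
{code}
// Hedged sketch: expand "spark://doggie153:7077,doggie159:7077" into one URL per master.
def expandMasterUrls(master: String): Array[String] = {
  require(master.startsWith("spark://"), s"Invalid master URL: $master")
  master.stripPrefix("spark://")
    .split(",")
    .map(hostPort => "spark://" + hostPort)
}

// expandMasterUrls("spark://doggie153:7077,doggie159:7077")
//   == Array("spark://doggie153:7077", "spark://doggie159:7077")
{code}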
[jira] [Assigned] (SPARK-2669) Hadoop configuration is not localised when submitting job in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2669: --- Assignee: (was: Apache Spark) > Hadoop configuration is not localised when submitting job in yarn-cluster mode > -- > > Key: SPARK-2669 > URL: https://issues.apache.org/jira/browse/SPARK-2669 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.0.0 >Reporter: Maxim Ivanov > > I'd like to propose a fix for a problem where the Hadoop configuration is not > localized when a job is submitted in yarn-cluster mode. Here is a description > from github pull request https://github.com/apache/spark/pull/1574 > This patch fixes a problem where, when the Spark driver is run in a container > managed by the YARN ResourceManager, it inherits configuration from the > NodeManager process, which can be different from the Hadoop > configuration present on the client (submitting machine). The problem is > most visible when the fs.defaultFS property differs between these two. > Hadoop MR solves it by serializing the client's Hadoop configuration into > job.xml in the application staging directory and then making the Application > Master use it. That guarantees that, regardless of the execution nodes' > configurations, all application containers use the same config, identical to > the one on the client side. > This patch uses a similar approach. YARN ClientBase serializes the > configuration and adds it to ClientDistributedCacheManager under the > "job.xml" link name. ClientDistributedCacheManager then utilizes the > Hadoop localizer to deliver it to whatever container is started by this > application, including the one running the Spark driver. > YARN ClientBase also adds a "SPARK_LOCAL_HADOOPCONF" env variable to the AM > container request, which is then used by SparkHadoopUtil.newConfiguration > to trigger the new behavior in which the machine-wide Hadoop configuration is merged > with the application-specific job.xml (exactly how it is done in Hadoop MR). > SparkContext then follows the same approach, adding the > SPARK_LOCAL_HADOOPCONF env to all spawned containers to make them use the > client-side Hadoop configuration. > Also, all the references to "new Configuration()" which might be executed > on the YARN cluster side are changed to use SparkHadoopUtil.get.conf > Please note that it fixes only core Spark, the part which I am > comfortable testing and verifying. I didn't descend into the > streaming/shark directories, so things might need to be changed there too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-2669) Hadoop configuration is not localised when submitting job in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2669: --- Assignee: Apache Spark > Hadoop configuration is not localised when submitting job in yarn-cluster mode > -- > > Key: SPARK-2669 > URL: https://issues.apache.org/jira/browse/SPARK-2669 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.0.0 >Reporter: Maxim Ivanov >Assignee: Apache Spark > > I'd like to propose a fix for a problem where the Hadoop configuration is not > localized when a job is submitted in yarn-cluster mode. Here is a description > from github pull request https://github.com/apache/spark/pull/1574 > This patch fixes a problem where, when the Spark driver is run in a container > managed by the YARN ResourceManager, it inherits configuration from the > NodeManager process, which can be different from the Hadoop > configuration present on the client (submitting machine). The problem is > most visible when the fs.defaultFS property differs between these two. > Hadoop MR solves it by serializing the client's Hadoop configuration into > job.xml in the application staging directory and then making the Application > Master use it. That guarantees that, regardless of the execution nodes' > configurations, all application containers use the same config, identical to > the one on the client side. > This patch uses a similar approach. YARN ClientBase serializes the > configuration and adds it to ClientDistributedCacheManager under the > "job.xml" link name. ClientDistributedCacheManager then utilizes the > Hadoop localizer to deliver it to whatever container is started by this > application, including the one running the Spark driver. > YARN ClientBase also adds a "SPARK_LOCAL_HADOOPCONF" env variable to the AM > container request, which is then used by SparkHadoopUtil.newConfiguration > to trigger the new behavior in which the machine-wide Hadoop configuration is merged > with the application-specific job.xml (exactly how it is done in Hadoop MR). > SparkContext then follows the same approach, adding the > SPARK_LOCAL_HADOOPCONF env to all spawned containers to make them use the > client-side Hadoop configuration. > Also, all the references to "new Configuration()" which might be executed > on the YARN cluster side are changed to use SparkHadoopUtil.get.conf > Please note that it fixes only core Spark, the part which I am > comfortable testing and verifying. I didn't descend into the > streaming/shark directories, so things might need to be changed there too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
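The SPARK-2669 description proposes serializing the client's Hadoop configuration and shipping it as job.xml through the distributed cache. Below is a hedged sketch of the serialization half only, using Hadoop's standard Configuration.writeXml; the distributed-cache registration and everything else about the actual patch are not shown, and the helper name is illustrative.
{code}
import java.io.{File, FileOutputStream}
import org.apache.hadoop.conf.Configuration

// Hedged sketch: dump the client-side Hadoop configuration into a job.xml file that
// the YARN client could then register with the distributed cache.
def writeClientHadoopConf(conf: Configuration, stagingDir: File): File = {
  val jobXml = new File(stagingDir, "job.xml")
  val out = new FileOutputStream(jobXml)
  try {
    conf.writeXml(out)  // serialize all current key/value pairs as Hadoop XML
  } finally {
    out.close()
  }
  jobXml
}
{code}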