[jira] [Assigned] (SPARK-5972) Cache residuals for GradientBoostedTrees during training

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5972:
---

Assignee: (was: Apache Spark)

> Cache residuals for GradientBoostedTrees during training
> 
>
> Key: SPARK-5972
> URL: https://issues.apache.org/jira/browse/SPARK-5972
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In gradient boosting, the current model's prediction is re-computed for each 
> training instance on every iteration.  The current residual (cumulative 
> prediction of previously trained trees in the ensemble) should be cached.  
> That could reduce both computation (only computing the prediction of the most 
> recently trained tree) and communication (only sending the most recently 
> trained tree to the workers).
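As an illustration of the proposal, a minimal, self-contained Scala sketch of the caching idea follows. It is not Spark's actual GradientBoostedTrees code: Tree and trainTree are simplified stand-ins for MLlib's decision-tree machinery, and the data is a local Seq rather than an RDD. The point is that the cumulative prediction is kept per instance and updated with only the newest tree, instead of re-evaluating the whole ensemble on every iteration.

{code}
object CachedResidualSketch {
  // Stand-in for a trained regression tree (MLlib's DecisionTreeModel in reality).
  trait Tree {
    def predict(features: Array[Double]): Double
  }

  // Hypothetical base learner: fits the mean of the current pseudo-residuals,
  // which is enough to illustrate the caching idea.
  def trainTree(residuals: Seq[Double]): Tree = new Tree {
    private val mean = residuals.sum / residuals.size
    def predict(features: Array[Double]): Double = mean
  }

  // data: (features, label) pairs; squared-error boosting.
  def boost(data: Seq[(Array[Double], Double)],
            numIterations: Int,
            learningRate: Double): Seq[Tree] = {
    // Cached cumulative prediction of the ensemble so far, one entry per instance.
    var cumulative = Array.fill(data.size)(0.0)
    var trees = Vector.empty[Tree]
    for (_ <- 0 until numIterations) {
      // Pseudo-residuals come from the cache, not from re-running every tree.
      val residuals = data.zip(cumulative).map { case ((_, label), pred) => label - pred }
      val tree = trainTree(residuals)
      trees = trees :+ tree
      // Update the cache with only the newest tree's contribution.
      cumulative = data.zip(cumulative).map { case ((features, _), pred) =>
        pred + learningRate * tree.predict(features)
      }.toArray
    }
    trees
  }
}
{code}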






[jira] [Assigned] (SPARK-5972) Cache residuals for GradientBoostedTrees during training

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5972:
---

Assignee: Apache Spark

> Cache residuals for GradientBoostedTrees during training
> 
>
> Key: SPARK-5972
> URL: https://issues.apache.org/jira/browse/SPARK-5972
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> In gradient boosting, the current model's prediction is re-computed for each 
> training instance on every iteration.  The current residual (cumulative 
> prediction of previously trained trees in the ensemble) should be cached.  
> That could reduce both computation (only computing the prediction of the most 
> recently trained tree) and communication (only sending the most recently 
> trained tree to the workers).






[jira] [Created] (SPARK-6676) Add hadoop 2.4+ for profiles in POM.xml

2015-04-02 Thread Zhang, Liye (JIRA)
Zhang, Liye created SPARK-6676:
--

 Summary: Add hadoop 2.4+ for profiles in POM.xml
 Key: SPARK-6676
 URL: https://issues.apache.org/jira/browse/SPARK-6676
 Project: Spark
  Issue Type: Improvement
  Components: Build, Tests
Affects Versions: 1.3.0
Reporter: Zhang, Liye
Priority: Minor


support *-Phadoop-2.5* and *-Phadoop-2.6* when building and testing Spark






[jira] [Commented] (SPARK-6676) Add hadoop 2.4+ for profiles in POM.xml

2015-04-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392466#comment-14392466
 ] 

Apache Spark commented on SPARK-6676:
-

User 'liyezhang556520' has created a pull request for this issue:
https://github.com/apache/spark/pull/5331

> Add hadoop 2.4+ for profiles in POM.xml
> ---
>
> Key: SPARK-6676
> URL: https://issues.apache.org/jira/browse/SPARK-6676
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 1.3.0
>Reporter: Zhang, Liye
>Priority: Minor
>
> support *-Phadoop-2.5* and *-Phadoop-2.6* when building and testing Spark






[jira] [Assigned] (SPARK-6676) Add hadoop 2.4+ for profiles in POM.xml

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6676:
---

Assignee: Apache Spark

> Add hadoop 2.4+ for profiles in POM.xml
> ---
>
> Key: SPARK-6676
> URL: https://issues.apache.org/jira/browse/SPARK-6676
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 1.3.0
>Reporter: Zhang, Liye
>Assignee: Apache Spark
>Priority: Minor
>
> support *-Phadoop-2.5* and *-Phadoop-2.6* when building and testing Spark






[jira] [Assigned] (SPARK-6676) Add hadoop 2.4+ for profiles in POM.xml

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6676:
---

Assignee: (was: Apache Spark)

> Add hadoop 2.4+ for profiles in POM.xml
> ---
>
> Key: SPARK-6676
> URL: https://issues.apache.org/jira/browse/SPARK-6676
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 1.3.0
>Reporter: Zhang, Liye
>Priority: Minor
>
> support *-Phadoop-2.5* and *-Phadoop-2.6* when building and testing Spark






[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-04-02 Thread Sudharma Puranik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392474#comment-14392474
 ] 

Sudharma Puranik commented on SPARK-2243:
-

[~jahubba]: Running on separate JVMs is not a workaround; it is about sharing the 
SparkContext with an understanding of your processors and memory. Running multiple 
SparkContexts in one JVM is an entirely different matter, about which the developers 
remain skeptical, for whatever reason. But to set things straight, it is not a 
workaround.

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(Delega

[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-04-02 Thread Sudharma Puranik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392475#comment-14392475
 ] 

Sudharma Puranik commented on SPARK-2243:
-

[~jahubba]: Running on separate JVMs is not a workaround; it is about sharing the 
SparkContext with an understanding of your processors and memory. Running multiple 
SparkContexts in one JVM is an entirely different matter, about which the developers 
remain skeptical, for whatever reason. But to set things straight, it is not a 
workaround.

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(Delega

[jira] [Issue Comment Deleted] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-04-02 Thread Sudharma Puranik (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sudharma Puranik updated SPARK-2243:

Comment: was deleted

(was: [~jahubba]: Running on separate JVMs is not a workaround; it is about sharing 
the SparkContext with an understanding of your processors and memory. Running 
multiple SparkContexts in one JVM is an entirely different matter, about which the 
developers remain skeptical, for whatever reason. But to set things straight, it is 
not a workaround.)

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.ja

[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-04-02 Thread Sudharma Puranik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392519#comment-14392519
 ] 

Sudharma Puranik commented on SPARK-2243:
-

[~sowen]: My reply was for Jason, where he mentioned a workaround.

And yes, I mean running logically distinct apps with logically distinct 
configurations, differing only in terms of cores. You are right about running 
separate processes. Then again, in Spark a process is essentially just another 
SparkContext, which again has boundaries on cores and memory. :) I think we are in 
agreement about the implications of multiple SparkContexts in the same JVM vs. 
sharing SparkContexts across JVMs.

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorIm

[jira] [Resolved] (SPARK-6676) Add hadoop 2.4+ for profiles in POM.xml

2015-04-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6676.
--
Resolution: Won't Fix

This is already supported by the hadoop-2.4 profile, which means "2.4+".
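For example, building and testing against Hadoop 2.6 already works by combining the 
existing hadoop-2.4 profile with an explicit hadoop.version, roughly 
{{mvn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests clean package}} (the exact 
version number here is illustrative).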

> Add hadoop 2.4+ for profiles in POM.xml
> ---
>
> Key: SPARK-6676
> URL: https://issues.apache.org/jira/browse/SPARK-6676
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 1.3.0
>Reporter: Zhang, Liye
>Priority: Minor
>
> support *-Phadoop-2.5* and *-Phadoop-2.6* when building and testing Spark






[jira] [Created] (SPARK-6677) pyspark.sql nondeterministic issue with row fields

2015-04-02 Thread Stefano Parmesan (JIRA)
Stefano Parmesan created SPARK-6677:
---

 Summary: pyspark.sql nondeterministic issue with row fields
 Key: SPARK-6677
 URL: https://issues.apache.org/jira/browse/SPARK-6677
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.0
 Environment: spark version: spark-1.3.0-bin-hadoop2.4
python version: Python 2.7.6
operating system: MacOS, x86_64 x86_64 x86_64 GNU/Linux
Reporter: Stefano Parmesan


The following issue happens only when running pyspark in the Python interpreter; it 
works correctly with spark-submit.

Reading two JSON files containing objects with different structures sometimes leads 
to the definition of wrong Rows, where the fields of one file are used for the 
other one.

I was able to write sample code that reproduces this issue about one out of three 
times; the code snippet is available at the following link, together with some 
(very simple) data samples:

https://gist.github.com/armisael/e08bb4567d0a11efe2db






[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-04-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392499#comment-14392499
 ] 

Sean Owen commented on SPARK-2243:
--

The best thing to do is state your use case. Not a workaround for what purpose? 
Is this just because you think it would save some memory? Is it about sharing 
some non-Spark resource in memory? Separating configuration?

If you want to run logically distinct apps with logically distinct 
configuration, IMHO you should be running separate processes as a matter of 
good design. Yep, you'll end up spending more memory.

Sharing non-Spark state makes more sense. I think you can and even should 
design to not assume these separate jobs share non-persistent state outside of 
Spark, but that is a legitimate question. But certainly you _can_ do this. I 
don't think this makes whole classes of usage impossible or even difficult. 
Which is why I wonder what has "no workaround".

Note that sharing resources in the sense of sharing RDDs actually *requires* 
sharing a {{SparkContext}}. I think cooperating Spark tasks are much more 
likely to want to cooperate this way?

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss

[jira] [Resolved] (SPARK-6672) createDataFrame from RDD[Row] with UDTs cannot be saved

2015-04-02 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6672.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5329
[https://github.com/apache/spark/pull/5329]

> createDataFrame from RDD[Row] with UDTs cannot be saved
> ---
>
> Key: SPARK-6672
> URL: https://issues.apache.org/jira/browse/SPARK-6672
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, SQL
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.4.0
>
>
> Reported by Jaonary 
> (https://www.mail-archive.com/user@spark.apache.org/msg25218.html):
> {code}
> import org.apache.spark.mllib.linalg._
> import org.apache.spark.mllib.regression._
> val df0 = sqlContext.createDataFrame(Seq(LabeledPoint(1.0, Vectors.dense(2.0, 
> 3.0))))
> df0.save("/tmp/df0") // works
> val df1 = sqlContext.createDataFrame(df0.rdd, df0.schema)
> df1.save("/tmp/df1") // error
> {code}
> throws
> {code}
> 15/04/01 23:24:16 INFO DAGScheduler: Job 3 failed: runJob at 
> newParquet.scala:686, took 0.288304 s
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in 
> stage 3.0 failed 1 times, most recent failure: Lost task 3.0 in stage 3.0 
> (TID 15, localhost): java.lang.ClassCastException: 
> org.apache.spark.mllib.linalg.DenseVector cannot be cast to 
> org.apache.spark.sql.Row
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:191)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:182)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171)
>   at 
> org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134)
>   at 
> parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
>   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
>   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:668)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:686)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:686)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:64)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:212)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> {code}






[jira] [Comment Edited] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-04-02 Thread Sudharma Puranik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392519#comment-14392519
 ] 

Sudharma Puranik edited comment on SPARK-2243 at 4/2/15 10:36 AM:
--

[~sowen]: My reply was for Jason, where he mentioned a workaround.

And yes, I mean running logically distinct apps with logically distinct 
configurations, differing only in terms of cores. You are right about running 
separate processes. Then again, in Spark a process is essentially just another 
{{SparkContext}}, which again has boundaries on cores and memory. :) I think we are 
in agreement about the implications of multiple {{SparkContext}}s in the same JVM 
vs. sharing {{SparkContext}}s across JVMs.


was (Author: sudharma.pura...@gmail.com):
[~sowen]: My reply was for Jason, where he mentioned a workaround.

And yes, I mean running logically distinct apps with logically distinct 
configurations, differing only in terms of cores. You are right about running 
separate processes. Then again, in Spark a process is essentially just another 
SparkContext, which again has boundaries on cores and memory. :) I think we are in 
agreement about the implications of multiple SparkContexts in the same JVM vs. 
sharing SparkContexts across JVMs.

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6

[jira] [Created] (SPARK-6678) select count(DISTINCT C_UID) from parquetdir may be can optimize

2015-04-02 Thread Littlestar (JIRA)
Littlestar created SPARK-6678:
-

 Summary: select count(DISTINCT C_UID) from parquetdir may be can 
optimize
 Key: SPARK-6678
 URL: https://issues.apache.org/jira/browse/SPARK-6678
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Littlestar
Priority: Minor


2.2 TB of Parquet files (5000 files total, 100 billion records, 2 billion unique 
C_UID values).

When I run the following SQL, the RDD.collect step may be very slow:
select count(DISTINCT C_UID) from parquetdir

select count(DISTINCT C_UID) from parquetdir
collect at SparkPlan.scala:83 +details
org.apache.spark.rdd.RDD.collect(RDD.scala:813)
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83)
org.apache.spark.sql.DataFrame.collect(DataFrame.scala:815)
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:178)
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAs(Subject.java:415)
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
com.sun.proxy.$Proxy23.executeStatementAsync(Unknown Source)
org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
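As an aside, if an exact count is not required, an approximate distinct count can 
sidestep most of this work. A hedged sketch using the 1.3 DataFrame API, with the 
directory and column name taken from the report (sqlContext is assumed to be an 
existing SQLContext or HiveContext):

{code}
import org.apache.spark.sql.functions.{approxCountDistinct, col}

val df = sqlContext.parquetFile("parquetdir")
// HyperLogLog-based approximate distinct count avoids a full exact aggregation.
df.agg(approxCountDistinct(col("C_UID"))).show()
{code}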










[jira] [Updated] (SPARK-6679) java.lang.ClassNotFoundException on Mesos fine grained mode and input replication

2015-04-02 Thread Ondřej Smola (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ondřej Smola updated SPARK-6679:

Description: 
Spark Streaming 1.3.0, Mesos 0.21.1 - Only when using fine grained mode and 
receiver input replication (StorageLevel.MEMORY_ONLY_2, 
StorageLevel.MEMORY_AND_DISK_2). When using coarse grained mode it works. When 
not using replication (StorageLevel.MEMORY_ONLY ...) it works. Error:

15/03/26 14:50:00 ERROR TransportRequestHandler: Error while invoking 
RpcHandler#receive() on RPC id 7178767328921933569
java.lang.ClassNotFoundException: org/apache/spark/storage/StorageLevel
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:344)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:65)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:88)
at 
org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:65)
at 
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:124)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:97)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:91)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:44)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)

  was:
Spark Streaming 1.3.0, Mesos 0.21.1 - When using fine grained mode and receiver 
input replication ( StorageLevel.MEMORY_ONLY_2, StorageLevel.MEMORY_AND_DISK_2) 
following error is thrown. When using coarse grained mode it works. When not 
using replication (StorageLevel.MEMORY_ONLY ...) it works. Error:

15/03/26 14:50:00 ERROR TransportRequestHandler: Error while invoking 
RpcHandler#receive() on RPC id 7178767328921933569
java.lang.ClassNotFoundException: org/apache/spark/storage/StorageLevel
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:344)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:65)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerial

[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-04-02 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392616#comment-14392616
 ] 

sam commented on SPARK-2243:


[~srowen] 

The real issue here, and the use case, is being able to change configuration during 
execution without serialization concerns. It's common for different steps of a job 
to require different amounts of various caches: for example, part one caches an RDD 
in memory to reduce the cost of iterating over it, and part two takes the result 
and does some kind of big shuffle. The second part needs a large shuffle memory 
fraction, while the first part might need a large cache memory fraction. In fact I 
have jobs where I need 0.8 for a big shuffle, 0.8 for some caching, and then both 
set to 0 for a part that requires a ton of heap.

Currently the only workaround is to write out the results of each part to disk 
(causing serialization faff) and wrap the application in a bash script that runs 
it repeatedly with different configuration and parameters.

Creating new SparkContexts would not necessarily solve this since, as pointed out, 
one would need to be able to share RDDs across contexts.

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss 

[jira] [Assigned] (SPARK-4449) specify port range in spark

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4449:
---

Assignee: Apache Spark

> specify port range in spark
> ---
>
> Key: SPARK-4449
> URL: https://issues.apache.org/jira/browse/SPARK-4449
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Fei Wang
>Assignee: Apache Spark
>Priority: Minor
>
>  In some cases, we need to specify the port range used in Spark.
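As context for this request, a hedged sketch of what can be configured today: 
individual service ports can be pinned, and spark.port.maxRetries bounds how many 
successive ports are tried on conflict, which together approximate a port range. 
The port numbers below are illustrative:

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("port-range-sketch")
  .set("spark.driver.port", "40000")        // base port for the driver
  .set("spark.blockManager.port", "40100")  // base port for block managers
  .set("spark.port.maxRetries", "32")       // try up to 32 successive ports on conflict
val sc = new SparkContext(conf)
{code}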






[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2015-04-02 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392646#comment-14392646
 ] 

Chris Fregly commented on SPARK-6407:
-

from [~mengxr] 

"The online update should be implemented with GraphX or indexedrdd, 
which may take some time. There is no open-source solution.

Try doing a survey on existing algorithms for online matrix 
factorization updates."

> Streaming ALS for Collaborative Filtering
> -
>
> Key: SPARK-6407
> URL: https://issues.apache.org/jira/browse/SPARK-6407
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Felix Cheung
>Priority: Minor
>
> Like MLlib's ALS implementation for recommendation, but applied to streaming.
> Similar to streaming linear regression and logistic regression, could we apply 
> gradient updates to batches of data and reuse the existing MLlib implementation?






[jira] [Created] (SPARK-6679) java.lang.ClassNotFoundException on Mesos fine grained mode and input replication

2015-04-02 Thread Ondřej Smola (JIRA)
Ondřej Smola created SPARK-6679:
---

 Summary: java.lang.ClassNotFoundException on Mesos fine grained 
mode and input replication
 Key: SPARK-6679
 URL: https://issues.apache.org/jira/browse/SPARK-6679
 Project: Spark
  Issue Type: Bug
  Components: Mesos, Streaming
Affects Versions: 1.3.0
Reporter: Ondřej Smola


Spark Streaming 1.3.0, Mesos 0.21.1 - When using fine grained mode and receiver 
input replication ( StorageLevel.MEMORY_ONLY_2, StorageLevel.MEMORY_AND_DISK_2) 
following error is thrown. When using coarse grained mode it works. When not 
using replication (StorageLevel.MEMORY_ONLY ...) it works. Error:

15/03/26 14:50:00 ERROR TransportRequestHandler: Error while invoking 
RpcHandler#receive() on RPC id 7178767328921933569
java.lang.ClassNotFoundException: org/apache/spark/storage/StorageLevel
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:344)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:65)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:88)
at 
org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:65)
at 
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:124)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:97)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:91)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:44)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
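
(For reference, a minimal sketch of a receiver configured with a replicated storage 
level, which is the setup that triggers the failing path above. The master URL, host 
and port are placeholders, not taken from the report.)

{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("replicated-receiver")
  .setMaster("mesos://mesos-master:5050")   // placeholder Mesos master

val ssc = new StreamingContext(conf, Seconds(10))

// Requesting a *_2 storage level makes the receiver replicate each block,
// which is the path that fails under fine-grained mode.
val lines = ssc.socketTextStream("source-host", 9999, StorageLevel.MEMORY_AND_DISK_2)
lines.count().print()

ssc.start()
ssc.awaitTermination()
{code}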



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4449) specify port range in spark

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4449:
---

Assignee: (was: Apache Spark)

> specify port range in spark
> ---
>
> Key: SPARK-4449
> URL: https://issues.apache.org/jira/browse/SPARK-4449
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Fei Wang
>Priority: Minor
>
>  In some cases, we need to specify the port range used in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-04-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392721#comment-14392721
 ] 

Sean Owen commented on SPARK-2243:
--

[~sams] in this particular case, can you simply set both of these limits to the 
maximum that you need them to be, like 0.8 in both cases? It doesn't mean you 
can suddenly use 160% of memory, of course. These are just upper limits on 
what the cache/shuffle will consume, not _also_ inverse limits on how 
much other stuff can use. That is, an 80% cache limit doesn't stop tasks or the 
shuffle from using 100% of all memory.

Yes, that means you have to manage the steps of your job carefully so that the 
memory hungry steps don't overlap. But you're already doing that.

Of course you could also throw more resource / memory at this too, but that 
costs £, but then again so does your time. If you're able to use the YARN 
dynamic executor scaling, maybe that's mitigated to some degree (although I 
think it triggers based on active tasks, not memory usage; it would save 
resource when little of anything is running).

Back to the general point: it seems like one motivation is reconfiguration, 
which is not quite the same as making multiple contexts; it should be easier.

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN s

[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2015-04-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392742#comment-14392742
 ] 

Sean Owen commented on SPARK-6407:
--

ALS doesn't use gradient descent, at least not in the sense that these 
linear models do, so you couldn't reuse that implementation. I am accustomed to 
fold-in for approximate streaming updates to an ALS model, but yes, it does kind 
of need to mutate some RDD-based data structure efficiently, like an 
IndexedRDD. Although the idea is simple, I also don't know of good theoretical 
approaches and have just made up reasonable heuristics in the past.
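
(For reference, a hedged sketch of one such heuristic, taking "fold-in" to mean 
solving a small ridge regression for a single user against fixed item factors. 
Breeze is used purely for the linear algebra; the function name and lambda are 
illustrative, not an existing API.)

{code}
import breeze.linalg.{DenseMatrix, DenseVector}

// Given the factors of the items a user just interacted with (rows = items,
// cols = rank) and the observed ratings, solve the regularized normal equations
// for that user's factor vector instead of re-running ALS from scratch.
def foldInUser(itemFactors: DenseMatrix[Double],
               ratings: DenseVector[Double],
               lambda: Double): DenseVector[Double] = {
  val rank = itemFactors.cols
  val gram = itemFactors.t * itemFactors + DenseMatrix.eye[Double](rank) * lambda
  gram \ (itemFactors.t * ratings)   // solve (Y'Y + lambda*I) x = Y'r
}
{code}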

> Streaming ALS for Collaborative Filtering
> -
>
> Key: SPARK-6407
> URL: https://issues.apache.org/jira/browse/SPARK-6407
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Felix Cheung
>Priority: Minor
>
> Like MLlib's ALS implementation for recommendation, but applied to streaming.
> Similar to streaming linear regression and logistic regression, could we apply 
> gradient updates to batches of data and reuse the existing MLlib implementation?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6680) Be able to specify IP for spark-shell (spark driver), blocker for Docker integration

2015-04-02 Thread Egor Pakhomov (JIRA)
Egor Pakhomov created SPARK-6680:


 Summary: Be able to specify IP for spark-shell (spark driver), 
blocker for Docker integration
 Key: SPARK-6680
 URL: https://issues.apache.org/jira/browse/SPARK-6680
 Project: Spark
  Issue Type: New Feature
  Components: Deploy
Affects Versions: 1.3.0
 Environment: Docker.
Reporter: Egor Pakhomov
Priority: Blocker


Suppose I have 3 Docker containers - spark_master, spark_worker and 
spark_shell. In Docker, the public IP of a container gets an alias like 
"fgsdfg454534" that is only visible inside that container. When Spark uses it for 
communication, the other containers receive this alias and don't know what to do 
with it. That's why I used SPARK_LOCAL_IP for the master and worker. But it doesn't 
work for the Spark driver (tested with spark-shell - I haven't tried other types of 
drivers). The Spark driver tells everyone about itself using the "fgsdfg454534" 
alias, and then nobody can address it. I've worked around it in 
https://github.com/epahomov/docker-spark, but it would be better if it were solved 
at the Spark code level.
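
(A hedged sketch of the kind of setting involved, assuming the driver advertises 
whatever {{spark.driver.host}} is set to. The address and port below are placeholders, 
and for spark-shell they would have to be passed at launch time, e.g. via {{--conf}}.)

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.host", "172.17.0.10")   // an address the other containers can reach
  .set("spark.driver.port", "7001")          // fixed port so it can be published/forwarded
{code}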



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-04-02 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392780#comment-14392780
 ] 

sam commented on SPARK-2243:


// sam in this particular case, can you simply set both of these limits to the 
maximum that you need them to be, like 0.8 in both cases? It doesn't mean you 
can suddenly use 160% of memory of course. But these are just upper limits on 
what the cache/shuffle will consume, and not also inversely limits on how much 
other stuff can use. That is, 80% cache doesn't stop tasks or shuffle from 
using 100% of all memory. //

Wat?! Man, that *really* needs to be made clear in the documentation; it 
currently says "Fraction of Java heap to use", not "Fraction of Java heap to 
limit use by". Plus, an extra sentence somewhere saying "these need not add up 
to a number less than 1.0" would help clarify.

In that case I cannot think of a use case for multiple contexts, and I'll 
withdraw my vote since I do not currently have a use case for changing the 
other configuration settings on the fly (they are mainly timeouts and the 
like).

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> 

[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-04-02 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392798#comment-14392798
 ] 

sam commented on SPARK-2243:


What I would suggest, though, is putting an `assert`/`require` somewhere to 
throw a meaningful exception when two SparkContexts are created.  The current 
stack trace is rather baffling. In my experience FNF exceptions seem to be the 
catch-all exceptions of Spark - pretty much every kind of error can end up as a 
FNF exception.

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.ref

[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-04-02 Thread Jason Hubbard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392804#comment-14392804
 ] 

Jason Hubbard commented on SPARK-2243:
--

I don't think programmatically spinning up JVMs is particularly easy to get 
correct either, nor do I think it is an elegant solution, though I'm sure others 
might disagree.  In this case you will have to pay attention to many different 
things, including resource over-utilization, handling environment variables, and 
messaging or sharing resources between JVMs, and JVM forking has had many 
issues in the past.  I guess I get a bit confused about the use of SparkContext 
anyway: if the intent isn't to have multiple, then why not force it to be a 
singleton, possibly with a different name that would imply only a single instance 
should exist?

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> 

[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-04-02 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392848#comment-14392848
 ] 

sam commented on SPARK-2243:


Yup, a singleton would make sense; its creation is side-effecting, so one 
might as well have a method "setSparkContextConf(conf: SparkConf)" for setting 
the conf rather than creating a new SparkContext.  There is no point in using 
a pure FP pattern when the domain isn't pure.
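
(A hedged sketch of the singleton idea under discussion - not Spark's actual API - 
where the configuration can only be set before the single context is first used; 
the object and method names are made up for illustration.)

{code}
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextHolder {
  private var conf = new SparkConf()
  private var created = false

  // Replaces "new SparkContext(conf)" as the way to configure the shared context.
  def setSparkContextConf(c: SparkConf): Unit = synchronized {
    require(!created, "SparkContext already created; set the conf before first use")
    conf = c
  }

  // Everyone shares this one instance.
  lazy val context: SparkContext = synchronized {
    created = true
    new SparkContext(conf)
  }
}
{code}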

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Me

[jira] [Comment Edited] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-04-02 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392848#comment-14392848
 ] 

sam edited comment on SPARK-2243 at 4/2/15 3:37 PM:


Yup, a singleton would make sense; its creation is side-effecting, so one 
might as well have a method "initSparkConf(conf: SparkConf)" for initialization 
and setting the conf rather than creating a new SparkContext.  There is no 
point in using a pure FP pattern when the domain isn't pure.


was (Author: sams):
Yup, a singleton would make sense, it's creation is side effecting, so one 
might as well have a method "setSparkContextConf(conf: SparkConf)" for setting 
the conf rather than creating a new SparkContext.  There is no point in using 
an pure FP pattern when the domain isn't pure.

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.bro

[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-04-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392860#comment-14392860
 ] 

Sean Owen commented on SPARK-2243:
--

[~sams] the {{SparkContext}} constructor will throw an exception in 1.3 if you 
try to instantiate a second one. You can turn the check off with 
{{spark.driver.allowMultipleContexts=true}}, which doesn't make multiple 
contexts work 100% but no longer forbids them outright. I wouldn't recommend 
disabling the check, and the flag is off by default.

I think you'd be welcome to suggest a minor doc change. I'm 90% sure I'm right 
on this. Otherwise, we'd never see JVMs running out of memory unless the cache 
was quite full?

[~jahubba] I bet that in retrospect it would have been better to make the 
{{SparkContext}} uninstantiable and access it only via a factory method. Too 
late for that now. Some of what you're doing sounds like it would be the same 
even with simultaneous contexts -- you're still setting two sets of config, 
you're still thinking about resources. I think the forced decoupling makes 
things harder. I suppose decoupling has its upsides, but it costs in runtime 
complexity. That said, there are also some classes of use cases that might do 
well to share both a JVM and a {{SparkContext}}. These aren't 100% of use cases, 
but I hope there's good news for some who find this issue and previously thought 
it was a blocker.
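
(A minimal sketch of the check described above, for illustration only; the local 
master is a placeholder, and disabling the check is not recommended.)

{code}
import org.apache.spark.{SparkConf, SparkContext}

val first = new SparkContext(new SparkConf().setAppName("first").setMaster("local[*]"))

// A plain second construction throws an exception in 1.3:
// val second = new SparkContext(new SparkConf().setAppName("second").setMaster("local[*]"))

// With the flag, the check is skipped, but multiple contexts remain unsupported.
val second = new SparkContext(
  new SparkConf()
    .setAppName("second")
    .setMaster("local[*]")
    .set("spark.driver.allowMultipleContexts", "true"))
{code}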

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another one for Spark Streaming jobs. The next error arises when we first 
> execute a Spark calculation and, once the execution is finished, a Spark 
> Streaming job is launched:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolE

[jira] [Commented] (SPARK-5452) We are migrating Tera Data SQL to Spark SQL. Query is taking long time. Please have a look on this issue

2015-04-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392901#comment-14392901
 ] 

Nicholas Chammas commented on SPARK-5452:
-

Yeah, as Sean said, this post as it stands is not suitable for JIRA. 

You open a JIRA when you've identified a specific problem that you can 
demonstrate is specific to Spark.

Posting a large amount of code and configuration information like this doesn't 
tell anyone that there is a known problem, and no one is going to dig through it 
for you to determine whether the performance issue is specific to Spark or 
caused by something else in your environment.

In other words, this is not a "Spark support" site. JIRA is for tracking 
specific bugs or new features, and this post describes neither.

> We are migrating Tera Data SQL to Spark SQL. Query is taking long time. 
> Please have a look on this issue
> 
>
> Key: SPARK-5452
> URL: https://issues.apache.org/jira/browse/SPARK-5452
> Project: Spark
>  Issue Type: Test
>  Components: Spark Shell
>Affects Versions: 1.2.0
>Reporter: irfan
>  Labels: SparkSql
>
> Hi Team,
> we are migrating Teradata SQL to Spark SQL; because of its complexity we have 
> split it into the 4 sub-queries below,
> and we are running them through a HiveContext
> 
> val HIVETMP1 = hc.sql("SELECT PARTY_ACCOUNT_ID AS 
> PARTY_ACCOUNT_ID,LMS_ACCOUNT_ID AS LMS_ACCOUNT_ID FROM VW_PARTY_ACCOUNT WHERE 
>  PARTY_ACCOUNT_TYPE_CODE IN('04') AND  LMS_ACCOUNT_ID  IS NOT NULL")
> HIVETMP1.registerTempTable("VW_HIVETMP1")
> val HIVETMP2 = hc.sql("SELECT PACCNT.LMS_ACCOUNT_ID AS  LMS_ACCOUNT_ID, 
> 'NULL' AS  RANDOM_PARTY_ACCOUNT_ID ,'NULL' AS  MOST_RECENT_SPEND_LA 
> ,STXN.PARTY_ACCOUNT_ID AS  MAX_SPEND_12WKS_LA ,STXN.MAX_SPEND_12WKS_LADATE  
> AS MAX_SPEND_12WKS_LADATE FROM VW_HIVETMP1 AS PACCNT  INNER JOIN (SELECT 
> STXTMP.PARTY_ACCOUNT_ID AS PARTY_ACCOUNT_ID, SUM(CASE WHEN 
> (CAST(STXTMP.TRANSACTION_DATE AS DATE ) > 
> DATE_SUB(CAST(CONCAT(SUBSTRING(SYSTMP.OPTION_VAL,1,4),'-',SUBSTRING(SYSTMP.OPTION_VAL,5,2),'-',SUBSTRING(SYSTMP.OPTION_VAL,7,2))
>  AS DATE),84)) THEN STXTMP.TRANSACTION_VALUE ELSE 0.00 END) AS 
> MAX_SPEND_12WKS_LADATE FROM VW_SHOPPING_TRANSACTION_TABLE AS STXTMP INNER 
> JOIN SYSTEM_OPTION_TABLE AS SYSTMP ON STXTMP.FLAG == SYSTMP.FLAG AND  
> SYSTMP.OPTION_NAME = 'RID' AND STXTMP.PARTY_ACCOUNT_TYPE_CODE IN('04') GROUP 
> BY STXTMP.PARTY_ACCOUNT_ID) AS STXN ON PACCNT.PARTY_ACCOUNT_ID = 
> STXN.PARTY_ACCOUNT_ID WHERE  STXN.MAX_SPEND_12WKS_LADATE IS NOT NULL")
> HIVETMP2.registerTempTable("VW_HIVETMP2")
> val HIVETMP3 = hc.sql("SELECT LMS_ACCOUNT_ID,MAX(MAX_SPEND_12WKS_LA) AS 
> MAX_SPEND_12WKS_LA, 1 AS RANK FROM VW_HIVETMP2 GROUP BY LMS_ACCOUNT_ID")
> HIVETMP3.registerTempTable("VW_HIVETMP3")
> val HIVETMP4 = hc.sql(" SELECT PACCNT.LMS_ACCOUNT_ID,'NULL' AS  
> RANDOM_PARTY_ACCOUNT_ID ,'NULL' AS  
> MOST_RECENT_SPEND_LA,STXN.MAX_SPEND_12WKS_LA AS MAX_SPEND_12WKS_LA,1 AS RANK1 
> FROM VW_HIVETMP2 AS PACCNT INNER JOIN VW_HIVETMP3 AS STXN ON 
> PACCNT.LMS_ACCOUNT_ID = STXN.LMS_ACCOUNT_ID AND PACCNT.MAX_SPEND_12WKS_LA = 
> STXN.MAX_SPEND_12WKS_LA")
> HIVETMP4.registerTempTable("WT03_ACCOUNT_BHVR3")
> HIVETMP4.saveAsTextFile("hdfs:/file/")
> ==
> This query has two GROUP BY clauses which run on huge files (19.5 GB), 
> and it took 40 minutes to get the final result. Are there any changes 
> required in the runtime environment or configuration settings in Spark which 
> can improve the query performance?
> below are our Environment and configuration details:
> Environment  details:
> No of nodes:4
> capacity on each node:62 GB RAM on each node.
> Storage capacity :9TB on each node
> total cores  :48  
> Spark Configuration:
>  
> .set("spark.default.parallelism","64")
> .set("spark.driver.maxResultSize","2G")
> .set("spark.driver.memory","10g")
> .set("spark.rdd.compress","true")
> .set("spark.shuffle.spill.compress","true")
> .set("spark.shuffle.compress","true")
> .set("spark.shuffle.consolidateFiles","true/false")
> .set("spark.shuffle.spill","true/false") 
>  
> Data file size :
> SHOPPING_TRANSACTION 19.5GB
> PARTY_ACCOUNT1.4GB
> SYSTEM_OPTIONS   11.6K
> please help us to resolve above issue.
> Thanks,



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings

2015-04-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392934#comment-14392934
 ] 

Sean Owen commented on SPARK-6569:
--

[~minisaw] are you going to submit a PR to reduce the log level? Cody seems OK 
with it.

> Kafka directInputStream logs what appear to be incorrect warnings
> -
>
> Key: SPARK-6569
> URL: https://issues.apache.org/jira/browse/SPARK-6569
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
> Environment: Spark 1.3.0
>Reporter: Platon Potapov
>Priority: Minor
>
> During what appears to be normal operation of streaming from a Kafka topic, 
> the following log records are observed, logged periodically:
> {code}
> [Stage 391:==>  (3 + 0) / 
> 4]
> 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
> same as ending offset skipping raw 0
> 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
> same as ending offset skipping raw 0
> 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
> same as ending offset skipping raw 0
> {code}
> * the part.fromOffset placeholder is not correctly substituted with a value
> * does the condition really mandate a warning being logged?
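
(The literal ${part.fromOffset} text in the output is the usual signature of a Scala 
string that was meant to be interpolated but is missing the {{s}} prefix. A minimal 
illustration - not the actual KafkaRDD code:)

{code}
val fromOffset = 42L

// Without the s interpolator the placeholder is printed literally.
println("Beginning offset ${fromOffset} is the same as ending offset")

// With it, the value is substituted as intended.
println(s"Beginning offset ${fromOffset} is the same as ending offset")
{code}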



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6209) ExecutorClassLoader can leak connections after failing to load classes from the REPL class server

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6209:
---

Assignee: Apache Spark  (was: Josh Rosen)

> ExecutorClassLoader can leak connections after failing to load classes from 
> the REPL class server
> -
>
> Key: SPARK-6209
> URL: https://issues.apache.org/jira/browse/SPARK-6209
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.3, 1.1.2, 1.2.1, 1.3.0, 1.4.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Critical
> Fix For: 1.3.1, 1.4.0
>
>
> ExecutorClassLoader does not ensure proper cleanup of network connections 
> that it opens.  If it fails to load a class, it may leak partially-consumed 
> InputStreams that are connected to the REPL's HTTP class server, causing that 
> server to exhaust its thread pool, which can cause the entire job to hang.
> Here is a simple reproduction:
> With
> {code}
> ./bin/spark-shell --master local-cluster[8,8,512] 
> {code}
> run the following command:
> {code}
> sc.parallelize(1 to 1000, 1000).map { x =>
>   try {
>   Class.forName("some.class.that.does.not.Exist")
>   } catch {
>   case e: Exception => // do nothing
>   }
>   x
> }.count()
> {code}
> This job will run 253 tasks, then will completely freeze without any errors 
> or failed tasks.
> It looks like the driver has 253 threads blocked in socketRead0() calls:
> {code}
> [joshrosen ~]$ jstack 16765 | grep socketRead0 | wc
>  253 759   14674
> {code}
> e.g.
> {code}
> "qtp1287429402-13" daemon prio=5 tid=0x7f868a1c nid=0x5b03 runnable 
> [0x0001159bd000]
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:152)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
> at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
> at 
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
> at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1044)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280)
> at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
> at 
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> at 
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Thread.run(Thread.java:745) 
> {code}
> Jstack on the executors shows blocking in loadClass / findClass, where a 
> single thread is RUNNABLE and waiting to hear back from the driver and other 
> executor threads are BLOCKED on object monitor synchronization at 
> Class.forName0().
> Remotely triggering a GC on a hanging executor allows the job to progress and 
> complete more tasks before hanging again.  If I repeatedly trigger GC on all 
> of the executors, then the job runs to completion:
> {code}
> jps | grep CoarseGra | cut -d ' ' -f 1 | xargs -I {} -n 1 -P100 jcmd {} GC.run
> {code}
> The culprit is a {{catch}} block that ignores all exceptions and performs no 
> cleanup: 
> https://github.com/apache/spark/blob/v1.2.0/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L94
> This bug has been present since Spark 1.0.0, but I suspect that we haven't 
> seen it before because it's pretty hard to reproduce. Triggering this error 
> requires a job with tasks that trigger ClassNotFoundExceptions yet are still 
> able to run to completion.  It also requires that executors are able to leak 
> enough open connections to exhaust the class server's Jetty thread pool 
> limit, which requires that there are a large number of tasks (253+) and 
> either a large number of executors or a very low amount of GC pressure on 
> those executors (since GC will cause the leaked connections to be closed).
> The fix here is pretty simple: add proper resource cleanup to this class.
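
(A hedged sketch of the kind of cleanup meant by the last sentence - not the actual 
patch - showing the stream from the class server being closed even when reading the 
class bytes fails; the helper name and signature are illustrative.)

{code}
import java.io.InputStream

def readAndClose(open: () => InputStream)(read: InputStream => Array[Byte]): Option[Array[Byte]] = {
  var in: InputStream = null
  try {
    in = open()
    Some(read(in))
  } catch {
    case _: Exception => None                           // class not found, as before
  } finally {
    if (in != null) {
      try in.close() catch { case _: Exception => () }  // always release the connection
    }
  }
}
{code}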



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6209) ExecutorClassLoader can leak connections after failing to load classes from the REPL class server

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6209:
---

Assignee: Josh Rosen  (was: Apache Spark)

> ExecutorClassLoader can leak connections after failing to load classes from 
> the REPL class server
> -
>
> Key: SPARK-6209
> URL: https://issues.apache.org/jira/browse/SPARK-6209
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.3, 1.1.2, 1.2.1, 1.3.0, 1.4.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.1, 1.4.0
>
>
> ExecutorClassLoader does not ensure proper cleanup of network connections 
> that it opens.  If it fails to load a class, it may leak partially-consumed 
> InputStreams that are connected to the REPL's HTTP class server, causing that 
> server to exhaust its thread pool, which can cause the entire job to hang.
> Here is a simple reproduction:
> With
> {code}
> ./bin/spark-shell --master local-cluster[8,8,512] 
> {code}
> run the following command:
> {code}
> sc.parallelize(1 to 1000, 1000).map { x =>
>   try {
>   Class.forName("some.class.that.does.not.Exist")
>   } catch {
>   case e: Exception => // do nothing
>   }
>   x
> }.count()
> {code}
> This job will run 253 tasks, then will completely freeze without any errors 
> or failed tasks.
> It looks like the driver has 253 threads blocked in socketRead0() calls:
> {code}
> [joshrosen ~]$ jstack 16765 | grep socketRead0 | wc
>  253 759   14674
> {code}
> e.g.
> {code}
> "qtp1287429402-13" daemon prio=5 tid=0x7f868a1c nid=0x5b03 runnable 
> [0x0001159bd000]
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:152)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
> at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
> at 
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
> at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1044)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280)
> at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
> at 
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> at 
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Thread.run(Thread.java:745) 
> {code}
> Jstack on the executors shows blocking in loadClass / findClass, where a 
> single thread is RUNNABLE and waiting to hear back from the driver and other 
> executor threads are BLOCKED on object monitor synchronization at 
> Class.forName0().
> Remotely triggering a GC on a hanging executor allows the job to progress and 
> complete more tasks before hanging again.  If I repeatedly trigger GC on all 
> of the executors, then the job runs to completion:
> {code}
> jps | grep CoarseGra | cut -d ' ' -f 1 | xargs -I {} -n 1 -P100 jcmd {} GC.run
> {code}
> The culprit is a {{catch}} block that ignores all exceptions and performs no 
> cleanup: 
> https://github.com/apache/spark/blob/v1.2.0/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L94
> This bug has been present since Spark 1.0.0, but I suspect that we haven't 
> seen it before because it's pretty hard to reproduce. Triggering this error 
> requires a job with tasks that trigger ClassNotFoundExceptions yet are still 
> able to run to completion.  It also requires that executors are able to leak 
> enough open connections to exhaust the class server's Jetty thread pool 
> limit, which requires that there are a large number of tasks (253+) and 
> either a large number of executors or a very low amount of GC pressure on 
> those executors (since GC will cause the leaked connections to be closed).
> The fix here is pretty simple: add proper resource cleanup to this class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6662) Allow variable substitution in spark.yarn.historyServer.address

2015-04-02 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392963#comment-14392963
 ] 

Cheolsoo Park commented on SPARK-6662:
--

[~srowen], thank you for your comment.
{quote}
Wouldn't you be able to query for the YARN RM address somewhere and include it 
in the config?
{quote}
In a typical cloud deployment, there is usually a shared gateway from which users 
can connect to various clusters, and a few Spark configs are shared by all 
the clusters. Furthermore, clusters are usually transient in the cloud, so I'd 
like to avoid adding any cluster-specific information to Spark configs.

My current workaround is grep'ing {{yarn.resourcemanager.hostname}} from 
yarn-site.xml in my custom job launch script on the gateway and passing it via 
the {{--conf}} option on every job launch. The intention was to get rid of this 
hacky bit in my launch script.
{quote}
I am somewhat concerned about adding a narrow bit of support for one particular 
substitution, which in turn is to support a specific assumption in one type of 
deployment.
{quote}
Yes, I understand your concern. Even though I have a specific problem to solve 
at hand, I filed this jira hoping that general variable substitution would be 
added to the Spark config. In fact, I made an attempt in that direction but 
quickly ran into the following problems:
# Adding general vars sub to the Spark conf doesn't solve my problem. Since the 
Spark config and the Yarn config are separate entities in Spark, I cannot 
cross-reference properties from one to the other.
# Alternatively, I could introduce special logic for 
{{spark.yarn.historyServer.address}}, assuming the RM and HS are on the same 
node. Since the Spark AM already knows the RM address, it is trivial to 
implement. But this makes an even more specific assumption about the deployment.

It looks to me like it would take quite a bit of refactoring to implement 
general vars sub that allows cross-referencing.

So I compromised. That is, I introduced vars sub only for the {{spark.yarn.}} 
properties. In fact, vars sub already works for {{spark.hadoop.}} properties. If 
you look at the code, all the {{spark.hadoop.}} properties are already copied 
over to the Yarn config and read via the Yarn config. As a side effect, they 
support vars sub. I am just expanding the scope of this *secret* feature to the 
{{spark.yarn.}} properties.

For now, I can live with my current workaround. But I wanted to point out that 
it is not user-friendly to ask users to pass an explicit hostname and port 
number to make use of the HS. In fact, I'm not aware of any other property that 
causes the same pain in YARN mode. For example, the RM address for 
{{spark.master}} is dynamically picked up from yarn-site.xml. The HS address 
should be handled in a similar manner IMO.

Hope this explains my thought process well enough.
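
(For reference, a minimal sketch of the substitution mechanism being referred to: 
Hadoop's {{Configuration}} expands {{$\{...\}}} references when a value is read. 
The hostname below is a placeholder.)

{code}
import org.apache.hadoop.conf.Configuration

val hadoopConf = new Configuration()
hadoopConf.set("yarn.resourcemanager.hostname", "rm.example.com")
hadoopConf.set("spark.yarn.historyServer.address", "${yarn.resourcemanager.hostname}:18080")

// Configuration.get expands the reference at read time.
println(hadoopConf.get("spark.yarn.historyServer.address"))  // rm.example.com:18080
{code}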

> Allow variable substitution in spark.yarn.historyServer.address
> ---
>
> Key: SPARK-6662
> URL: https://issues.apache.org/jira/browse/SPARK-6662
> Project: Spark
>  Issue Type: Wish
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Cheolsoo Park
>Priority: Minor
>  Labels: yarn
>
> In Spark on YARN, explicit hostname and port number need to be set for 
> "spark.yarn.historyServer.address" in SparkConf to make the HISTORY link. If 
> the history server address is known and static, this is usually not a problem.
> But in cloud, that is usually not true. Particularly in EMR, the history 
> server always runs on the same node as with RM. So I could simply set it to 
> {{$\{yarn.resourcemanager.hostname\}:18080}} if variable substitution is 
> allowed.
> In fact, Hadoop configuration already implements variable substitution, so if 
> this property is read via YarnConf, this can be easily achievable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6506) python support yarn cluster mode requires SPARK_HOME to be set

2015-04-02 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392971#comment-14392971
 ] 

Marcelo Vanzin commented on SPARK-6506:
---

Maybe you're running into SPARK-5808?

> python support yarn cluster mode requires SPARK_HOME to be set
> --
>
> Key: SPARK-6506
> URL: https://issues.apache.org/jira/browse/SPARK-6506
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Thomas Graves
>
> We added support for python running in yarn cluster mode in 
> https://issues.apache.org/jira/browse/SPARK-5173, but it requires that 
> SPARK_HOME be set in the environment variables for application master and 
> executor.  It doesn't have to be set to anything real but it fails if its not 
> set.  See the command at the end of: https://github.com/apache/spark/pull/3976



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-765) Test suite should run Spark example programs

2015-04-02 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392975#comment-14392975
 ] 

Josh Rosen commented on SPARK-765:
--

Hi [~yuu.ishik...@gmail.com],

It should be fine to add a {{test}}-scoped ScalaTest dependency to the 
examples subproject.
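
For illustration, a test-scoped dependency would look roughly like the following 
in sbt (the version number is only an example, not a recommendation):
{code}
// sketch for the examples subproject's build definition
libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.4" % "test"
{code}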

> Test suite should run Spark example programs
> 
>
> Key: SPARK-765
> URL: https://issues.apache.org/jira/browse/SPARK-765
> Project: Spark
>  Issue Type: New Feature
>  Components: Examples
>Reporter: Josh Rosen
>
> The Spark test suite should also run each of the Spark example programs (the 
> PySpark suite should do the same).  This should be done through a shell 
> script or other mechanism to simulate the environment setup used by end users 
> that run those scripts.
> This would prevent problems like SPARK-764 from making it into releases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6506) python support yarn cluster mode requires SPARK_HOME to be set

2015-04-02 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392978#comment-14392978
 ] 

Thomas Graves commented on SPARK-6506:
--

No, it was built with Maven and the pyspark artifacts are there.  You just have 
to set SPARK_HOME to something even though it isn't used for anything.  
Something is looking at SPARK_HOME and blows up if it's not set.

> python support yarn cluster mode requires SPARK_HOME to be set
> --
>
> Key: SPARK-6506
> URL: https://issues.apache.org/jira/browse/SPARK-6506
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Thomas Graves
>
> We added support for python running in yarn cluster mode in 
> https://issues.apache.org/jira/browse/SPARK-5173, but it requires that 
> SPARK_HOME be set in the environment variables for application master and 
> executor.  It doesn't have to be set to anything real but it fails if it's not 
> set.  See the command at the end of: https://github.com/apache/spark/pull/3976



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings

2015-04-02 Thread Platon Potapov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392992#comment-14392992
 ] 

Platon Potapov commented on SPARK-6569:
---

If we talk about log level reduction only, then no, I don't feel strongly about 
the change.

But I would suggest not logging anything at all. Cody's rationale behind the 
warning is for empty input batches not to go unnoticed. I feel that is the 
responsibility of a monitoring agent external to the Spark job - an agent that 
knows the specifics of the use case the job is being run in, and hence has the 
capacity to decide whether an empty batch is an abnormal situation (in my 
case, it isn't).

If that's not an option - well, please close the ticket then.
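
(As a side note on the first bullet of the description: the un-expanded 
${part.fromOffset} in the log output suggests the message is built from a plain 
string literal rather than Scala's {{s}} interpolator. A minimal standalone 
illustration of the difference, not the actual KafkaRDD code:)
{code}
object InterpolationDemo extends App {
  case class Partition(fromOffset: Long)
  val part = Partition(42L)

  // Without the s prefix the placeholder is printed literally:
  println("Beginning offset ${part.fromOffset} is the same as ending offset")

  // With the s interpolator the value is substituted:
  println(s"Beginning offset ${part.fromOffset} is the same as ending offset")
}
{code}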


> Kafka directInputStream logs what appear to be incorrect warnings
> -
>
> Key: SPARK-6569
> URL: https://issues.apache.org/jira/browse/SPARK-6569
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
> Environment: Spark 1.3.0
>Reporter: Platon Potapov
>Priority: Minor
>
> During what appears to be normal operation of streaming from a Kafka topic, 
> the following log records are observed, logged periodically:
> {code}
> [Stage 391:==>  (3 + 0) / 
> 4]
> 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
> same as ending offset skipping raw 0
> 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
> same as ending offset skipping raw 0
> 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
> same as ending offset skipping raw 0
> {code}
> * the part.fromOffset placeholder is not correctly substituted to a value
> * does the condition really mandate a warning being logged?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6618) HiveMetastoreCatalog.lookupRelation should use fine-grained lock

2015-04-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392991#comment-14392991
 ] 

Apache Spark commented on SPARK-6618:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/5333

> HiveMetastoreCatalog.lookupRelation should use fine-grained lock
> 
>
> Key: SPARK-6618
> URL: https://issues.apache.org/jira/browse/SPARK-6618
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.3.1, 1.4.0
>
>
> Right now the entire method of HiveMetastoreCatalog.lookupRelation has a lock 
> (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L173)
>  and the scope of lock will cover resolving data source tables 
> (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L93).
>  So, lookupRelation can be extremely expensive when we are doing expensive 
> operations like parquet schema discovery. So, we should use fine-grained lock 
> for lookupRelation.
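
A minimal sketch of the narrower locking the description asks for (illustrative 
names only, not HiveMetastoreCatalog's actual code): hold the lock only around 
the shared metastore access and do the expensive resolution outside of it.
{code}
class Catalog {
  private val clientLock = new Object

  // Stand-ins for the metastore call and for expensive work such as
  // parquet schema discovery.
  private def fetchFromMetastore(name: String): String = s"raw:$name"
  private def resolveDataSource(raw: String): String = s"resolved:$raw"

  def lookupRelation(name: String): String = {
    // Guard only the non-thread-safe metastore access...
    val rawTable = clientLock.synchronized { fetchFromMetastore(name) }
    // ...and run the potentially expensive resolution unsynchronized.
    resolveDataSource(rawTable)
  }
}
{code}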



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-04-02 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393017#comment-14393017
 ] 

Joseph K. Bradley commented on SPARK-5992:
--

That sounds good; I'll try to take a look at that paper soon.  Will you be able 
to write a short design doc?

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5972) Cache residuals for GradientBoostedTrees during training

2015-04-02 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5972:
-
Assignee: Manoj Kumar

> Cache residuals for GradientBoostedTrees during training
> 
>
> Key: SPARK-5972
> URL: https://issues.apache.org/jira/browse/SPARK-5972
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Manoj Kumar
>Priority: Minor
>
> In gradient boosting, the current model's prediction is re-computed for each 
> training instance on every iteration.  The current residual (cumulative 
> prediction of previously trained trees in the ensemble) should be cached.  
> That could reduce both computation (only computing the prediction of the most 
> recently trained tree) and communication (only sending the most recently 
> trained tree to the workers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2015-04-02 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393021#comment-14393021
 ] 

Joseph K. Bradley commented on SPARK-6407:
--

I'm not too familiar with the area, but it seems similar to randomized linear 
algebra work if you can assume the incoming data is i.i.d.  But [~mengxr] may 
be more familiar with this literature than me...

> Streaming ALS for Collaborative Filtering
> -
>
> Key: SPARK-6407
> URL: https://issues.apache.org/jira/browse/SPARK-6407
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Felix Cheung
>Priority: Minor
>
> Like MLlib's ALS implementation for recommendation, but applied to streaming.
> Similar to streaming linear regression and logistic regression, could we apply 
> gradient updates to batches of data and reuse the existing MLlib implementation?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6194) collect() in PySpark will cause memory leak in JVM

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6194:
---

Assignee: Apache Spark  (was: Davies Liu)

> collect() in PySpark will cause memory leak in JVM
> --
>
> Key: SPARK-6194
> URL: https://issues.apache.org/jira/browse/SPARK-6194
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
>Reporter: Davies Liu
>Assignee: Apache Spark
>Priority: Critical
> Fix For: 1.2.2, 1.3.1, 1.4.0
>
>
> It could be reproduced  by:
> {code}
> for i in range(40):
> sc.parallelize(range(5000), 10).flatMap(lambda i: range(1)).collect()
> {code}
> It will fail after 2 or 3 jobs, and run totally successfully if I add
> `gc.collect()` after each job.
> We could call _detach() for the JavaList returned by collect
> in Java, will send out a PR for this.
> Reported by Michael and commented by Josh:
> On Thu, Mar 5, 2015 at 2:39 PM, Josh Rosen  wrote:
> > Based on Py4J's Memory Model page
> > (http://py4j.sourceforge.net/advanced_topics.html#py4j-memory-model):
> >
> >> Because Java objects on the Python side are involved in a circular
> >> reference (JavaObject and JavaMember reference each other), these objects
> >> are not immediately garbage collected once the last reference to the object
> >> is removed (but they are guaranteed to be eventually collected if the 
> >> Python
> >> garbage collector runs before the Python program exits).
> >
> >
> >>
> >> In doubt, users can always call the detach function on the Python gateway
> >> to explicitly delete a reference on the Java side. A call to gc.collect()
> >> also usually works.
> >
> >
> > Maybe we should be manually calling detach() when the Python-side has
> > finished consuming temporary objects from the JVM.  Do you have a small
> > workload / configuration that reproduces the OOM which we can use to test a
> > fix?  I don't think that I've seen this issue in the past, but this might be
> > because we mistook Java OOMs as being caused by collecting too much data
> > rather than due to memory leaks.
> >
> > On Thu, Mar 5, 2015 at 10:41 AM, Michael Nazario 
> > wrote:
> >>
> >> Hi Josh,
> >>
> >> I have a question about how PySpark does memory management in the Py4J
> >> bridge between the Java driver and the Python driver. I was wondering if
> >> there have been any memory problems in this system because the Python
> >> garbage collector does not collect circular references immediately and Py4J
> >> has circular references in each object it receives from Java.
> >>
> >> When I dug through the PySpark code, I seemed to find that most RDD
> >> actions return by calling collect. In collect, you end up calling the Java
> >> RDD collect and getting an iterator from that. Would this be a possible
> >> cause for a Java driver OutOfMemoryException because there are resources in
> >> Java which do not get freed up immediately?
> >>
> >> I have also seen that trying to take a lot of values from a dataset twice
> >> in a row can cause the Java driver to OOM (while just once works). Are 
> >> there
> >> some other memory considerations that are relevant in the driver?
> >>
> >> Thanks,
> >> Michael



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6194) collect() in PySpark will cause memory leak in JVM

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6194:
---

Assignee: Davies Liu  (was: Apache Spark)

> collect() in PySpark will cause memory leak in JVM
> --
>
> Key: SPARK-6194
> URL: https://issues.apache.org/jira/browse/SPARK-6194
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.2.2, 1.3.1, 1.4.0
>
>
> It could be reproduced  by:
> {code}
> for i in range(40):
> sc.parallelize(range(5000), 10).flatMap(lambda i: range(1)).collect()
> {code}
> It will fail after 2 or 3 jobs, and run totally successfully if I add
> `gc.collect()` after each job.
> We could call _detach() for the JavaList returned by collect
> in Java, will send out a PR for this.
> Reported by Michael and commented by Josh:
> On Thu, Mar 5, 2015 at 2:39 PM, Josh Rosen  wrote:
> > Based on Py4J's Memory Model page
> > (http://py4j.sourceforge.net/advanced_topics.html#py4j-memory-model):
> >
> >> Because Java objects on the Python side are involved in a circular
> >> reference (JavaObject and JavaMember reference each other), these objects
> >> are not immediately garbage collected once the last reference to the object
> >> is removed (but they are guaranteed to be eventually collected if the 
> >> Python
> >> garbage collector runs before the Python program exits).
> >
> >
> >>
> >> In doubt, users can always call the detach function on the Python gateway
> >> to explicitly delete a reference on the Java side. A call to gc.collect()
> >> also usually works.
> >
> >
> > Maybe we should be manually calling detach() when the Python-side has
> > finished consuming temporary objects from the JVM.  Do you have a small
> > workload / configuration that reproduces the OOM which we can use to test a
> > fix?  I don't think that I've seen this issue in the past, but this might be
> > because we mistook Java OOMs as being caused by collecting too much data
> > rather than due to memory leaks.
> >
> > On Thu, Mar 5, 2015 at 10:41 AM, Michael Nazario 
> > wrote:
> >>
> >> Hi Josh,
> >>
> >> I have a question about how PySpark does memory management in the Py4J
> >> bridge between the Java driver and the Python driver. I was wondering if
> >> there have been any memory problems in this system because the Python
> >> garbage collector does not collect circular references immediately and Py4J
> >> has circular references in each object it receives from Java.
> >>
> >> When I dug through the PySpark code, I seemed to find that most RDD
> >> actions return by calling collect. In collect, you end up calling the Java
> >> RDD collect and getting an iterator from that. Would this be a possible
> >> cause for a Java driver OutOfMemoryException because there are resources in
> >> Java which do not get freed up immediately?
> >>
> >> I have also seen that trying to take a lot of values from a dataset twice
> >> in a row can cause the Java driver to OOM (while just once works). Are 
> >> there
> >> some other memory considerations that are relevant in the driver?
> >>
> >> Thanks,
> >> Michael



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-6667) hang while collect in PySpark

2015-04-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-6667:
--
Comment: was deleted

(was: This patch introduced a rare bug that can cause a hang while calling 
{{collect()}} (SPARK-6194); I'm commenting here so we remember to put the fix 
into the correct branches.)

> hang while collect in PySpark
> -
>
> Key: SPARK-6667
> URL: https://issues.apache.org/jira/browse/SPARK-6667
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> PySpark tests hang while collecting:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-677) PySpark should not collect results through local filesystem

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-677:
--

Assignee: Davies Liu  (was: Apache Spark)

> PySpark should not collect results through local filesystem
> ---
>
> Key: SPARK-677
> URL: https://issues.apache.org/jira/browse/SPARK-677
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0, 1.4.0
>Reporter: Josh Rosen
>Assignee: Davies Liu
>
> Py4J is slow when transferring large arrays, so PySpark currently dumps data 
> to the disk and reads it back in order to collect() RDDs.  On large enough 
> datasets, this data will spill from the buffer cache and write to the 
> physical disk, resulting in terrible performance.
> Instead, we should stream the data from Java to Python over a local socket or 
> a FIFO.
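
A minimal sketch of the socket-based alternative (illustrative only, not a 
proposed API): the JVM side serves the already-serialized records over a 
loopback socket with length-prefixed framing, and the Python side would connect 
to the returned port instead of reading a temp file.
{code}
import java.io.DataOutputStream
import java.net.{InetAddress, ServerSocket}

// Serve one client on an ephemeral loopback port and return that port.
def serveOnce(records: Seq[Array[Byte]]): Int = {
  val server = new ServerSocket(0, 1, InetAddress.getByName("localhost"))
  new Thread {
    override def run(): Unit = {
      val sock = server.accept()
      val out = new DataOutputStream(sock.getOutputStream)
      try {
        records.foreach { r =>
          out.writeInt(r.length)   // length-prefixed framing
          out.write(r)
        }
        out.writeInt(-1)           // end-of-stream marker
      } finally {
        out.close(); sock.close(); server.close()
      }
    }
  }.start()
  server.getLocalPort
}
{code}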



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6667) hang while collect in PySpark

2015-04-02 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393023#comment-14393023
 ] 

Josh Rosen commented on SPARK-6667:
---

This patch introduced a rare bug that can cause a hang while calling 
{{collect()}} (SPARK-6194); I'm commenting here so we remember to put the fix 
into the correct branches.

> hang while collect in PySpark
> -
>
> Key: SPARK-6667
> URL: https://issues.apache.org/jira/browse/SPARK-6667
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> PySpark tests hang while collecting:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-677) PySpark should not collect results through local filesystem

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-677:
--

Assignee: Apache Spark  (was: Davies Liu)

> PySpark should not collect results through local filesystem
> ---
>
> Key: SPARK-677
> URL: https://issues.apache.org/jira/browse/SPARK-677
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0, 1.4.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> Py4J is slow when transferring large arrays, so PySpark currently dumps data 
> to the disk and reads it back in order to collect() RDDs.  On large enough 
> datasets, this data will spill from the buffer cache and write to the 
> physical disk, resulting in terrible performance.
> Instead, we should stream the data from Java to Python over a local socket or 
> a FIFO.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6194) collect() in PySpark will cause memory leak in JVM

2015-04-02 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393026#comment-14393026
 ] 

Josh Rosen commented on SPARK-6194:
---

This patch introduced a rare bug that can cause a hang while calling 
{{collect()}} (SPARK-6667); I'm commenting here so that the issues are linked 
to ensure that the fixes land in the right branches.

> collect() in PySpark will cause memory leak in JVM
> --
>
> Key: SPARK-6194
> URL: https://issues.apache.org/jira/browse/SPARK-6194
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.2.2, 1.3.1, 1.4.0
>
>
> It could be reproduced  by:
> {code}
> for i in range(40):
> sc.parallelize(range(5000), 10).flatMap(lambda i: range(1)).collect()
> {code}
> It will fail after 2 or 3 jobs, and run totally successfully if I add
> `gc.collect()` after each job.
> We could call _detach() for the JavaList returned by collect
> in Java, will send out a PR for this.
> Reported by Michael and commented by Josh:
> On Thu, Mar 5, 2015 at 2:39 PM, Josh Rosen  wrote:
> > Based on Py4J's Memory Model page
> > (http://py4j.sourceforge.net/advanced_topics.html#py4j-memory-model):
> >
> >> Because Java objects on the Python side are involved in a circular
> >> reference (JavaObject and JavaMember reference each other), these objects
> >> are not immediately garbage collected once the last reference to the object
> >> is removed (but they are guaranteed to be eventually collected if the 
> >> Python
> >> garbage collector runs before the Python program exits).
> >
> >
> >>
> >> In doubt, users can always call the detach function on the Python gateway
> >> to explicitly delete a reference on the Java side. A call to gc.collect()
> >> also usually works.
> >
> >
> > Maybe we should be manually calling detach() when the Python-side has
> > finished consuming temporary objects from the JVM.  Do you have a small
> > workload / configuration that reproduces the OOM which we can use to test a
> > fix?  I don't think that I've seen this issue in the past, but this might be
> > because we mistook Java OOMs as being caused by collecting too much data
> > rather than due to memory leaks.
> >
> > On Thu, Mar 5, 2015 at 10:41 AM, Michael Nazario 
> > wrote:
> >>
> >> Hi Josh,
> >>
> >> I have a question about how PySpark does memory management in the Py4J
> >> bridge between the Java driver and the Python driver. I was wondering if
> >> there have been any memory problems in this system because the Python
> >> garbage collector does not collect circular references immediately and Py4J
> >> has circular references in each object it receives from Java.
> >>
> >> When I dug through the PySpark code, I seemed to find that most RDD
> >> actions return by calling collect. In collect, you end up calling the Java
> >> RDD collect and getting an iterator from that. Would this be a possible
> >> cause for a Java driver OutOfMemoryException because there are resources in
> >> Java which do not get freed up immediately?
> >>
> >> I have also seen that trying to take a lot of values from a dataset twice
> >> in a row can cause the Java driver to OOM (while just once works). Are 
> >> there
> >> some other memory considerations that are relevant in the driver?
> >>
> >> Thanks,
> >> Michael



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6194) collect() in PySpark will cause memory leak in JVM

2015-04-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-6194.
---
  Resolution: Fixed
Target Version/s: 1.3.0, 1.2.2  (was: 1.0.3, 1.1.2, 1.2.2, 1.3.0)

Going to resolve this as "Fixed" for now, since I don't think we're going to 
backport prior to 1.2.x.

> collect() in PySpark will cause memory leak in JVM
> --
>
> Key: SPARK-6194
> URL: https://issues.apache.org/jira/browse/SPARK-6194
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.2.2, 1.3.1, 1.4.0
>
>
> It could be reproduced  by:
> {code}
> for i in range(40):
> sc.parallelize(range(5000), 10).flatMap(lambda i: range(1)).collect()
> {code}
> It will fail after 2 or 3 jobs, and run totally successfully if I add
> `gc.collect()` after each job.
> We could call _detach() for the JavaList returned by collect
> in Java, will send out a PR for this.
> Reported by Michael and commented by Josh:
> On Thu, Mar 5, 2015 at 2:39 PM, Josh Rosen  wrote:
> > Based on Py4J's Memory Model page
> > (http://py4j.sourceforge.net/advanced_topics.html#py4j-memory-model):
> >
> >> Because Java objects on the Python side are involved in a circular
> >> reference (JavaObject and JavaMember reference each other), these objects
> >> are not immediately garbage collected once the last reference to the object
> >> is removed (but they are guaranteed to be eventually collected if the 
> >> Python
> >> garbage collector runs before the Python program exits).
> >
> >
> >>
> >> In doubt, users can always call the detach function on the Python gateway
> >> to explicitly delete a reference on the Java side. A call to gc.collect()
> >> also usually works.
> >
> >
> > Maybe we should be manually calling detach() when the Python-side has
> > finished consuming temporary objects from the JVM.  Do you have a small
> > workload / configuration that reproduces the OOM which we can use to test a
> > fix?  I don't think that I've seen this issue in the past, but this might be
> > because we mistook Java OOMs as being caused by collecting too much data
> > rather than due to memory leaks.
> >
> > On Thu, Mar 5, 2015 at 10:41 AM, Michael Nazario 
> > wrote:
> >>
> >> Hi Josh,
> >>
> >> I have a question about how PySpark does memory management in the Py4J
> >> bridge between the Java driver and the Python driver. I was wondering if
> >> there have been any memory problems in this system because the Python
> >> garbage collector does not collect circular references immediately and Py4J
> >> has circular references in each object it receives from Java.
> >>
> >> When I dug through the PySpark code, I seemed to find that most RDD
> >> actions return by calling collect. In collect, you end up calling the Java
> >> RDD collect and getting an iterator from that. Would this be a possible
> >> cause for a Java driver OutOfMemoryException because there are resources in
> >> Java which do not get freed up immediately?
> >>
> >> I have also seen that trying to take a lot of values from a dataset twice
> >> in a row can cause the Java driver to OOM (while just once works). Are 
> >> there
> >> some other memory considerations that are relevant in the driver?
> >>
> >> Thanks,
> >> Michael



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6681) JAVA_HOME error with upgrade to Spark 1.3.0

2015-04-02 Thread Ken Williams (JIRA)
Ken Williams created SPARK-6681:
---

 Summary: JAVA_HOME error with upgrade to Spark 1.3.0
 Key: SPARK-6681
 URL: https://issues.apache.org/jira/browse/SPARK-6681
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 1.3.0
 Environment: Client is Mac OS X version 10.10.2, cluster is running 
HDP 2.1 stack.
Reporter: Ken Williams


I’m trying to upgrade a Spark project, written in Scala, from Spark 1.2.1 to 
1.3.0, so I changed my `build.sbt` like so:

{code}
-libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % 
"provided"
+libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % 
"provided"
{code}

then make an `assembly` jar, and submit it:

{code}
HADOOP_CONF_DIR=/etc/hadoop/conf \
spark-submit \
--driver-class-path=/etc/hbase/conf \
--conf spark.hadoop.validateOutputSpecs=false \
--conf 
spark.yarn.jar=hdfs:/apps/local/spark-assembly-1.3.0-hadoop2.4.0.jar \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--deploy-mode=cluster \
--master=yarn \
--class=TestObject \
--num-executors=54 \
target/scala-2.11/myapp-assembly-1.2.jar
{code}

The job fails to submit, with the following exception in the terminal:

{code}
15/03/19 10:30:07 INFO yarn.Client: 
15/03/19 10:20:03 INFO yarn.Client: 
 client token: N/A
 diagnostics: Application application_1420225286501_4698 failed 2 times 
due to AM 
 Container for appattempt_1420225286501_4698_02 exited with  
exitCode: 127 
 due to: Exception from container-launch: 
org.apache.hadoop.util.Shell$ExitCodeException: 
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
{code}


Finally, I go and check the YARN app master’s web interface (since the job is 
there, I know it at least made it that far), and the only logs it shows are 
these:

{code}
Log Type: stderr
Log Length: 61
/bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory

Log Type: stdout
Log Length: 0
{code}

I’m not sure how to interpret that - is {{ {{JAVA_HOME}} }} a literal 
(including the brackets) that’s somehow making it into a script?  Is this 
coming from the worker nodes or the driver?  Anything I can do to experiment & 
troubleshoot?

I do have {{JAVA_HOME}} set in the hadoop config files on all the nodes of the 
cluster:

{code}
% grep JAVA_HOME /etc/hadoop/conf/*.sh
/etc/hadoop/conf/hadoop-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31
/etc/hadoop/conf/yarn-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31
{code}

Has this behavior changed in 1.3.0 since 1.2.1?  Using 1.2.1 and making no 
other changes, the job completes fine.

(Note: I originally posted this on the Spark mailing list and also on Stack 
Overflow, I'll update both places if/when I find a solution.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6667) hang while collect in PySpark

2015-04-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-6667:
--
 Priority: Blocker  (was: Critical)
 Target Version/s: 1.2.2, 1.3.1, 1.4.0  (was: 1.3.1, 1.4.0)
Affects Version/s: 1.2.2

This was introduced by SPARK-6194, which was merged for 1.2.2, so I'm adding 
1.2.2 as an affected / target version.

> hang while collect in PySpark
> -
>
> Key: SPARK-6667
> URL: https://issues.apache.org/jira/browse/SPARK-6667
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.2, 1.3.1, 1.4.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
>
> PySpark tests hang while collecting:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-02 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6682:


 Summary: Deprecate static train and use builder instead for 
Scala/Java
 Key: SPARK-6682
 URL: https://issues.apache.org/jira/browse/SPARK-6682
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley


In MLlib, we have for some time been unofficially moving away from the old 
static train() methods and moving towards builder patterns.  This JIRA is to 
discuss this move and (hopefully) make it official.

"Old static train()" API:
{code}
val myModel = NaiveBayes.train(myData, ...)
{code}

"New builder pattern" API:
{code}
val nb = new NaiveBayes().setLambda(0.1)
val myModel = nb.train(myData)
{code}

Pros of the builder pattern:
* Much less code when algorithms have many parameters.  Since Java does not 
support default arguments, we required *many* duplicated static train() methods 
(for each prefix set of arguments).
* Helps to enforce default parameters.  Users should ideally not have to even 
think about setting parameters if they just want to try an algorithm quickly.
* Matches spark.ml API

Cons:
* In Python APIs, static train methods are more "Pythonic."

Proposal:
* Scala/Java: We should start deprecating the old static train() methods.  We 
must keep them for API stability, but deprecating will help with API 
consistency, making it clear that everyone should use the builder pattern.  As 
we deprecate them, we should make sure that the builder pattern supports all 
parameters.
* Python: Keep static train methods.

CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6683) GLMs with GradientDescent could scale step size instead of features

2015-04-02 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6683:


 Summary: GLMs with GradientDescent could scale step size instead 
of features
 Key: SPARK-6683
 URL: https://issues.apache.org/jira/browse/SPARK-6683
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor


GeneralizedLinearAlgorithm can scale features.  This improves optimization 
behavior (and also affects the optimal solution, as is being discussed and 
hopefully fixed by [https://github.com/apache/spark/pull/5055]).

This is a bit inefficient since it requires making a rescaled copy of the data.

GradientDescent could instead scale the step size separately for each feature 
(and adjust regularization as needed; see the PR linked above).  This would 
require storing a vector of length numFeatures, rather than making a full copy 
of the data.

I haven't thought this through for LBFGS, so I'm not sure if it's generally 
usable or would require a specialization for GLMs with GradientDescent.
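
To make the idea concrete, a rough sketch (plain Scala, not Spark's 
GradientDescent, and ignoring the regularization adjustment mentioned above) of 
scaling the update per feature from a precomputed vector of feature scales 
instead of rescaling the data:
{code}
// featureStd would be computed once per dataset (length = numFeatures),
// e.g. column standard deviations.
def updateWeights(
    weights: Array[Double],
    gradient: Array[Double],
    featureStd: Array[Double],
    stepSize: Double): Array[Double] = {
  Array.tabulate(weights.length) { j =>
    val scale = if (featureStd(j) != 0.0) featureStd(j) else 1.0
    // The per-feature step size stands in for training on rescaled features.
    weights(j) - (stepSize / scale) * gradient(j)
  }
}
{code}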



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format

2015-04-02 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393141#comment-14393141
 ] 

Zhan Zhang commented on SPARK-2883:
---

The following code demonstrates the usage of the ORC support.

{code}
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._

// schema
case class AllDataTypes(
  stringField: String,
  intField: Int,
  longField: Long,
  floatField: Float,
  doubleField: Double,
  shortField: Short,
  byteField: Byte,
  booleanField: Boolean)

// saveAsOrcFile
val range = (0 to 255)
val data = sc.parallelize(range).map(x =>
  AllDataTypes(s"$x", x, x.toLong, x.toFloat, x.toDouble, x.toShort, x.toByte, x % 2 == 0))
data.toDF().saveAsOrcFile("orcTest")

// read the ORC file back
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val orcTest = hiveContext.orcFile("orcTest")
orcTest.registerTempTable("orcTest")
hiveContext.sql("SELECT * from orcTest where intfield>185").collect.foreach(println)

// new data source API, read
hiveContext.sql("create temporary table orc using org.apache.spark.sql.hive.orc OPTIONS (path \"orcTest\")")
hiveContext.sql("select * from orc").collect.foreach(println)
val table = hiveContext.sql("select * from orc")

// new data source API, write
table.saveAsTable("table", "org.apache.spark.sql.hive.orc")
val hiveOrc = hiveContext.orcFile("/user/hive/warehouse/table")
hiveOrc.registerTempTable("hiveOrc")
hiveContext.sql("select * from hiveOrc").collect.foreach(println)
table.saveAsOrcFile("/user/ambari-qa/table")
hiveContext.sql("create temporary table normal_orc_as_source USING org.apache.spark.sql.hive.orc OPTIONS (path 'saveTable') as select * from table")
{code}

> Spark Support for ORCFile format
> 
>
> Key: SPARK-2883
> URL: https://issues.apache.org/jira/browse/SPARK-2883
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Reporter: Zhan Zhang
>Priority: Blocker
> Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 
> pm jobtracker.png, orc.diff
>
>
> Verify the support of OrcInputFormat in Spark, fix issues if any exist, and add 
> documentation of its usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3720) support ORC in spark sql

2015-04-02 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393146#comment-14393146
 ] 

Zhan Zhang commented on SPARK-3720:
---

[~iward] I have updated the patch with new API support.  You can refer to 
https://issues.apache.org/jira/browse/SPARK-2883

> support ORC in spark sql
> 
>
> Key: SPARK-3720
> URL: https://issues.apache.org/jira/browse/SPARK-3720
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Fei Wang
> Attachments: orc.diff
>
>
> The Optimized Row Columnar (ORC) file format provides a highly efficient way 
> to store data on HDFS. The ORC file format has many advantages, such as:
> 1. a single file as the output of each task, which reduces the NameNode's load
> 2. Hive type support including datetime, decimal, and the complex types 
> (struct, list, map, and union)
> 3. light-weight indexes stored within the file:
> - skip row groups that don't pass predicate filtering
> - seek to a given row
> 4. block-mode compression based on data type:
> - run-length encoding for integer columns
> - dictionary encoding for string columns
> 5. concurrent reads of the same file using separate RecordReaders
> 6. ability to split files without scanning for markers
> 7. bound the amount of memory needed for reading or writing
> 8. metadata stored using Protocol Buffers, which allows addition and removal 
> of fields
> Spark SQL already supports Parquet; supporting ORC would give people more options.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6684) Add checkpointing to GradientBoostedTrees

2015-04-02 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6684:


 Summary: Add checkpointing to GradientBoostedTrees
 Key: SPARK-6684
 URL: https://issues.apache.org/jira/browse/SPARK-6684
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor


We should add checkpointing to GradientBoostedTrees since it maintains RDDs 
with long lineages.
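
For reference, a minimal sketch (hypothetical path and interval, not 
GradientBoostedTrees' internals) of how periodic checkpointing truncates the 
lineage in an iterative loop:
{code}
// Assumes an existing SparkContext `sc`; the directory and interval are made up.
sc.setCheckpointDir("hdfs:///tmp/gbt-checkpoints")

var working = sc.parallelize(1 to 1000).map(_.toDouble)
val checkpointInterval = 10
for (iter <- 1 to 100) {
  working = working.map(_ * 1.0001)   // stands in for one boosting iteration
  if (iter % checkpointInterval == 0) {
    working.cache()
    working.checkpoint()              // cuts the lineage accumulated so far
    working.count()                   // forces the checkpoint to materialize
  }
}
{code}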



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6671) Add status command for spark daemons

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6671:
---

Assignee: (was: Apache Spark)

> Add status command for spark daemons
> 
>
> Key: SPARK-6671
> URL: https://issues.apache.org/jira/browse/SPARK-6671
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: PRADEEP CHANUMOLU
>  Labels: easyfix
> Fix For: 1.4.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, using the spark-daemon.sh script we can start and stop the Spark 
> daemons. But we cannot get the status of the daemons. It would be nice to 
> include a status command in the spark-daemon.sh script, through which we 
> can know whether a Spark daemon is alive or not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6671) Add status command for spark daemons

2015-04-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393182#comment-14393182
 ] 

Apache Spark commented on SPARK-6671:
-

User 'pchanumolu' has created a pull request for this issue:
https://github.com/apache/spark/pull/5327

> Add status command for spark daemons
> 
>
> Key: SPARK-6671
> URL: https://issues.apache.org/jira/browse/SPARK-6671
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: PRADEEP CHANUMOLU
>  Labels: easyfix
> Fix For: 1.4.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, using the spark-daemon.sh script we can start and stop the Spark 
> daemons. But we cannot get the status of the daemons. It would be nice to 
> include a status command in the spark-daemon.sh script, through which we 
> can know whether a Spark daemon is alive or not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6671) Add status command for spark daemons

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6671:
---

Assignee: Apache Spark

> Add status command for spark daemons
> 
>
> Key: SPARK-6671
> URL: https://issues.apache.org/jira/browse/SPARK-6671
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: PRADEEP CHANUMOLU
>Assignee: Apache Spark
>  Labels: easyfix
> Fix For: 1.4.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, using the spark-daemon.sh script we can start and stop the Spark 
> daemons. But we cannot get the status of the daemons. It would be nice to 
> include a status command in the spark-daemon.sh script, through which we 
> can know whether a Spark daemon is alive or not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6667) hang while collect in PySpark

2015-04-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-6667.
---
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.1
   1.2.2

> hang while collect in PySpark
> -
>
> Key: SPARK-6667
> URL: https://issues.apache.org/jira/browse/SPARK-6667
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.2, 1.3.1, 1.4.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.2.2, 1.3.1, 1.4.0
>
>
> PySpark tests hang while collecting:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4194) Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4194:
---

Assignee: Apache Spark

> Exceptions thrown during SparkContext or SparkEnv construction might lead to 
> resource leaks or corrupted global state
> -
>
> Key: SPARK-4194
> URL: https://issues.apache.org/jira/browse/SPARK-4194
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Critical
>
> The SparkContext and SparkEnv constructors instantiate a bunch of objects 
> that may need to be cleaned up after they're no longer needed.  If an 
> exception is thrown during SparkContext or SparkEnv construction (e.g. due to 
> a bad configuration setting), then objects created earlier in the constructor 
> may not be properly cleaned up.
> This is unlikely to cause problems for batch jobs submitted through 
> {{spark-submit}}, since failure to construct SparkContext will probably cause 
> the JVM to exit, but it is a potentially serious issue in interactive 
> environments where a user might attempt to create SparkContext with some 
> configuration, fail due to an error, and re-attempt the creation with new 
> settings.  In this case, resources from the previous creation attempt might 
> not have been cleaned up and could lead to confusing errors (especially if 
> the old, leaked resources share global state with the new SparkContext).
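
A minimal sketch of the cleanup-on-failure pattern being described (names are 
illustrative, not SparkContext's actual fields):
{code}
import java.io.Closeable
import scala.collection.mutable.ArrayBuffer

class FakeService(name: String) extends Closeable {
  println(s"started $name")
  override def close(): Unit = println(s"closed $name")
}

class Context(failDuringInit: Boolean) {
  private val created = ArrayBuffer.empty[Closeable]

  try {
    created += new FakeService("blockManager")
    created += new FakeService("shuffleService")
    if (failDuringInit) throw new IllegalArgumentException("bad config setting")
    created += new FakeService("ui")
  } catch {
    case e: Throwable =>
      // Tear down whatever was constructed before the failure, then rethrow,
      // so the failed construction attempt does not leak resources.
      created.reverse.foreach(c => try c.close() catch { case _: Throwable => () })
      throw e
  }
}

// new Context(failDuringInit = true) throws, but "blockManager" and
// "shuffleService" are closed before the exception propagates.
{code}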



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4194) Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state

2015-04-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393249#comment-14393249
 ] 

Apache Spark commented on SPARK-4194:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5335

> Exceptions thrown during SparkContext or SparkEnv construction might lead to 
> resource leaks or corrupted global state
> -
>
> Key: SPARK-4194
> URL: https://issues.apache.org/jira/browse/SPARK-4194
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>Priority: Critical
>
> The SparkContext and SparkEnv constructors instantiate a bunch of objects 
> that may need to be cleaned up after they're no longer needed.  If an 
> exception is thrown during SparkContext or SparkEnv construction (e.g. due to 
> a bad configuration setting), then objects created earlier in the constructor 
> may not be properly cleaned up.
> This is unlikely to cause problems for batch jobs submitted through 
> {{spark-submit}}, since failure to construct SparkContext will probably cause 
> the JVM to exit, but it is a potentially serious issue in interactive 
> environments where a user might attempt to create SparkContext with some 
> configuration, fail due to an error, and re-attempt the creation with new 
> settings.  In this case, resources from the previous creation attempt might 
> not have been cleaned up and could lead to confusing errors (especially if 
> the old, leaked resources share global state with the new SparkContext).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4194) Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4194:
---

Assignee: (was: Apache Spark)

> Exceptions thrown during SparkContext or SparkEnv construction might lead to 
> resource leaks or corrupted global state
> -
>
> Key: SPARK-4194
> URL: https://issues.apache.org/jira/browse/SPARK-4194
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>Priority: Critical
>
> The SparkContext and SparkEnv constructors instantiate a bunch of objects 
> that may need to be cleaned up after they're no longer needed.  If an 
> exception is thrown during SparkContext or SparkEnv construction (e.g. due to 
> a bad configuration setting), then objects created earlier in the constructor 
> may not be properly cleaned up.
> This is unlikely to cause problems for batch jobs submitted through 
> {{spark-submit}}, since failure to construct SparkContext will probably cause 
> the JVM to exit, but it is a potentially serious issue in interactive 
> environments where a user might attempt to create SparkContext with some 
> configuration, fail due to an error, and re-attempt the creation with new 
> settings.  In this case, resources from the previous creation attempt might 
> not have been cleaned up and could lead to confusing errors (especially if 
> the old, leaked resources share global state with the new SparkContext).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2883) Spark Support for ORCFile format

2015-04-02 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393141#comment-14393141
 ] 

Zhan Zhang edited comment on SPARK-2883 at 4/2/15 7:54 PM:
---

The following code demonstrates the usage of the ORC support.
@climberus the following examples demonstrate how to use it:

{code}
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._

// schema
case class AllDataTypes(
  stringField: String,
  intField: Int,
  longField: Long,
  floatField: Float,
  doubleField: Double,
  shortField: Short,
  byteField: Byte,
  booleanField: Boolean)

// saveAsOrcFile
val range = (0 to 255)
val data = sc.parallelize(range).map(x =>
  AllDataTypes(s"$x", x, x.toLong, x.toFloat, x.toDouble, x.toShort, x.toByte, x % 2 == 0))
data.toDF().saveAsOrcFile("orcTest")

// read the ORC file back
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val orcTest = hiveContext.orcFile("orcTest")
orcTest.registerTempTable("orcTest")
hiveContext.sql("SELECT * from orcTest where intfield>185").collect.foreach(println)

// new data source API, read
hiveContext.sql("create temporary table orc using org.apache.spark.sql.hive.orc OPTIONS (path \"orcTest\")")
hiveContext.sql("select * from orc").collect.foreach(println)
val table = hiveContext.sql("select * from orc")

// new data source API, write
table.saveAsTable("table", "org.apache.spark.sql.hive.orc")
val hiveOrc = hiveContext.orcFile("/user/hive/warehouse/table")
hiveOrc.registerTempTable("hiveOrc")
hiveContext.sql("select * from hiveOrc").collect.foreach(println)
table.saveAsOrcFile("/user/ambari-qa/table")
hiveContext.sql("create temporary table normal_orc_as_source USING org.apache.spark.sql.hive.orc OPTIONS (path 'saveTable') as select * from table")
{code}


was (Author: zzhan):
The following code demonstrates the usage of the ORC support.



import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
//saveAsOrcFile
case class AllDataTypes(
stringField: String,
intField: Int,
longField: Long,
floatField: Float,
doubleField: Double,
shortField: Short,
byteField: Byte,
booleanField: Boolean)

val range = (0 to 255)
val data = sc.parallelize(range).map(x => AllDataTypes(s"$x", x, x.toLong, 
x.toFloat, x.toDouble, x.toShort, x.toByte, x % 2 == 0))
data.toDF().saveAsOrcFile("orcTest")
//read orcFile
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val orcTest = hiveContext.orcFile("orcTest")
orcTest.registerTempTable("orcTest")
hiveContext.sql("SELECT * from orcTest where 
intfield>185").collect.foreach(println)

  hiveContext.sql("create temporary table orc using 
org.apache.spark.sql.hive.orc OPTIONS (path \"orcTest\")")
  hiveContext.sql("select * from orc").collect.foreach(println)
val table = hiveContext.sql("select * from orc")
table.saveAsTable("table", "org.apache.spark.sql.hive.orc")
val hiveOrc = hiveContext.orcFile("/user/hive/warehouse/table")
hiveOrc.registerTempTable("hiveOrc")
hiveContext.sql("select * from hiveOrc").collect.foreach(println)
table.saveAsOrcFile("/user/ambari-qa/table")
hiveContext.sql("create temporary table normal_orc_as_source USING 
org.apache.spark.sql.hive.orc OPTIONS (path 'saveTable') as select * from 
table")

> Spark Support for ORCFile format
> 
>
> Key: SPARK-2883
> URL: https://issues.apache.org/jira/browse/SPARK-2883
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Reporter: Zhan Zhang
>Priority: Blocker
> Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 
> pm jobtracker.png, orc.diff
>
>
> Verify the support of OrcInputFormat in spark, fix issues if exists and add 
> documentation of its usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2883) Spark Support for ORCFile format

2015-04-02 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393141#comment-14393141
 ] 

Zhan Zhang edited comment on SPARK-2883 at 4/2/15 7:54 PM:
---

The following code demonstrates the usage of the ORC support.

import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._

// schema
case class AllDataTypes(
  stringField: String,
  intField: Int,
  longField: Long,
  floatField: Float,
  doubleField: Double,
  shortField: Short,
  byteField: Byte,
  booleanField: Boolean)

// saveAsOrcFile
val range = (0 to 255)
val data = sc.parallelize(range).map(x =>
  AllDataTypes(s"$x", x, x.toLong, x.toFloat, x.toDouble, x.toShort, x.toByte, x % 2 == 0))
data.toDF().saveAsOrcFile("orcTest")

// read orcFile
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val orcTest = hiveContext.orcFile("orcTest")
orcTest.registerTempTable("orcTest")
hiveContext.sql("SELECT * from orcTest where intfield > 185").collect.foreach(println)

// new data source API, read
hiveContext.sql("create temporary table orc using org.apache.spark.sql.hive.orc OPTIONS (path \"orcTest\")")
hiveContext.sql("select * from orc").collect.foreach(println)
val table = hiveContext.sql("select * from orc")

// new data source API, write
table.saveAsTable("table", "org.apache.spark.sql.hive.orc")
val hiveOrc = hiveContext.orcFile("/user/hive/warehouse/table")
hiveOrc.registerTempTable("hiveOrc")
hiveContext.sql("select * from hiveOrc").collect.foreach(println)
table.saveAsOrcFile("/user/ambari-qa/table")
hiveContext.sql("create temporary table normal_orc_as_source USING org.apache.spark.sql.hive.orc OPTIONS (path 'saveTable') as select * from table")


was (Author: zzhan):
The following code demonstrates the usage of the ORC support.
@climberus the following examples demonstrate how to use it:

import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
//schema
case class AllDataTypes(
stringField: String,
intField: Int,
longField: Long,
floatField: Float,
doubleField: Double,
shortField: Short,
byteField: Byte,
booleanField: Boolean)
//saveAsOrcFile
val range = (0 to 255)
val data = sc.parallelize(range).map(x => AllDataTypes(s"$x", x, x.toLong, 
x.toFloat, x.toDouble, x.toShort, x.toByte, x % 2 == 0))
 data.toDF().saveAsOrcFile("orcTest")
//read orcFile
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
//orcFile
val orcTest = hiveContext.orcFile("orcTest")
orcTest.registerTempTable("orcTest")
hiveContext.sql("SELECT * from orcTest where 
intfield>185").collect.foreach(println)
//new data source API, read
  hiveContext.sql("create temporary table orc using 
org.apache.spark.sql.hive.orc OPTIONS (path \"orcTest\")")
  hiveContext.sql("select * from orc").collect.foreach(println)
val table = hiveContext.sql("select * from orc")
// new data source API write
table.saveAsTable("table", "org.apache.spark.sql.hive.orc")
val hiveOrc = hiveContext.orcFile("/user/hive/warehouse/table")
hiveOrc.registerTempTable("hiveOrc")
hiveContext.sql("select * from hiveOrc").collect.foreach(println)
table.saveAsOrcFile("/user/ambari-qa/table")
hiveContext.sql("create temporary table normal_orc_as_source USING 
org.apache.spark.sql.hive.orc OPTIONS (path 'saveTable') as select * from 
table")

> Spark Support for ORCFile format
> 
>
> Key: SPARK-2883
> URL: https://issues.apache.org/jira/browse/SPARK-2883
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Reporter: Zhan Zhang
>Priority: Blocker
> Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 
> pm jobtracker.png, orc.diff
>
>
> Verify the support of OrcInputFormat in spark, fix issues if exists and add 
> documentation of its usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6479) Create off-heap block storage API (internal)

2015-04-02 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-6479:
--
Attachment: SPARK-6479.pdf

This is the updated version of the off-heap store internal API design.
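
For a rough idea of the shape such an internal API could take, a hedged sketch follows; the trait and method names below are illustrative assumptions, and the authoritative design is in the attached PDF.

import java.nio.ByteBuffer

// Illustrative sketch only; names and signatures are assumptions, not the attached design.
trait OffHeapBlockManager {
  def init(executorId: String): Unit                        // set up the external store for this executor
  def putBytes(blockId: String, bytes: ByteBuffer): Unit    // write a block to the off-heap store
  def getBytes(blockId: String): Option[ByteBuffer]         // None if the block is absent or unreadable
  def remove(blockId: String): Boolean
  def contains(blockId: String): Boolean
  def shutdown(): Unit                                      // release external resources on executor exit
}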

> Create off-heap block storage API (internal)
> 
>
> Key: SPARK-6479
> URL: https://issues.apache.org/jira/browse/SPARK-6479
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
> Attachments: SPARK-6479.pdf, SparkOffheapsupportbyHDFS.pdf
>
>
> Would be great to create APIs for off-heap block stores, rather than doing a 
> bunch of if statements everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6578) Outbound channel in network library is not thread-safe, can lead to fetch failures

2015-04-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393284#comment-14393284
 ] 

Apache Spark commented on SPARK-6578:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5336

> Outbound channel in network library is not thread-safe, can lead to fetch 
> failures
> --
>
> Key: SPARK-6578
> URL: https://issues.apache.org/jira/browse/SPARK-6578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 1.3.1, 1.4.0
>
>
> There is a very narrow race in the outbound channel of the network library. 
> While netty guarantees that the inbound channel is thread-safe, the same is 
> not true for the outbound channel: multiple threads can be writing and 
> running the pipeline at the same time.
> This leads to an issue with MessageEncoder and the optimization it performs 
> for zero-copy of file data: since a single RPC can be broken into multiple 
> buffers (for example, when replying to a chunk request), if you have 
> multiple threads writing these RPCs then they can be mixed up in the final 
> socket. That breaks framing and will cause the receiving side to not 
> understand the messages.
> Patch coming up shortly.
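
A minimal sketch of the single-writer idea described above, assuming Netty 4's Channel API; the helper below is illustrative only and is not the actual patch.

import io.netty.channel.Channel

// Funnel every outbound write through the channel's event loop so that only one
// thread runs the outbound pipeline (and MessageEncoder) at a time.
def writeSerialized(channel: Channel, message: AnyRef): Unit = {
  if (channel.eventLoop().inEventLoop()) {
    channel.writeAndFlush(message)
  } else {
    channel.eventLoop().execute(new Runnable {
      override def run(): Unit = { channel.writeAndFlush(message) }
    })
  }
}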



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6112) Provide OffHeap support through HDFS RAM_DISK

2015-04-02 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393287#comment-14393287
 ] 

Zhan Zhang commented on SPARK-6112:
---

The design spec for the API is attached to SPARK-6479; awaiting feedback from the community.

> Provide OffHeap support through HDFS RAM_DISK
> -
>
> Key: SPARK-6112
> URL: https://issues.apache.org/jira/browse/SPARK-6112
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager
>Reporter: Zhan Zhang
> Attachments: SparkOffheapsupportbyHDFS.pdf
>
>
> HDFS Lazy_Persist policy provide possibility to cache the RDD off_heap in 
> hdfs. We may want to provide similar capacity to Tachyon by leveraging hdfs 
> RAM_DISK feature, if the user environment does not have tachyon deployed. 
> With this feature, it potentially provides possibility to share RDD in memory 
> across different jobs and even share with jobs other than spark, and avoid 
> the RDD recomputation if executors crash. 
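
For context, a hedged sketch of how a directory can be marked for HDFS memory storage; it assumes Hadoop 2.6+ with a RAM_DISK tier configured on the DataNodes and HDFS as the default filesystem, and the path is illustrative only.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hdfs.DistributedFileSystem

// Files written under this directory are persisted to the DataNode RAM_DISK tier first.
val fs = FileSystem.get(new Configuration()).asInstanceOf[DistributedFileSystem]
val offHeapDir = new Path("/spark/offheap")   // illustrative path
fs.mkdirs(offHeapDir)
fs.setStoragePolicy(offHeapDir, "LAZY_PERSIST")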



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6079) Use index to speed up StatusTracker.getJobIdsForGroup()

2015-04-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-6079:
--
Fix Version/s: 1.3.1

> Use index to speed up StatusTracker.getJobIdsForGroup()
> ---
>
> Key: SPARK-6079
> URL: https://issues.apache.org/jira/browse/SPARK-6079
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
> Fix For: 1.3.1, 1.4.0
>
>
> {{StatusTracker.getJobIdsForGroup()}} is implemented via a linear scan over a 
> HashMap rather than using an index.  This might be an expensive operation if 
> there are many (e.g. thousands) of retained jobs.  We can add a new index to 
> speed this up.
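
A minimal sketch of the proposed index; the class and field names are illustrative, not the actual JobProgressListener code.

import scala.collection.mutable

// Keep a reverse map from job group to job ids so getJobIdsForGroup becomes a
// single lookup instead of a scan over all retained jobs.
class JobGroupIndex {
  private val jobIdsByGroup = mutable.HashMap.empty[String, mutable.Set[Int]]

  def addJob(jobGroup: String, jobId: Int): Unit = synchronized {
    jobIdsByGroup.getOrElseUpdate(jobGroup, mutable.HashSet.empty[Int]) += jobId
  }

  def removeJob(jobGroup: String, jobId: Int): Unit = synchronized {
    jobIdsByGroup.get(jobGroup).foreach { ids =>
      ids -= jobId
      if (ids.isEmpty) jobIdsByGroup -= jobGroup   // drop empty entries as jobs age out
    }
  }

  def getJobIdsForGroup(jobGroup: String): Seq[Int] = synchronized {
    jobIdsByGroup.get(jobGroup).map(_.toSeq).getOrElse(Seq.empty)
  }
}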



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6685) Use DSYRK to compute AtA in ALS

2015-04-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6685:
-
Description: Now we use DSPR to compute AtA in ALS, which is a Level 2 BLAS 
routine. We should switch to DSYRK to use native BLAS to accelerate the 
computation. The factors should remain dense vectors. And we can pre-allocate a 
buffer to stack vectors and do Level 3. This requires some benchmark to 
demonstrate the improvement.  (was: Now we use DSPR to compute AtA in ALS, 
which is a Level 2 BLAS routine. We should switch to DSYRK to use native BLAS 
to accelerate the computation. The factors should remain dense vectors. And we 
can pre-allocate a buffer to stack vectors and do Level 3.)

> Use DSYRK to compute AtA in ALS
> ---
>
> Key: SPARK-6685
> URL: https://issues.apache.org/jira/browse/SPARK-6685
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Priority: Minor
>
> Now we use DSPR to compute AtA in ALS, which is a Level 2 BLAS routine. We 
> should switch to DSYRK to use native BLAS to accelerate the computation. The 
> factors should remain dense vectors. And we can pre-allocate a buffer to 
> stack vectors and do Level 3. This requires some benchmark to demonstrate the 
> improvement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6685) Use DSYRK to compute AtA in ALS

2015-04-02 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-6685:


 Summary: Use DSYRK to compute AtA in ALS
 Key: SPARK-6685
 URL: https://issues.apache.org/jira/browse/SPARK-6685
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Priority: Minor


Now we use DSPR to compute AtA in ALS, which is a Level 2 BLAS routine. We 
should switch to DSYRK to use native BLAS to accelerate the computation. The 
factors should remain dense vectors. And we can pre-allocate a buffer to stack 
vectors and do Level 3.
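
An illustrative sketch of the buffering idea, assuming netlib-java's BLAS bindings; this is not MLlib's NormalEquation code, and the rank, block size, and addBlock helper are made-up for illustration.

import com.github.fommil.netlib.BLAS

// Replace k rank-1 DSPR updates with one rank-k DSYRK update over a block of factor
// vectors stacked into a pre-allocated column-major buffer. Note that DSYRK updates a
// full (unpacked) n x n array, while DSPR uses packed storage for the triangle.
val blas = BLAS.getInstance()
val rank = 10                                     // factor dimension (made-up value)
val blockSize = 64                                // vectors stacked per update (made-up value)
val buffer = new Array[Double](rank * blockSize)  // reused block of stacked factor vectors
val ata = new Array[Double](rank * rank)          // upper triangle of A^T * A in a full array

def addBlock(numStacked: Int): Unit = {
  // ata := 1.0 * buffer * buffer^T + 1.0 * ata, with buffer viewed as rank x numStacked
  blas.dsyrk("U", "N", rank, numStacked, 1.0, buffer, rank, 1.0, ata, rank)
}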



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6671) Add status command for spark daemons

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6671:
---

Assignee: Apache Spark

> Add status command for spark daemons
> 
>
> Key: SPARK-6671
> URL: https://issues.apache.org/jira/browse/SPARK-6671
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: PRADEEP CHANUMOLU
>Assignee: Apache Spark
>  Labels: easyfix
> Fix For: 1.4.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, using the spark-daemon.sh script we can start and stop the Spark 
> daemons, but we cannot get the status of the daemons. It would be nice to 
> include a status command in the spark-daemon.sh script, through which we can 
> know whether a Spark daemon is alive or not.
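
A hedged sketch of what a status subcommand could check, assuming the usual spark-daemon.sh pid-file convention; the helper, paths, and messages below are illustrative only.

import java.nio.file.{Files, Paths}
import scala.io.Source

def daemonStatus(pidFile: String): String = {
  val path = Paths.get(pidFile)
  if (!Files.exists(path)) {
    "not running (no pid file)"
  } else {
    val src = Source.fromFile(pidFile)
    val pid = try src.mkString.trim finally src.close()
    // On Linux, a live process shows up under /proc/<pid>
    if (Files.exists(Paths.get(s"/proc/$pid"))) s"running (pid $pid)"
    else s"not running (stale pid file, pid $pid)"
  }
}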



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6671) Add status command for spark daemons

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6671:
---

Assignee: (was: Apache Spark)

> Add status command for spark daemons
> 
>
> Key: SPARK-6671
> URL: https://issues.apache.org/jira/browse/SPARK-6671
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: PRADEEP CHANUMOLU
>  Labels: easyfix
> Fix For: 1.4.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, using the spark-daemon.sh script we can start and stop the Spark 
> daemons, but we cannot get the status of the daemons. It would be nice to 
> include a status command in the spark-daemon.sh script, through which we can 
> know whether a Spark daemon is alive or not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6479) Create off-heap block storage API (internal)

2015-04-02 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393331#comment-14393331
 ] 

Reynold Xin commented on SPARK-6479:


Looks good overall. I have some comments, but those might be best left as code 
reviews.

The high-level feedback is that, while you are at it, it would be great to 
document the failure/exception semantics.


> Create off-heap block storage API (internal)
> 
>
> Key: SPARK-6479
> URL: https://issues.apache.org/jira/browse/SPARK-6479
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
> Attachments: SPARK-6479.pdf, SparkOffheapsupportbyHDFS.pdf
>
>
> Would be great to create APIs for off-heap block stores, rather than doing a 
> bunch of if statements everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6414) Spark driver failed with NPE on job cancelation

2015-04-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-6414:
--
Assignee: Hung Lin

> Spark driver failed with NPE on job cancelation
> ---
>
> Key: SPARK-6414
> URL: https://issues.apache.org/jira/browse/SPARK-6414
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.2.1, 1.3.0
>Reporter: Yuri Makhno
>Assignee: Hung Lin
>Priority: Critical
> Fix For: 1.2.2, 1.3.1, 1.4.0
>
>
> When a job group is cancelled, we scan through all jobs to determine which 
> are members of the group. This scan assumes that the job group property is 
> always set. If 'properties' is null in an active job, you get an NPE.
> We just need to make sure we ignore ones where activeJob.properties is null. 
> We should also make sure it works if the particular property is missing.
> https://github.com/apache/spark/blob/branch-1.3/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L678
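
A minimal sketch of the null-safe membership check; this is illustrative only, not the exact DAGScheduler patch, and the property key is assumed to be spark.jobGroup.id.

import java.util.Properties

def belongsToGroup(properties: Properties, groupId: String): Boolean = {
  // Treat a missing Properties object or a missing group id as "not in the group".
  Option(properties)
    .flatMap(p => Option(p.getProperty("spark.jobGroup.id")))
    .exists(_ == groupId)
}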



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6414) Spark driver failed with NPE on job cancelation

2015-04-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-6414.
---
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.1
   1.2.2

Fixed in 1.2.2, 1.3.1, and 1.4.0.

> Spark driver failed with NPE on job cancelation
> ---
>
> Key: SPARK-6414
> URL: https://issues.apache.org/jira/browse/SPARK-6414
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.2.1, 1.3.0
>Reporter: Yuri Makhno
>Assignee: Hung Lin
>Priority: Critical
> Fix For: 1.2.2, 1.3.1, 1.4.0
>
>
> When a job group is cancelled, we scan through all jobs to determine which 
> are members of the group. This scan assumes that the job group property is 
> always set. If 'properties' is null in an active job, you get an NPE.
> We just need to make sure we ignore ones where activeJob.properties is null. 
> We should also make sure it works if the particular property is missing.
> https://github.com/apache/spark/blob/branch-1.3/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L678



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4346) YarnClientSchedulerBack.asyncMonitorApplication should be common with Client.monitorApplication

2015-04-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4346:
-
Affects Version/s: 1.0.0

> YarnClientSchedulerBack.asyncMonitorApplication should be common with 
> Client.monitorApplication
> ---
>
> Key: SPARK-4346
> URL: https://issues.apache.org/jira/browse/SPARK-4346
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, YARN
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>
> The YarnClientSchedulerBackend.asyncMonitorApplication routine should move 
> into ClientBase and be made common with monitorApplication.  Make sure stop 
> is handled properly.
> See discussion on https://github.com/apache/spark/pull/3143



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6578) Outbound channel in network library is not thread-safe, can lead to fetch failures

2015-04-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6578:
---
Fix Version/s: 1.2.2

> Outbound channel in network library is not thread-safe, can lead to fetch 
> failures
> --
>
> Key: SPARK-6578
> URL: https://issues.apache.org/jira/browse/SPARK-6578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 1.2.2, 1.3.1, 1.4.0
>
>
> There is a very narrow race in the outbound channel of the network library. 
> While netty guarantees that the inbound channel is thread-safe, the same is 
> not true for the outbound channel: multiple threads can be writing and 
> running the pipeline at the same time.
> This leads to an issue with MessageEncoder and the optimization it performs 
> for zero-copy of file data: since a single RPC can be broken into multiple 
> buffers (for example, when replying to a chunk request), if you have 
> multiple threads writing these RPCs then they can be mixed up in the final 
> socket. That breaks framing and will cause the receiving side to not 
> understand the messages.
> Patch coming up shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6578) Outbound channel in network library is not thread-safe, can lead to fetch failures

2015-04-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6578:
---
Target Version/s: 1.2.2, 1.3.1, 1.4.0  (was: 1.3.1, 1.4.0)

> Outbound channel in network library is not thread-safe, can lead to fetch 
> failures
> --
>
> Key: SPARK-6578
> URL: https://issues.apache.org/jira/browse/SPARK-6578
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 1.2.2, 1.3.1, 1.4.0
>
>
> There is a very narrow race in the outbound channel of the network library. 
> While netty guarantees that the inbound channel is thread-safe, the same is 
> not true for the outbound channel: multiple threads can be writing and 
> running the pipeline at the same time.
> This leads to an issue with MessageEncoder and the optimization it performs 
> for zero-copy of file data: since a single RPC can be broken into multiple 
> buffers (for example, when replying to a chunk request), if you have 
> multiple threads writing these RPCs then they can be mixed up in the final 
> socket. That breaks framing and will cause the receiving side to not 
> understand the messages.
> Patch coming up shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6443) Could not submit app in standalone cluster mode when HA is enabled

2015-04-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6443:
-
Affects Version/s: 1.0.0

> Could not submit app in standalone cluster mode when HA is enabled
> --
>
> Key: SPARK-6443
> URL: https://issues.apache.org/jira/browse/SPARK-6443
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.0.0
>Reporter: Tao Wang
>Priority: Critical
>
> After digging some codes, I found user could not submit app in standalone 
> cluster mode when HA is enabled. But in client mode it can work.
> Haven't try yet. But I will verify this and file a PR to resolve it if the 
> problem exists.
> 3/23 update:
> I started a HA cluster with zk, and tried to submit SparkPi example with 
> command:
> ./spark-submit  --class org.apache.spark.examples.SparkPi --master 
> spark://doggie153:7077,doggie159:7077 --deploy-mode cluster 
> ../lib/spark-examples-1.2.0-hadoop2.4.0.jar 
> and it failed with error message:
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: 
> spark://doggie153:7077,doggie159:7077
> akka.actor.ActorInitializationException: exception during creation
> at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
> at akka.actor.ActorCell.create(ActorCell.scala:596)
> at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
> at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
> at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
> at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: org.apache.spark.SparkException: Invalid master URL: 
> spark://doggie153:7077,doggie159:7077
> at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
> at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
> at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
> at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35)
> at akka.actor.ActorCell.create(ActorCell.scala:580)
> ... 9 more
> But in client mode it ended with correct result. So my guess is right. I will 
> fix it in the related PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6675) HiveContext setConf is not stable

2015-04-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6675:

Labels:   (was: patch)

> HiveContext setConf is not stable
> -
>
> Key: SPARK-6675
> URL: https://issues.apache.org/jira/browse/SPARK-6675
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: AWS ec2 xlarge2 cluster launched by spark's script
>Reporter: Julien
>
> I find HiveContext.setConf does not work correctly. Here are some code 
> snippets showing the problem:
> snippet 1:
> {code}
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.{SparkConf, SparkContext}
> object Main extends App {
>   val conf = new SparkConf()
> .setAppName("context-test")
> .setMaster("local[8]")
>   val sc = new SparkContext(conf)
>   val hc = new HiveContext(sc)
>   hc.setConf("spark.sql.shuffle.partitions", "10")
>   hc.setConf("hive.metastore.warehouse.dir", 
> "/home/spark/hive/warehouse_test")
>   hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
>   hc.getAllConfs filter(_._1.contains("shuffle.partitions")) foreach println
> }
> {code}
> Results:
> (hive.metastore.warehouse.dir,/home/spark/hive/warehouse_test)
> (spark.sql.shuffle.partitions,10)
> snippet 2:
> {code}
> ...
>   hc.setConf("hive.metastore.warehouse.dir", 
> "/home/spark/hive/warehouse_test")
>   hc.setConf("spark.sql.shuffle.partitions", "10")
>   hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
>   hc.getAllConfs filter(_._1.contains("shuffle.partitions")) foreach println
> ...
> {code}
> Results:
> (hive.metastore.warehouse.dir,/user/hive/warehouse)
> (spark.sql.shuffle.partitions,10)
> You can see that I just permuted the two setConf calls, and that leads to two 
> different Hive configurations.
> It seems that HiveContext cannot set a new value for the 
> "hive.metastore.warehouse.dir" key in a single (or the first) "setConf" call.
> You need another "setConf" call before changing 
> "hive.metastore.warehouse.dir"; for example, setting 
> "hive.metastore.warehouse.dir" twice (see snippet 3) gives the same result as snippet 1.
> snippet 3:
> {code}
> ...
>   hc.setConf("hive.metastore.warehouse.dir", 
> "/home/spark/hive/warehouse_test")
>   hc.setConf("hive.metastore.warehouse.dir", 
> "/home/spark/hive/warehouse_test")
>   hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
> ...
> {code}
> Results:
> (hive.metastore.warehouse.dir,/home/spark/hive/warehouse_test)
> You can reproduce this if you move to the latest branch-1.3 (1.3.1-snapshot, 
> htag = 7d029cb1eb6f1df1bce1a3f5784fb7ce2f981a33)
> I have also tested the released 1.3.0 (htag = 
> 4aaf48d46d13129f0f9bdafd771dd80fe568a7dc). It has the same problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6675) HiveContext setConf is not stable

2015-04-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6675:

Target Version/s: 1.4.0  (was: 1.3.0)

> HiveContext setConf is not stable
> -
>
> Key: SPARK-6675
> URL: https://issues.apache.org/jira/browse/SPARK-6675
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: AWS ec2 xlarge2 cluster launched by spark's script
>Reporter: Julien
>
> I find HiveContext.setConf does not work correctly. Here are some code 
> snippets showing the problem:
> snippet 1:
> {code}
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.{SparkConf, SparkContext}
> object Main extends App {
>   val conf = new SparkConf()
> .setAppName("context-test")
> .setMaster("local[8]")
>   val sc = new SparkContext(conf)
>   val hc = new HiveContext(sc)
>   hc.setConf("spark.sql.shuffle.partitions", "10")
>   hc.setConf("hive.metastore.warehouse.dir", 
> "/home/spark/hive/warehouse_test")
>   hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
>   hc.getAllConfs filter(_._1.contains("shuffle.partitions")) foreach println
> }
> {code}
> Results:
> (hive.metastore.warehouse.dir,/home/spark/hive/warehouse_test)
> (spark.sql.shuffle.partitions,10)
> snippet 2:
> {code}
> ...
>   hc.setConf("hive.metastore.warehouse.dir", 
> "/home/spark/hive/warehouse_test")
>   hc.setConf("spark.sql.shuffle.partitions", "10")
>   hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
>   hc.getAllConfs filter(_._1.contains("shuffle.partitions")) foreach println
> ...
> {code}
> Results:
> (hive.metastore.warehouse.dir,/user/hive/warehouse)
> (spark.sql.shuffle.partitions,10)
> You can see that I just permuted the two setConf calls, and that leads to two 
> different Hive configurations.
> It seems that HiveContext cannot set a new value for the 
> "hive.metastore.warehouse.dir" key in a single (or the first) "setConf" call.
> You need another "setConf" call before changing 
> "hive.metastore.warehouse.dir"; for example, setting 
> "hive.metastore.warehouse.dir" twice (see snippet 3) gives the same result as snippet 1.
> snippet 3:
> {code}
> ...
>   hc.setConf("hive.metastore.warehouse.dir", 
> "/home/spark/hive/warehouse_test")
>   hc.setConf("hive.metastore.warehouse.dir", 
> "/home/spark/hive/warehouse_test")
>   hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
> ...
> {code}
> Results:
> (hive.metastore.warehouse.dir,/home/spark/hive/warehouse_test)
> You can reproduce this if you move to the latest branch-1.3 (1.3.1-snapshot, 
> htag = 7d029cb1eb6f1df1bce1a3f5784fb7ce2f981a33)
> I have also tested the released 1.3.0 (htag = 
> 4aaf48d46d13129f0f9bdafd771dd80fe568a7dc). It has the same problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6443) Support HA in standalone cluster modehen HA is enabled

2015-04-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6443:
-
Summary: Support HA in standalone cluster modehen HA is enabled  (was: 
Could not submit app in standalone cluster mode when HA is enabled)

> Support HA in standalone cluster modehen HA is enabled
> --
>
> Key: SPARK-6443
> URL: https://issues.apache.org/jira/browse/SPARK-6443
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.0.0
>Reporter: Tao Wang
>Priority: Critical
>
> After digging some codes, I found user could not submit app in standalone 
> cluster mode when HA is enabled. But in client mode it can work.
> Haven't try yet. But I will verify this and file a PR to resolve it if the 
> problem exists.
> 3/23 update:
> I started a HA cluster with zk, and tried to submit SparkPi example with 
> command:
> ./spark-submit  --class org.apache.spark.examples.SparkPi --master 
> spark://doggie153:7077,doggie159:7077 --deploy-mode cluster 
> ../lib/spark-examples-1.2.0-hadoop2.4.0.jar 
> and it failed with error message:
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: 
> spark://doggie153:7077,doggie159:7077
> akka.actor.ActorInitializationException: exception during creation
> at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
> at akka.actor.ActorCell.create(ActorCell.scala:596)
> at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
> at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
> at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
> at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: org.apache.spark.SparkException: Invalid master URL: 
> spark://doggie153:7077,doggie159:7077
> at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
> at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
> at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
> at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35)
> at akka.actor.ActorCell.create(ActorCell.scala:580)
> ... 9 more
> But in client mode it ended with correct result. So my guess is right. I will 
> fix it in the related PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6443) Support HA in standalone cluster mode

2015-04-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6443:
-
Priority: Major  (was: Critical)

> Support HA in standalone cluster mode
> -
>
> Key: SPARK-6443
> URL: https://issues.apache.org/jira/browse/SPARK-6443
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.0.0
>Reporter: Tao Wang
>
> After digging some codes, I found user could not submit app in standalone 
> cluster mode when HA is enabled. But in client mode it can work.
> Haven't try yet. But I will verify this and file a PR to resolve it if the 
> problem exists.
> 3/23 update:
> I started a HA cluster with zk, and tried to submit SparkPi example with 
> command:
> ./spark-submit  --class org.apache.spark.examples.SparkPi --master 
> spark://doggie153:7077,doggie159:7077 --deploy-mode cluster 
> ../lib/spark-examples-1.2.0-hadoop2.4.0.jar 
> and it failed with error message:
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: 
> spark://doggie153:7077,doggie159:7077
> akka.actor.ActorInitializationException: exception during creation
> at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
> at akka.actor.ActorCell.create(ActorCell.scala:596)
> at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
> at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
> at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
> at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: org.apache.spark.SparkException: Invalid master URL: 
> spark://doggie153:7077,doggie159:7077
> at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
> at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
> at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
> at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35)
> at akka.actor.ActorCell.create(ActorCell.scala:580)
> ... 9 more
> But in client mode it ended with correct result. So my guess is right. I will 
> fix it in the related PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6443) Support HA in standalone cluster mode

2015-04-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6443:
-
Summary: Support HA in standalone cluster mode  (was: Support HA in 
standalone cluster modehen HA is enabled)

> Support HA in standalone cluster mode
> -
>
> Key: SPARK-6443
> URL: https://issues.apache.org/jira/browse/SPARK-6443
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.0.0
>Reporter: Tao Wang
>Priority: Critical
>
> After digging some codes, I found user could not submit app in standalone 
> cluster mode when HA is enabled. But in client mode it can work.
> Haven't try yet. But I will verify this and file a PR to resolve it if the 
> problem exists.
> 3/23 update:
> I started a HA cluster with zk, and tried to submit SparkPi example with 
> command:
> ./spark-submit  --class org.apache.spark.examples.SparkPi --master 
> spark://doggie153:7077,doggie159:7077 --deploy-mode cluster 
> ../lib/spark-examples-1.2.0-hadoop2.4.0.jar 
> and it failed with error message:
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: 
> spark://doggie153:7077,doggie159:7077
> akka.actor.ActorInitializationException: exception during creation
> at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
> at akka.actor.ActorCell.create(ActorCell.scala:596)
> at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
> at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
> at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
> at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: org.apache.spark.SparkException: Invalid master URL: 
> spark://doggie153:7077,doggie159:7077
> at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
> at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
> at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
> at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35)
> at akka.actor.ActorCell.create(ActorCell.scala:580)
> ... 9 more
> But in client mode it ended with correct result. So my guess is right. I will 
> fix it in the related PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6669) Lock metastore client in analyzeTable

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6669:
---

Assignee: Apache Spark  (was: Michael Armbrust)

> Lock metastore client in analyzeTable
> -
>
> Key: SPARK-6669
> URL: https://issues.apache.org/jira/browse/SPARK-6669
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6669) Lock metastore client in analyzeTable

2015-04-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393582#comment-14393582
 ] 

Apache Spark commented on SPARK-6669:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/5333

> Lock metastore client in analyzeTable
> -
>
> Key: SPARK-6669
> URL: https://issues.apache.org/jira/browse/SPARK-6669
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Yin Huai
>Assignee: Michael Armbrust
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6669) Lock metastore client in analyzeTable

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6669:
---

Assignee: Michael Armbrust  (was: Apache Spark)

> Lock metastore client in analyzeTable
> -
>
> Key: SPARK-6669
> URL: https://issues.apache.org/jira/browse/SPARK-6669
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Yin Huai
>Assignee: Michael Armbrust
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6479) Create off-heap block storage API (internal)

2015-04-02 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-6479:
--
Attachment: SPARK-6479OffheapAPIdesign.pdf

Added the overall design and an example for handling failure cases.

> Create off-heap block storage API (internal)
> 
>
> Key: SPARK-6479
> URL: https://issues.apache.org/jira/browse/SPARK-6479
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
> Attachments: SPARK-6479.pdf, SPARK-6479OffheapAPIdesign.pdf, 
> SparkOffheapsupportbyHDFS.pdf
>
>
> Would be great to create APIs for off-heap block stores, rather than doing a 
> bunch of if statements everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6443) Support HA in standalone cluster mode

2015-04-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6443:
-
Description: 
After digging some codes, I found user could not submit app in standalone 
cluster mode when HA is enabled. But in client mode it can work.

Haven't try yet. But I will verify this and file a PR to resolve it if the 
problem exists.

3/23 update:
I started a HA cluster with zk, and tried to submit SparkPi example with 
command:
./spark-submit  --class org.apache.spark.examples.SparkPi --master 
spark://doggie153:7077,doggie159:7077 --deploy-mode cluster 
../lib/spark-examples-1.2.0-hadoop2.4.0.jar 

and it failed with error message:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: 
spark://doggie153:7077,doggie159:7077
akka.actor.ActorInitializationException: exception during creation
at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
at akka.actor.ActorCell.create(ActorCell.scala:596)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.spark.SparkException: Invalid master URL: 
spark://doggie153:7077,doggie159:7077
at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35)
at akka.actor.ActorCell.create(ActorCell.scala:580)
... 9 more

But in client mode it ended with correct result. So my guess is right. I will 
fix it in the related PR.

=== EDIT by Andrew ===

From a quick survey in the code I can confirm that client mode does support 
this. [This 
line|https://github.com/apache/spark/blob/e3202aa2e9bd140effbcf2a7a02b90cb077e760b/core/src/main/scala/org/apache/spark/SparkContext.scala#L2162] 
splits the master URLs by comma and passes these URLs into the AppClient. In 
standalone cluster mode, there is not equivalent logic to even split the 
master URLs, whether in the old submission gateway (o.a.s.deploy.Client) or in 
the new one (o.a.s.deploy.rest.StandaloneRestClient).

Thus, this is an unsupported feature, not a bug!
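
For illustration, the kind of expansion client mode effectively performs; this is a hypothetical helper, not the SparkContext code path.

// "spark://doggie153:7077,doggie159:7077" ->
//   Array("spark://doggie153:7077", "spark://doggie159:7077")
def expandMasterUrls(master: String): Array[String] = {
  master.stripPrefix("spark://").split(",").map(host => "spark://" + host)
}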

  was:
After digging some codes, I found user could not submit app in standalone 
cluster mode when HA is enabled. But in client mode it can work.

Haven't try yet. But I will verify this and file a PR to resolve it if the 
problem exists.

3/23 update:
I started a HA cluster with zk, and tried to submit SparkPi example with 
command:
./spark-submit  --class org.apache.spark.examples.SparkPi --master 
spark://doggie153:7077,doggie159:7077 --deploy-mode cluster 
../lib/spark-examples-1.2.0-hadoop2.4.0.jar 

and it failed with error message:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: 
spark://doggie153:7077,doggie159:7077
akka.actor.ActorInitializationException: exception during creation
at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
at akka.actor.ActorCell.create(ActorCell.scala:596)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.spark.SparkException: Invalid master URL: 
spark://doggie153:7077,doggie159:7077
at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
at akka.actor.Actor$class.ar

[jira] [Comment Edited] (SPARK-6479) Create off-heap block storage API (internal)

2015-04-02 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393590#comment-14393590
 ] 

Zhan Zhang edited comment on SPARK-6479 at 4/2/15 10:23 PM:


[~rxin] Thanks for the feedback. I updated the document with the basic design 
and an example of handling failure cases thrown by the underlying implementation. 
Basically, the shim layer handles all the exceptions/failure cases and returns 
the expected failure value to the upper layer, similar to the example below.

  override def getBytes(blockId: BlockId): Option[ByteBuffer] = {

try {

  offHeapManager.flatMap(_.getBytes(blockId))

} catch {

  case _ =>logError(s"error in getBytes from $blockId")

None
}
  }


was (Author: zzhan):
[~rxin] Thanks for the feedback. I updated the document with the basic design 
and an example of handling failure cases thrown by the underlying implementation. 
Basically, the shim layer handles all the exceptions/failure cases and returns 
the expected failure value to the upper layer, similar to the example below.

  override def getBytes(blockId: BlockId): Option[ByteBuffer] = {
try {
  offHeapManager.flatMap(_.getBytes(blockId))
} catch {
  case _ =>logError(s"error in getBytes from $blockId")
None
}
  }

> Create off-heap block storage API (internal)
> 
>
> Key: SPARK-6479
> URL: https://issues.apache.org/jira/browse/SPARK-6479
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
> Attachments: SPARK-6479.pdf, SPARK-6479OffheapAPIdesign.pdf, 
> SparkOffheapsupportbyHDFS.pdf
>
>
> Would be great to create APIs for off-heap block stores, rather than doing a 
> bunch of if statements everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6479) Create off-heap block storage API (internal)

2015-04-02 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393590#comment-14393590
 ] 

Zhan Zhang commented on SPARK-6479:
---

[~rxin] Thanks for the feedback. I updated the document with the basic design 
and an example of handling failure cases thrown by the underlying implementation. 
Basically, the shim layer handles all the exceptions/failure cases and returns 
the expected failure value to the upper layer, similar to the example below.

override def getBytes(blockId: BlockId): Option[ByteBuffer] = {
  try {
    offHeapManager.flatMap(_.getBytes(blockId))
  } catch {
    case _ =>
      logError(s"error in getBytes from $blockId")
      None
  }
}

> Create off-heap block storage API (internal)
> 
>
> Key: SPARK-6479
> URL: https://issues.apache.org/jira/browse/SPARK-6479
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
> Attachments: SPARK-6479.pdf, SPARK-6479OffheapAPIdesign.pdf, 
> SparkOffheapsupportbyHDFS.pdf
>
>
> Would be great to create APIs for off-heap block stores, rather than doing a 
> bunch of if statements everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6479) Create off-heap block storage API (internal)

2015-04-02 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393590#comment-14393590
 ] 

Zhan Zhang edited comment on SPARK-6479 at 4/2/15 10:24 PM:


[~rxin] Thanks for the feedback. I updated the document with the basic design 
and an example of handling failure cases thrown by the underlying implementation. 
Basically, the shim layer handles all the exceptions/failure cases and returns 
the expected failure value to the upper layer, similar to the example below.

override def getBytes(blockId: BlockId): Option[ByteBuffer] = {
  try {
    offHeapManager.flatMap(_.getBytes(blockId))
  } catch {
    case _ =>
      logError(s"error in getBytes from $blockId")
      None
  }
}


was (Author: zzhan):
[~rxin] Thanks for the feedback. I updated the document with the basic design 
and an example of handling failure cases thrown by the underlying implementation. 
Basically, the shim layer handles all the exceptions/failure cases and returns 
the expected failure value to the upper layer, similar to the example below.

  override def getBytes(blockId: BlockId): Option[ByteBuffer] = {

try {

  offHeapManager.flatMap(_.getBytes(blockId))

} catch {

  case _ =>logError(s"error in getBytes from $blockId")

None
}
  }

> Create off-heap block storage API (internal)
> 
>
> Key: SPARK-6479
> URL: https://issues.apache.org/jira/browse/SPARK-6479
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
> Attachments: SPARK-6479.pdf, SPARK-6479OffheapAPIdesign.pdf, 
> SparkOffheapsupportbyHDFS.pdf
>
>
> Would be great to create APIs for off-heap block stores, rather than doing a 
> bunch of if statements everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6443) Support HA in standalone cluster mode

2015-04-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6443:
-
Description: 
== EDIT by Andrew ==

From a quick survey in the code I can confirm that client mode does support 
this. [This 
line|https://github.com/apache/spark/blob/e3202aa2e9bd140effbcf2a7a02b90cb077e760b/core/src/main/scala/org/apache/spark/SparkContext.scala#L2162] 
splits the master URLs by comma and passes these URLs into the AppClient. In 
standalone cluster mode, there is simply no equivalent logic to even split the 
master URLs, whether in the old submission gateway (o.a.s.deploy.Client) or in 
the new one (o.a.s.deploy.rest.StandaloneRestClient).

Thus, this is an unsupported feature, not a bug!

== Original description from Tao Wang ==

After digging some codes, I found user could not submit app in standalone 
cluster mode when HA is enabled. But in client mode it can work.

Haven't try yet. But I will verify this and file a PR to resolve it if the 
problem exists.

3/23 update:
I started a HA cluster with zk, and tried to submit SparkPi example with 
command:
./spark-submit  --class org.apache.spark.examples.SparkPi --master 
spark://doggie153:7077,doggie159:7077 --deploy-mode cluster 
../lib/spark-examples-1.2.0-hadoop2.4.0.jar 

and it failed with error message:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: 
spark://doggie153:7077,doggie159:7077
akka.actor.ActorInitializationException: exception during creation
at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
at akka.actor.ActorCell.create(ActorCell.scala:596)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.spark.SparkException: Invalid master URL: 
spark://doggie153:7077,doggie159:7077
at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35)
at akka.actor.ActorCell.create(ActorCell.scala:580)
... 9 more

But in client mode it ended with the correct result, so my guess is right. I 
will fix it in the related PR.

  was:
== EDIT by Andrew ==

From a quick survey in the code I can confirm that client mode does support
this. [This
line|https://github.com/apache/spark/blob/e3202aa2e9bd140effbcf2a7a02b90cb077e760b/core/src/main/scala/org/apache/spark/SparkContext.scala#L2162]
splits the master URLs by comma and passes these URLs into the AppClient. In
standalone cluster mode, there is no equivalent logic to even split the master
URLs, whether in the old submission gateway (o.a.s.deploy.Client) or in the
new one (o.a.s.deploy.rest.StandaloneRestClient).

Thus, this is an unsupported feature, not a bug!

== Original description from Tao Wang ==

After digging through the code, I found that users cannot submit an app in 
standalone cluster mode when HA is enabled, but in client mode it works.

I haven't tried it yet, but I will verify this and file a PR to resolve it if 
the problem exists.

3/23 update:
I started an HA cluster with ZooKeeper and tried to submit the SparkPi example 
with the command:
./spark-submit  --class org.apache.spark.examples.SparkPi --master 
spark://doggie153:7077,doggie159:7077 --deploy-mode cluster 
../lib/spark-examples-1.2.0-hadoop2.4.0.jar 

and it failed with the error message:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: 
spark://doggie153:7077,doggie159:7077
akka.actor.ActorInitializationException: exception during creation
at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
at akka.actor.ActorCell.create(ActorCell.scala:596)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
at akka.dispatch.Mailbox.run(Mailbox.scala

[jira] [Updated] (SPARK-6443) Support HA in standalone cluster mode

2015-04-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6443:
-
Description: 
== EDIT by Andrew ==

From a quick survey in the code I can confirm that client mode does support
this. [This
line|https://github.com/apache/spark/blob/e3202aa2e9bd140effbcf2a7a02b90cb077e760b/core/src/main/scala/org/apache/spark/SparkContext.scala#L2162]
splits the master URLs by comma and passes these URLs into the AppClient. In
standalone cluster mode, there is no equivalent logic to even split the master
URLs, whether in the old submission gateway (o.a.s.deploy.Client) or in the
new one (o.a.s.deploy.rest.StandaloneRestClient).

Thus, this is an unsupported feature, not a bug!

== Original description from Tao Wang ==

After digging through the code, I found that users cannot submit an app in 
standalone cluster mode when HA is enabled, but in client mode it works.

I haven't tried it yet, but I will verify this and file a PR to resolve it if 
the problem exists.

3/23 update:
I started an HA cluster with ZooKeeper and tried to submit the SparkPi example 
with the command:
./spark-submit  --class org.apache.spark.examples.SparkPi --master 
spark://doggie153:7077,doggie159:7077 --deploy-mode cluster 
../lib/spark-examples-1.2.0-hadoop2.4.0.jar 

and it failed with the error message:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: 
spark://doggie153:7077,doggie159:7077
akka.actor.ActorInitializationException: exception during creation
at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
at akka.actor.ActorCell.create(ActorCell.scala:596)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.spark.SparkException: Invalid master URL: 
spark://doggie153:7077,doggie159:7077
at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35)
at akka.actor.ActorCell.create(ActorCell.scala:580)
... 9 more

But in client mode it ended with the correct result, so my guess is right. I 
will fix it in the related PR.

  was:
After digging through the code, I found that users cannot submit an app in 
standalone cluster mode when HA is enabled, but in client mode it works.

I haven't tried it yet, but I will verify this and file a PR to resolve it if 
the problem exists.

3/23 update:
I started an HA cluster with ZooKeeper and tried to submit the SparkPi example 
with the command:
./spark-submit  --class org.apache.spark.examples.SparkPi --master 
spark://doggie153:7077,doggie159:7077 --deploy-mode cluster 
../lib/spark-examples-1.2.0-hadoop2.4.0.jar 

and it failed with the error message:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: 
spark://doggie153:7077,doggie159:7077
akka.actor.ActorInitializationException: exception during creation
at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
at akka.actor.ActorCell.create(ActorCell.scala:596)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.spark.SparkException: Invalid master URL: 
spark://doggie153:7077,doggie159:7077
at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
at org.apache.spark.deploy.Cl

[jira] [Assigned] (SPARK-2669) Hadoop configuration is not localised when submitting job in yarn-cluster mode

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2669:
---

Assignee: (was: Apache Spark)

> Hadoop configuration is not localised when submitting job in yarn-cluster mode
> --
>
> Key: SPARK-2669
> URL: https://issues.apache.org/jira/browse/SPARK-2669
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Maxim Ivanov
>
> I'd like to propose a fix for a problem where the Hadoop configuration is not 
> localized when a job is submitted in yarn-cluster mode. Here is a description 
> from the GitHub pull request https://github.com/apache/spark/pull/1574:
> This patch fixes a problem where, when the Spark driver runs in a container
> managed by the YARN ResourceManager, it inherits its configuration from the
> NodeManager process, which can differ from the Hadoop configuration present
> on the client (the submitting machine). The problem is most visible when the
> fs.defaultFS property differs between the two.
> Hadoop MR solves this by serializing the client's Hadoop configuration into
> job.xml in the application staging directory and making the ApplicationMaster
> use it. That guarantees that, regardless of the execution nodes'
> configurations, all application containers use a config identical to the one
> on the client side.
> This patch uses a similar approach. YARN ClientBase serializes the
> configuration and adds it to ClientDistributedCacheManager under the
> "job.xml" link name. ClientDistributedCacheManager then uses the Hadoop
> localizer to deliver it to every container started by the application,
> including the one running the Spark driver.
> YARN ClientBase also adds a "SPARK_LOCAL_HADOOPCONF" env variable to the AM
> container request, which SparkHadoopUtil.newConfiguration then uses to
> trigger the new behavior: the machine-wide Hadoop configuration is merged
> with the application-specific job.xml (exactly as it is done in Hadoop MR).
> SparkContext then follows the same approach, adding the
> SPARK_LOCAL_HADOOPCONF env variable to all spawned containers so that they
> use the client-side Hadoop configuration.
> All references to "new Configuration()" that might be executed on the YARN
> cluster side are also changed to use SparkHadoopUtil.get.conf.
> Please note that this fixes only core Spark, the part I am comfortable
> testing and verifying. I didn't descend into the streaming/shark directories,
> so things might need to be changed there too.
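
As a rough sketch of the mechanism described above (serialize the client's
Configuration into a job.xml-style file so it can travel with the application,
then overlay it on the node-local configuration inside the container); the
object and method names below are illustrative, not the patch's actual code:

import java.io.{File, FileOutputStream}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object HadoopConfLocalization {
  // Client side: dump the effective client configuration so it can be shipped
  // (e.g. via the distributed cache) under the "job.xml" link name.
  def writeJobXml(conf: Configuration, out: File): Unit = {
    val os = new FileOutputStream(out)
    try conf.writeXml(os) finally os.close()
  }

  // Container side: start from the machine-wide configuration and merge the
  // shipped job.xml on top, mirroring what Hadoop MR does.
  def newConfiguration(localJobXml: Option[File]): Configuration = {
    val conf = new Configuration()
    localJobXml.filter(_.exists()).foreach(f => conf.addResource(new Path(f.toURI)))
    conf
  }
}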



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2669) Hadoop configuration is not localised when submitting job in yarn-cluster mode

2015-04-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2669:
---

Assignee: Apache Spark

> Hadoop configuration is not localised when submitting job in yarn-cluster mode
> --
>
> Key: SPARK-2669
> URL: https://issues.apache.org/jira/browse/SPARK-2669
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Maxim Ivanov
>Assignee: Apache Spark
>
> I'd like to propose a fix for a problem where the Hadoop configuration is not 
> localized when a job is submitted in yarn-cluster mode. Here is a description 
> from the GitHub pull request https://github.com/apache/spark/pull/1574:
> This patch fixes a problem where, when the Spark driver runs in a container
> managed by the YARN ResourceManager, it inherits its configuration from the
> NodeManager process, which can differ from the Hadoop configuration present
> on the client (the submitting machine). The problem is most visible when the
> fs.defaultFS property differs between the two.
> Hadoop MR solves this by serializing the client's Hadoop configuration into
> job.xml in the application staging directory and making the ApplicationMaster
> use it. That guarantees that, regardless of the execution nodes'
> configurations, all application containers use a config identical to the one
> on the client side.
> This patch uses a similar approach. YARN ClientBase serializes the
> configuration and adds it to ClientDistributedCacheManager under the
> "job.xml" link name. ClientDistributedCacheManager then uses the Hadoop
> localizer to deliver it to every container started by the application,
> including the one running the Spark driver.
> YARN ClientBase also adds a "SPARK_LOCAL_HADOOPCONF" env variable to the AM
> container request, which SparkHadoopUtil.newConfiguration then uses to
> trigger the new behavior: the machine-wide Hadoop configuration is merged
> with the application-specific job.xml (exactly as it is done in Hadoop MR).
> SparkContext then follows the same approach, adding the
> SPARK_LOCAL_HADOOPCONF env variable to all spawned containers so that they
> use the client-side Hadoop configuration.
> All references to "new Configuration()" that might be executed on the YARN
> cluster side are also changed to use SparkHadoopUtil.get.conf.
> Please note that this fixes only core Spark, the part I am comfortable
> testing and verifying. I didn't descend into the streaming/shark directories,
> so things might need to be changed there too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


