[jira] [Commented] (SPARK-8417) spark-class has illegal statement

2015-06-24 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599899#comment-14599899
 ] 

Kan Zhang commented on SPARK-8417:
--

[~blipe] how did you reproduce the error you saw? The command in question looks like 
process substitution; see http://www.tldp.org/LDP/abs/html/process-sub.html
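
For reference, `done < <(command)` feeds the output of `command` into the loop's 
standard input; process substitution is a bash feature and is not part of POSIX sh. 
Below is a rough Python analog of that pattern, only as a minimal sketch with a 
placeholder child command (not the actual spark-class launcher invocation):

{code}
import subprocess
import sys

# Rough analog of bash's `while read ARG; do ...; done < <(command)`:
# read a child process's stdout line by line. The child command here is
# only a placeholder, not the real launcher invocation from spark-class.
child = [sys.executable, "-c", "print('arg one'); print('arg two')"]
with subprocess.Popen(child, stdout=subprocess.PIPE, text=True) as proc:
    args = [line.rstrip("\n") for line in proc.stdout]

print(args)  # ['arg one', 'arg two']
{code}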

> spark-class has illegal statement
> -
>
> Key: SPARK-8417
> URL: https://issues.apache.org/jira/browse/SPARK-8417
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.4.0
>Reporter: jweinste
>
> spark-class contains an illegal statement:
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
> The complaint is:
> ./bin/spark-class: line 100: syntax error near unexpected token `<'






[jira] [Commented] (SPARK-8129) Securely pass auth secrets to executors in standalone cluster mode

2015-06-18 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592243#comment-14592243
 ] 

Kan Zhang commented on SPARK-8129:
--

I checked on my Linux box that environment variables are only shown by `ps` when it 
is run by the process owner or root. That is OK for our purpose of preventing 
non-Spark users from obtaining the auth secret. Please report if this does not hold 
in your environment.
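
As a minimal sketch of the property being relied on (the variable name below is 
hypothetical, not the one Spark uses): a secret placed in a child process's 
environment is inherited by the child but never appears in its argv, so it does not 
show up in `ps -ef` output, and the environment itself is only readable by the 
process owner and root.

{code}
import os
import subprocess
import sys

# Hypothetical variable name, used only for illustration.
env = dict(os.environ, DEMO_AUTH_SECRET="not-a-real-secret")

# The child reads the secret from its environment; the secret is not part
# of the child's command line, so `ps -ef` run by other users won't show it.
child = [sys.executable, "-c", "import os; print(os.environ['DEMO_AUTH_SECRET'])"]
print(subprocess.check_output(child, env=env, text=True).strip())
{code}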

> Securely pass auth secrets to executors in standalone cluster mode
> --
>
> Key: SPARK-8129
> URL: https://issues.apache.org/jira/browse/SPARK-8129
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Spark Core
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>Priority: Minor
> Fix For: 1.5.0
>
>
> Currently, when authentication is turned on, the standalone cluster manager 
> passes auth secrets to executors (also drivers in cluster mode) as java 
> options on the command line, which isn't secure. The passed secret can be 
> seen by anyone running 'ps' command, e.g.,
> bq.  501 94787 94734   0  2:32PM ?? 0:00.78 
> /Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
> -cp 
> /Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
>  -Xms512M -Xmx512M 
> *-Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
> -Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
> --executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
> app-20150605143259- --worker-url 
> akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker






[jira] [Updated] (SPARK-8129) Securely pass auth secrets to executors in standalone cluster mode

2015-06-06 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-8129:
-
Description: 
Currently, when authentication is turned on, the standalone cluster manager 
passes auth secrets to executors (also drivers in cluster mode) as java options 
on the command line, which isn't secure. The passed secret can be seen by 
anyone running 'ps' command, e.g.,


bq.  501 94787 94734   0  2:32PM ?? 0:00.78 
/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
-cp 
/Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
 -Xms512M -Xmx512M 
*-Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
-Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
--executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
app-20150605143259- --worker-url 
akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker



  was:
Currently, when authentication is turned on, cluster manager passes auth 
secrets to executors (also drivers in cluster mode) as java options on the 
command line, which isn't secure. The passed secret can be seen by anyone 
running 'ps' command, e.g.,


bq.  501 94787 94734   0  2:32PM ?? 0:00.78 
/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
-cp 
/Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
 -Xms512M -Xmx512M 
*-Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
-Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
--executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
app-20150605143259- --worker-url 
akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker




> Securely pass auth secrets to executors in standalone cluster mode
> --
>
> Key: SPARK-8129
> URL: https://issues.apache.org/jira/browse/SPARK-8129
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Spark Core
>Reporter: Kan Zhang
>Priority: Critical
>
> Currently, when authentication is turned on, the standalone cluster manager 
> passes auth secrets to executors (also drivers in cluster mode) as java 
> options on the command line, which isn't secure. The passed secret can be 
> seen by anyone running 'ps' command, e.g.,
> bq.  501 94787 94734   0  2:32PM ?? 0:00.78 
> /Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
> -cp 
> /Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
>  -Xms512M -Xmx512M 
> *-Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
> -Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
> --executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
> app-20150605143259- --worker-url 
> akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker






[jira] [Updated] (SPARK-8129) Securely pass auth secrets to executors in standalone cluster mode

2015-06-05 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-8129:
-
Summary: Securely pass auth secrets to executors in standalone cluster mode 
 (was: Securely pass auth secret to executors in standalone cluster mode)

> Securely pass auth secrets to executors in standalone cluster mode
> --
>
> Key: SPARK-8129
> URL: https://issues.apache.org/jira/browse/SPARK-8129
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Spark Core
>Reporter: Kan Zhang
>Priority: Critical
>
> Currently, when authentication is turned on, cluster manager passes auth 
> secrets to executors (also drivers in cluster mode) as java options on the 
> command line, which isn't secure. The passed secret can be seen by anyone 
> running 'ps' command, e.g.,
> bq.  501 94787 94734   0  2:32PM ?? 0:00.78 
> /Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
> -cp 
> /Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
>  -Xms512M -Xmx512M 
> *-Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
> -Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
> --executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
> app-20150605143259- --worker-url 
> akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker






[jira] [Updated] (SPARK-8129) Securely pass auth secret to executors in standalone cluster mode

2015-06-05 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-8129:
-
Description: 
Currently, when authentication is turned on, cluster manager passes auth 
secrets to executors (also drivers in cluster mode) as java options on the 
command line, which isn't secure. The passed secret can be seen by anyone 
running 'ps' command, e.g.,


bq.  501 94787 94734   0  2:32PM ?? 0:00.78 
/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
-cp 
/Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
 -Xms512M -Xmx512M 
*-Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
-Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
--executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
app-20150605143259- --worker-url 
akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker



  was:
Currently, when authentication is turned on, Worker passes auth secret to 
executors (also drivers in cluster mode) as java options on the command line, 
which isn't secure. The passed secret can be seen by anyone running 'ps' 
command, e.g.,


bq.  501 94787 94734   0  2:32PM ?? 0:00.78 
/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
-cp 
/Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
 -Xms512M -Xmx512M 
*-Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
-Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
--executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
app-20150605143259- --worker-url 
akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker




> Securely pass auth secret to executors in standalone cluster mode
> -
>
> Key: SPARK-8129
> URL: https://issues.apache.org/jira/browse/SPARK-8129
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Spark Core
>Reporter: Kan Zhang
>Priority: Critical
>
> Currently, when authentication is turned on, cluster manager passes auth 
> secrets to executors (also drivers in cluster mode) as java options on the 
> command line, which isn't secure. The passed secret can be seen by anyone 
> running 'ps' command, e.g.,
> bq.  501 94787 94734   0  2:32PM ?? 0:00.78 
> /Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
> -cp 
> /Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
>  -Xms512M -Xmx512M 
> *-Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
> -Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
> --executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
> app-20150605143259- --worker-url 
> akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker






[jira] [Updated] (SPARK-8129) Securely pass auth secret to executors in standalone cluster mode

2015-06-05 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-8129:
-
Description: 
Currently, when authentication is turned on, Worker passes auth secret to 
executors (also drivers in cluster mode) as java options on the command line, 
which isn't secure. The passed secret can be seen by anyone running 'ps' 
command, e.g.,


bq.  501 94787 94734   0  2:32PM ?? 0:00.78 
/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
-cp 
/Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
 -Xms512M -Xmx512M 
*-Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
-Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
--executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
app-20150605143259- --worker-url 
akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker



  was:
Currently, when authentication is turned on, Worker passes auth secret to 
executors (also drivers in cluster mode) as java options on the command line, 
which isn't secure. The passed secret can be seen by anyone running 'ps' 
command, e.g.,


  501 94787 94734   0  2:32PM ?? 0:00.78 
/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
-cp 
/Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
 -Xms512M -Xmx512M 
-*Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
-Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
--executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
app-20150605143259- --worker-url 
akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker
 



> Securely pass auth secret to executors in standalone cluster mode
> -
>
> Key: SPARK-8129
> URL: https://issues.apache.org/jira/browse/SPARK-8129
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Spark Core
>Reporter: Kan Zhang
>Priority: Critical
>
> Currently, when authentication is turned on, Worker passes auth secret to 
> executors (also drivers in cluster mode) as java options on the command line, 
> which isn't secure. The passed secret can be seen by anyone running 'ps' 
> command, e.g.,
> bq.  501 94787 94734   0  2:32PM ?? 0:00.78 
> /Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
> -cp 
> /Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
>  -Xms512M -Xmx512M 
> *-Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
> -Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
> --executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
> app-20150605143259- --worker-url 
> akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker






[jira] [Updated] (SPARK-8129) Securely pass auth secret to executors in standalone cluster mode

2015-06-05 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-8129:
-
Description: 
Currently, when authentication is turned on, Worker passes auth secret to 
executors (also drivers in cluster mode) as java options on the command line, 
which isn't secure. The passed secret can be seen by anyone running 'ps' 
command, e.g.,


  501 94787 94734   0  2:32PM ?? 0:00.78 
/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
-cp 
/Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
 -Xms512M -Xmx512M 
-*Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
-Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
--executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
app-20150605143259- --worker-url 
akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker
 


  was:
Currently, when authentication is turned on, Worker passes auth secret to 
executors (also drivers in cluster mode) as java options on the command line, 
which isn't secure. The passed secret can be seen by anyone running 'ps' 
command, e.g.,

```
ps -ef

..

  501 94787 94734   0  2:32PM ?? 0:00.78 
/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
-cp 
/Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
 -Xms512M -Xmx512M 
-*Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
-Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
--executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
app-20150605143259- --worker-url 
akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker
``` 



> Securely pass auth secret to executors in standalone cluster mode
> -
>
> Key: SPARK-8129
> URL: https://issues.apache.org/jira/browse/SPARK-8129
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Spark Core
>Reporter: Kan Zhang
>Priority: Critical
>
> Currently, when authentication is turned on, Worker passes auth secret to 
> executors (also drivers in cluster mode) as java options on the command line, 
> which isn't secure. The passed secret can be seen by anyone running 'ps' 
> command, e.g.,
>   501 94787 94734   0  2:32PM ?? 0:00.78 
> /Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
> -cp 
> /Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
>  -Xms512M -Xmx512M 
> -*Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
> -Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
> --executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
> app-20150605143259- --worker-url 
> akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker
>  






[jira] [Created] (SPARK-8129) Securely pass auth secret to executors in standalone cluster mode

2015-06-05 Thread Kan Zhang (JIRA)
Kan Zhang created SPARK-8129:


 Summary: Securely pass auth secret to executors in standalone 
cluster mode
 Key: SPARK-8129
 URL: https://issues.apache.org/jira/browse/SPARK-8129
 Project: Spark
  Issue Type: New Feature
  Components: Deploy, Spark Core
Reporter: Kan Zhang
Priority: Critical


Currently, when authentication is turned on, Worker passes auth secret to 
executors (also drivers in cluster mode) as java options on the command line, 
which isn't secure. The passed secret can be seen by anyone running 'ps' 
command, e.g.,

```
ps -ef

..

  501 94787 94734   0  2:32PM ?? 0:00.78 
/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
-cp 
/Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
 -Xms512M -Xmx512M 
-*Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
-Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
--executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
app-20150605143259- --worker-url 
akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker
``` 







[jira] [Updated] (SPARK-1475) Drain event logging queue before stopping event logger

2014-09-22 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1475:
-
Summary: Drain event logging queue before stopping event logger  (was: 
Draining event logging queue before stopping event logger)

> Drain event logging queue before stopping event logger
> --
>
> Key: SPARK-1475
> URL: https://issues.apache.org/jira/browse/SPARK-1475
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>Priority: Blocker
> Fix For: 1.0.0
>
>
> When stopping SparkListenerBus, its event queue needs to be drained. And this 
> needs to happen before event logger is stopped. Otherwise, any event still 
> waiting to be processed in the queue may be lost and consequently event log 
> file may be incomplete. 
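
A generic sketch of the drain-before-stop pattern described above (plain Python 
threads and a queue, not Spark's SparkListenerBus): the consumer is stopped only 
after every queued event has been processed, so nothing is missing from the log.

{code}
import queue
import threading

events = queue.Queue()
log = []

def listener():
    while True:
        event = events.get()
        if event is None:      # stop sentinel, enqueued after all real events
            break
        log.append(event)      # stand-in for writing to the event log file

t = threading.Thread(target=listener)
t.start()

for i in range(5):
    events.put(i)

events.put(None)  # request shutdown only after all events are queued
t.join()          # by the time this returns, the queue has been drained
print(log)        # [0, 1, 2, 3, 4]
{code}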






[jira] [Updated] (SPARK-2736) Create PySpark RDD from Apache Avro File

2014-07-30 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-2736:
-

Summary: Create PySpark RDD from Apache Avro File  (was: Create Pyspark RDD 
from Apache Avro File)

> Create PySpark RDD from Apache Avro File
> 
>
> Key: SPARK-2736
> URL: https://issues.apache.org/jira/browse/SPARK-2736
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Eric Garcia
>Assignee: Kan Zhang
>Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> There is a partially working example Avro Converter at this pull request: 
> https://github.com/apache/spark/pull/1536
> It does not fully implement all types in the Avro format and could be cleaned 
> up a little bit.





[jira] [Updated] (SPARK-2736) Create Pyspark RDD from Apache Avro File

2014-07-30 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-2736:
-

Summary: Create Pyspark RDD from Apache Avro File  (was: Ceeate Pyspark RDD 
from Apache Avro File)

> Create Pyspark RDD from Apache Avro File
> 
>
> Key: SPARK-2736
> URL: https://issues.apache.org/jira/browse/SPARK-2736
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Eric Garcia
>Assignee: Kan Zhang
>Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> There is a partially working example Avro Converter at this pull request: 
> https://github.com/apache/spark/pull/1536
> It does not fully implement all types in the Avro format and could be cleaned 
> up a little bit.





[jira] [Commented] (SPARK-1687) Support NamedTuples in RDDs

2014-07-28 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077019#comment-14077019
 ] 

Kan Zhang commented on SPARK-1687:
--

Sure, please go ahead and feel free to take over this JIRA.

> Support NamedTuples in RDDs
> ---
>
> Key: SPARK-1687
> URL: https://issues.apache.org/jira/browse/SPARK-1687
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.0.0
> Environment: Spark version 1.0.0-SNAPSHOT
> Python 2.7.5
>Reporter: Pat McDonough
>Assignee: Kan Zhang
>
> Add Support for NamedTuples in RDDs. Some sample code is below, followed by 
> the current error that comes from it.
> Based on a quick conversation with [~ahirreddy], 
> [Dill|https://github.com/uqfoundation/dill] might be a good solution here.
> {code}
> In [26]: from collections import namedtuple
> ...
> In [33]: Person = namedtuple('Person', 'id firstName lastName')
> In [34]: jon = Person(1, "Jon", "Doe")
> In [35]: jane = Person(2, "Jane", "Doe")
> In [36]: theDoes = sc.parallelize((jon, jane))
> In [37]: theDoes.collect()
> Out[37]: 
> [Person(id=1, firstName='Jon', lastName='Doe'),
>  Person(id=2, firstName='Jane', lastName='Doe')]
> In [38]: theDoes.count()
> PySpark worker failed with exception:
> PySpark worker failed with exception:
> Traceback (most recent call last):
>   File "/Users/pat/Projects/spark/python/pyspark/worker.py", line 77, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in 
> pipeline_func
> return func(split, prev_func(split, iterator))
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in 
> pipeline_func
> return func(split, prev_func(split, iterator))
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 283, in func
> def func(s, iterator): return f(iterator)
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in 
> 
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in 
> 
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 129, 
> in load_stream
> yield self._read_with_length(stream)
>   File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 146, 
> in _read_with_length
> return self.loads(obj)
> AttributeError: 'module' object has no attribute 'Person'
> Traceback (most recent call last):
>   File "/Users/pat/Projects/spark/python/pyspark/worker.py", line 77, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in 
> pipeline_func
> return func(split, prev_func(split, iterator))
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in 
> pipeline_func
> return func(split, prev_func(split, iterator))
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 283, in func
> def func(s, iterator): return f(iterator)
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in 
> 
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in 
> 
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 129, 
> in load_stream
> yield self._read_with_length(stream)
>   File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 146, 
> in _read_with_length
> return self.loads(obj)
> AttributeError: 'module' object has no attribute 'Person'
> 14/04/30 14:43:53 ERROR Executor: Exception in task ID 23
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/Users/pat/Projects/spark/python/pyspark/worker.py", line 77, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in 
> pipeline_func
> return func(split, prev_func(split, iterator))
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in 
> pipeline_func
> return func(split, prev_func(split, iterator))
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 283, in func
> def func(s, iterator): return f(iterator)
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in 
> 
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in 
> 
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "/Users/pat/Projects/spark/python/pyspark/serializers.py", l
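
The AttributeError in the quoted traceback follows from how the standard pickler 
handles classes: it serializes a namedtuple instance as a reference (module name plus 
class name), so a worker process that cannot import the class fails to unpickle it. 
A minimal sketch outside Spark, using only the standard library:

{code}
import pickle
from collections import namedtuple

# Pickling stores only a reference to the class, not the class itself.
Person = namedtuple("Person", "id firstName lastName")
data = pickle.dumps(Person(1, "Jon", "Doe"))

# Simulate a process that cannot resolve the class by deleting it
# before unpickling.
del Person
try:
    pickle.loads(data)
except AttributeError as err:
    print(err)  # the class 'Person' cannot be found on __main__
{code}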

[jira] [Commented] (SPARK-2141) Add sc.getPersistentRDDs() to PySpark

2014-07-28 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076262#comment-14076262
 ] 

Kan Zhang commented on SPARK-2141:
--

Hi [~nchammas], we are debating potential use cases for this feature. It would be 
great if you could provide your input (via the link above). Thanks.

> Add sc.getPersistentRDDs() to PySpark
> -
>
> Key: SPARK-2141
> URL: https://issues.apache.org/jira/browse/SPARK-2141
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.0.0
>Reporter: Nicholas Chammas
>Assignee: Kan Zhang
>
> PySpark does not appear to have {{sc.getPersistentRDDs()}}.





[jira] [Commented] (SPARK-1866) Closure cleaner does not null shadowed fields when outer scope is referenced

2014-07-14 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061501#comment-14061501
 ] 

Kan Zhang commented on SPARK-1866:
--

My previous comment may have been hard to read, so let me try again:

The root cause is that when the class for the line {{sc.parallelize()...}} is 
generated, the variable {{instances}} defined in the preceding line gets imported by 
the parser (since it thinks {{instances}} is referenced by this line) and becomes 
part of the outer object for the closure. This outer object is referenced by the 
closure through the variable {{x}}. However, we currently choose not to null 
(or clone) outer objects when we clean closures, since we can't be sure it is 
safe to do so (see commit 
[f346e64|https://github.com/apache/spark/commit/f346e64637fa4f9bd95fcc966caa496bea5feca0]).
 As a result, {{instances}} is not nulled by the ClosureCleaner even though it is 
not actually used within the closure. This type of exception will pop up 
whenever a closure references outer objects that are not serializable.


> Closure cleaner does not null shadowed fields when outer scope is referenced
> 
>
> Key: SPARK-1866
> URL: https://issues.apache.org/jira/browse/SPARK-1866
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Assignee: Kan Zhang
>Priority: Critical
> Fix For: 1.0.1, 1.1.0
>
>
> Take the following example:
> {code}
> val x = 5
> val instances = new org.apache.hadoop.fs.Path("/") /* non-serializable */
> sc.parallelize(0 until 10).map { _ =>
>   val instances = 3
>   (instances, x)
> }.collect
> {code}
> This produces a "java.io.NotSerializableException: 
> org.apache.hadoop.fs.Path", despite the fact that the outer instances is not 
> actually used within the closure. If you change the name of the outer 
> variable instances to something else, the code executes correctly, indicating 
> that it is the fact that the two variables share a name that causes the issue.
> Additionally, if the outer scope is not used (i.e., we do not reference "x" 
> in the above example), the issue does not appear.





[jira] [Issue Comment Deleted] (SPARK-1866) Closure cleaner does not null shadowed fields when outer scope is referenced

2014-07-14 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1866:
-

Comment: was deleted

(was: Unfortunately this type of error will pop up whenever a closure 
references user objects (any objects other than nested closure objects) that 
are not serializable. Our current approach is we don't clone (or null) user 
objects since we can't be sure it is safe to do so (see commit 
f346e64637fa4f9bd95fcc966caa496bea5feca0). 

Spark shell synthesizes a class for each line. In this case, the class for the 
closure line imports {{instances}} as a field (since the parser thinks it is 
referenced by this line) and the corresponding line object is referenced by the 
closure via {{x}}. 

My take on this is advising users to avoid name collisions as a workaround.)

> Closure cleaner does not null shadowed fields when outer scope is referenced
> 
>
> Key: SPARK-1866
> URL: https://issues.apache.org/jira/browse/SPARK-1866
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Assignee: Kan Zhang
>Priority: Critical
> Fix For: 1.0.1, 1.1.0
>
>
> Take the following example:
> {code}
> val x = 5
> val instances = new org.apache.hadoop.fs.Path("/") /* non-serializable */
> sc.parallelize(0 until 10).map { _ =>
>   val instances = 3
>   (instances, x)
> }.collect
> {code}
> This produces a "java.io.NotSerializableException: 
> org.apache.hadoop.fs.Path", despite the fact that the outer instances is not 
> actually used within the closure. If you change the name of the outer 
> variable instances to something else, the code executes correctly, indicating 
> that it is the fact that the two variables share a name that causes the issue.
> Additionally, if the outer scope is not used (i.e., we do not reference "x" 
> in the above example), the issue does not appear.





[jira] [Commented] (SPARK-2024) Add saveAsSequenceFile to PySpark

2014-07-08 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055912#comment-14055912
 ] 

Kan Zhang commented on SPARK-2024:
--

https://github.com/apache/spark/pull/1338

> Add saveAsSequenceFile to PySpark
> -
>
> Key: SPARK-2024
> URL: https://issues.apache.org/jira/browse/SPARK-2024
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Matei Zaharia
>Assignee: Kan Zhang
>
> After SPARK-1416 we will be able to read SequenceFiles from Python, but it 
> remains to write them.
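
A minimal usage sketch (assuming a running SparkContext {{sc}} and that the feature 
from the linked PR is available; the output path is a placeholder): write a pair RDD 
out as a SequenceFile and read it back.

{code}
# Assumes an existing SparkContext `sc`; the path below is a placeholder.
pairs = sc.parallelize([(1, "a"), (2, "b")])
pairs.saveAsSequenceFile("/tmp/demo-seqfile")
print(sorted(sc.sequenceFile("/tmp/demo-seqfile").collect()))  # [(1, 'a'), (2, 'b')]
{code}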





[jira] [Commented] (SPARK-2010) Support for nested data in PySpark SQL

2014-07-03 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052068#comment-14052068
 ] 

Kan Zhang commented on SPARK-2010:
--

Sounds reasonable. A named tuple is a better fit than a dictionary for the struct 
type. Presumably it was the lack of pickling support for named tuples that made us 
resort to dictionaries for the Python schema definition. Nested dictionaries, 
however, should be treated as the map type.
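
A minimal sketch of the distinction being drawn (plain Python, not the PySpark SQL 
API): a named tuple has a fixed set of named fields, which matches a struct type, 
while a dictionary maps arbitrary keys to values, which matches a map type.

{code}
from collections import namedtuple

# Struct-like record: a fixed, named set of fields (one per column).
Person = namedtuple("Person", ["id", "name"])
row = Person(1, "Jon")

# Map-like value: arbitrary keys mapping to values; keys are data.
scores = {"math": 95, "history": 82}

print(row.name)        # access by field name, the schema is fixed
print(scores["math"])  # lookup by key, keys can vary per row
{code}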

> Support for nested data in PySpark SQL
> --
>
> Key: SPARK-2010
> URL: https://issues.apache.org/jira/browse/SPARK-2010
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Kan Zhang
>






[jira] [Updated] (SPARK-2130) Clarify PySpark docs for RDD.getStorageLevel

2014-06-16 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-2130:
-

Component/s: (was: Documentation)
 PySpark

> Clarify PySpark docs for RDD.getStorageLevel
> 
>
> Key: SPARK-2130
> URL: https://issues.apache.org/jira/browse/SPARK-2130
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.0.0
>Reporter: Nicholas Chammas
>Assignee: Kan Zhang
>Priority: Minor
>
> The [PySpark docs for 
> RDD.getStorageLevel|http://spark.apache.org/docs/1.0.0/api/python/pyspark.rdd.RDD-class.html#getStorageLevel]
>  are unclear.
> {quote}
> >>> rdd1 = sc.parallelize([1,2]) 
> >>> rdd1.getStorageLevel() 
> StorageLevel(False, False, False, False, 1)
> {quote}
> What do the 5 values of "False, False, False, False, 1" mean?





[jira] [Commented] (SPARK-2010) Support for nested data in PySpark SQL

2014-06-16 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032951#comment-14032951
 ] 

Kan Zhang commented on SPARK-2010:
--

Sure.

I'm not sure if you saw it, but I did post a reply to your question in the PR and 
raised a couple of questions of my own. Please take a look when you get a chance. Thanks!

> Support for nested data in PySpark SQL
> --
>
> Key: SPARK-2010
> URL: https://issues.apache.org/jira/browse/SPARK-2010
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Kan Zhang
>






[jira] [Commented] (SPARK-2130) Clarify PySpark docs for RDD.getStorageLevel

2014-06-16 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032942#comment-14032942
 ] 

Kan Zhang commented on SPARK-2130:
--

https://github.com/apache/spark/pull/1096

> Clarify PySpark docs for RDD.getStorageLevel
> 
>
> Key: SPARK-2130
> URL: https://issues.apache.org/jira/browse/SPARK-2130
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.0.0
>Reporter: Nicholas Chammas
>Assignee: Kan Zhang
>Priority: Minor
>
> The [PySpark docs for 
> RDD.getStorageLevel|http://spark.apache.org/docs/1.0.0/api/python/pyspark.rdd.RDD-class.html#getStorageLevel]
>  are unclear.
> {quote}
> >>> rdd1 = sc.parallelize([1,2]) 
> >>> rdd1.getStorageLevel() 
> StorageLevel(False, False, False, False, 1)
> {quote}
> What do the 5 values of "False, False, False, False, 1" mean?





[jira] [Resolved] (SPARK-2013) Add Python pickleFile to programming guide

2014-06-15 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang resolved SPARK-2013.
--

Resolution: Fixed

> Add Python pickleFile to programming guide
> --
>
> Key: SPARK-2013
> URL: https://issues.apache.org/jira/browse/SPARK-2013
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Reporter: Matei Zaharia
>Assignee: Kan Zhang
>Priority: Trivial
> Fix For: 1.1.0
>
>
> Should be added in the Python version of 
> http://spark.apache.org/docs/latest/programming-guide.html#external-datasets.





[jira] [Commented] (SPARK-2141) Add sc.getPersistentRDDs() to PySpark

2014-06-13 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031356#comment-14031356
 ] 

Kan Zhang commented on SPARK-2141:
--

https://github.com/apache/spark/pull/1082

> Add sc.getPersistentRDDs() to PySpark
> -
>
> Key: SPARK-2141
> URL: https://issues.apache.org/jira/browse/SPARK-2141
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.0.0
>Reporter: Nicholas Chammas
>Assignee: Kan Zhang
>
> PySpark does not appear to have {{sc.getPersistentRDDs()}}.





[jira] [Updated] (SPARK-2079) Support batching when serializing SchemaRDD to Python

2014-06-12 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-2079:
-

Summary: Support batching when serializing SchemaRDD to Python  (was: 
Removing unnecessary wrapping when serializing SchemaRDD to Python)

> Support batching when serializing SchemaRDD to Python
> -
>
> Key: SPARK-2079
> URL: https://issues.apache.org/jira/browse/SPARK-2079
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>
> Finishing the TODO:
> {code}
>   private[sql] def javaToPython: JavaRDD[Array[Byte]] = {
> val fieldNames: Seq[String] = 
> this.queryExecution.analyzed.output.map(_.name)
> this.mapPartitions { iter =>
>   val pickle = new Pickler
>   iter.map { row =>
> val map: JMap[String, Any] = new java.util.HashMap
> // TODO: We place the map in an ArrayList so that the object is 
> pickled to a List[Dict].
> // Ideally we should be able to pickle an object directly into a 
> Python collection so we
> // don't have to create an ArrayList every time.
> val arr: java.util.ArrayList[Any] = new java.util.ArrayList
> row.zip(fieldNames).foreach { case (obj, name) =>
>   map.put(name, obj)
> }
> arr.add(map)
> pickle.dumps(arr)
>   }
> }
>   }
> {code}





[jira] [Commented] (SPARK-2010) Support for nested data in PySpark SQL

2014-06-10 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027127#comment-14027127
 ] 

Kan Zhang commented on SPARK-2010:
--

PR: https://github.com/apache/spark/pull/1041

> Support for nested data in PySpark SQL
> --
>
> Key: SPARK-2010
> URL: https://issues.apache.org/jira/browse/SPARK-2010
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Kan Zhang
> Fix For: 1.1.0
>
>






[jira] [Updated] (SPARK-2079) Removing unnecessary wrapping when serializing SchemaRDD to Python

2014-06-09 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-2079:
-

Summary: Removing unnecessary wrapping when serializing SchemaRDD to Python 
 (was: Skip unnecessary wrapping in List when serializing SchemaRDD to Python)

> Removing unnecessary wrapping when serializing SchemaRDD to Python
> --
>
> Key: SPARK-2079
> URL: https://issues.apache.org/jira/browse/SPARK-2079
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>
> Finishing the TODO:
> {code}
>   private[sql] def javaToPython: JavaRDD[Array[Byte]] = {
> val fieldNames: Seq[String] = 
> this.queryExecution.analyzed.output.map(_.name)
> this.mapPartitions { iter =>
>   val pickle = new Pickler
>   iter.map { row =>
> val map: JMap[String, Any] = new java.util.HashMap
> // TODO: We place the map in an ArrayList so that the object is 
> pickled to a List[Dict].
> // Ideally we should be able to pickle an object directly into a 
> Python collection so we
> // don't have to create an ArrayList every time.
> val arr: java.util.ArrayList[Any] = new java.util.ArrayList
> row.zip(fieldNames).foreach { case (obj, name) =>
>   map.put(name, obj)
> }
> arr.add(map)
> pickle.dumps(arr)
>   }
> }
>   }
> {code}





[jira] [Commented] (SPARK-2079) Skip unnecessary wrapping in List when serializing SchemaRDD to Python

2014-06-09 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025363#comment-14025363
 ] 

Kan Zhang commented on SPARK-2079:
--

PR: https://github.com/apache/spark/pull/1023

> Skip unnecessary wrapping in List when serializing SchemaRDD to Python
> --
>
> Key: SPARK-2079
> URL: https://issues.apache.org/jira/browse/SPARK-2079
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>
> Finishing the TODO:
> {code}
>   private[sql] def javaToPython: JavaRDD[Array[Byte]] = {
> val fieldNames: Seq[String] = 
> this.queryExecution.analyzed.output.map(_.name)
> this.mapPartitions { iter =>
>   val pickle = new Pickler
>   iter.map { row =>
> val map: JMap[String, Any] = new java.util.HashMap
> // TODO: We place the map in an ArrayList so that the object is 
> pickled to a List[Dict].
> // Ideally we should be able to pickle an object directly into a 
> Python collection so we
> // don't have to create an ArrayList every time.
> val arr: java.util.ArrayList[Any] = new java.util.ArrayList
> row.zip(fieldNames).foreach { case (obj, name) =>
>   map.put(name, obj)
> }
> arr.add(map)
> pickle.dumps(arr)
>   }
> }
>   }
> {code}





[jira] [Updated] (SPARK-2079) Skip unnecessary wrapping in List when serializing SchemaRDD to Python

2014-06-09 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-2079:
-

Description: 
Finishing the TODO:
{code}
  private[sql] def javaToPython: JavaRDD[Array[Byte]] = {
    val fieldNames: Seq[String] = this.queryExecution.analyzed.output.map(_.name)
    this.mapPartitions { iter =>
      val pickle = new Pickler
      iter.map { row =>
        val map: JMap[String, Any] = new java.util.HashMap
        // TODO: We place the map in an ArrayList so that the object is pickled
        // to a List[Dict]. Ideally we should be able to pickle an object directly
        // into a Python collection so we don't have to create an ArrayList every time.
        val arr: java.util.ArrayList[Any] = new java.util.ArrayList
        row.zip(fieldNames).foreach { case (obj, name) =>
          map.put(name, obj)
        }
        arr.add(map)
        pickle.dumps(arr)
      }
    }
  }
{code}

> Skip unnecessary wrapping in List when serializing SchemaRDD to Python
> --
>
> Key: SPARK-2079
> URL: https://issues.apache.org/jira/browse/SPARK-2079
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>
> Finishing the TODO:
> {code}
>   private[sql] def javaToPython: JavaRDD[Array[Byte]] = {
> val fieldNames: Seq[String] = 
> this.queryExecution.analyzed.output.map(_.name)
> this.mapPartitions { iter =>
>   val pickle = new Pickler
>   iter.map { row =>
> val map: JMap[String, Any] = new java.util.HashMap
> // TODO: We place the map in an ArrayList so that the object is 
> pickled to a List[Dict].
> // Ideally we should be able to pickle an object directly into a 
> Python collection so we
> // don't have to create an ArrayList every time.
> val arr: java.util.ArrayList[Any] = new java.util.ArrayList
> row.zip(fieldNames).foreach { case (obj, name) =>
>   map.put(name, obj)
> }
> arr.add(map)
> pickle.dumps(arr)
>   }
> }
>   }
> {code}





[jira] [Created] (SPARK-2079) Skip unnecessary wrapping in List when serializing SchemaRDD to Python

2014-06-09 Thread Kan Zhang (JIRA)
Kan Zhang created SPARK-2079:


 Summary: Skip unnecessary wrapping in List when serializing 
SchemaRDD to Python
 Key: SPARK-2079
 URL: https://issues.apache.org/jira/browse/SPARK-2079
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 1.0.0
Reporter: Kan Zhang








[jira] [Commented] (SPARK-937) Executors that exit cleanly should not have KILLED status

2014-06-05 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019309#comment-14019309
 ] 

Kan Zhang commented on SPARK-937:
-

PR: https://github.com/apache/spark/pull/306

> Executors that exit cleanly should not have KILLED status
> -
>
> Key: SPARK-937
> URL: https://issues.apache.org/jira/browse/SPARK-937
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 0.7.3
>Reporter: Aaron Davidson
>Assignee: Kan Zhang
>Priority: Critical
> Fix For: 1.1.0
>
>
> This is an unintuitive and overloaded status message when Executors are 
> killed during normal termination of an application.





[jira] [Issue Comment Deleted] (SPARK-1118) Executor state shows as KILLED even the application is finished normally

2014-06-05 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1118:
-

Comment: was deleted

(was: PR: https://github.com/apache/spark/pull/306
)

> Executor state shows as KILLED even the application is finished normally
> 
>
> Key: SPARK-1118
> URL: https://issues.apache.org/jira/browse/SPARK-1118
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Nan Zhu
>Assignee: Kan Zhang
>Priority: Minor
> Fix For: 1.0.0
>
>
> This seems weird, ExecutorState has no option of FINISHED, a terminated 
> executor can only be KILLED, FAILED, LOST





[jira] [Issue Comment Deleted] (SPARK-937) Executors that exit cleanly should not have KILLED status

2014-06-05 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-937:


Comment: was deleted

(was: Hi Aaron, are you still working on this one? If not, could you assign it 
to me? I have a PR for SPARK-1118 (closed as a duplicate of this JIRA) that I 
could re-submit for this one. If you are still working on it or plan to, feel 
free to pick whatever might be useful to you 
https://github.com/apache/spark/pull/306)

> Executors that exit cleanly should not have KILLED status
> -
>
> Key: SPARK-937
> URL: https://issues.apache.org/jira/browse/SPARK-937
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 0.7.3
>Reporter: Aaron Davidson
>Assignee: Kan Zhang
>Priority: Critical
> Fix For: 1.1.0
>
>
> This is an unintuitive and overloaded status message when Executors are 
> killed during normal termination of an application.





[jira] [Commented] (SPARK-2013) Add Python pickleFile to programming guide

2014-06-05 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018982#comment-14018982
 ] 

Kan Zhang commented on SPARK-2013:
--

PR: https://github.com/apache/spark/pull/983

> Add Python pickleFile to programming guide
> --
>
> Key: SPARK-2013
> URL: https://issues.apache.org/jira/browse/SPARK-2013
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Reporter: Matei Zaharia
>Assignee: Kan Zhang
>Priority: Trivial
> Fix For: 1.1.0
>
>
> Should be added in the Python version of 
> http://spark.apache.org/docs/latest/programming-guide.html#external-datasets.





[jira] [Issue Comment Deleted] (SPARK-2024) Add saveAsSequenceFile to PySpark

2014-06-05 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-2024:
-

Comment: was deleted

(was: You meant SPARK-1416?)

> Add saveAsSequenceFile to PySpark
> -
>
> Key: SPARK-2024
> URL: https://issues.apache.org/jira/browse/SPARK-2024
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Matei Zaharia
>
> After SPARK-1416 we will be able to read SequenceFiles from Python, but it 
> remains to write them.





[jira] [Commented] (SPARK-2024) Add saveAsSequenceFile to PySpark

2014-06-04 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018485#comment-14018485
 ] 

Kan Zhang commented on SPARK-2024:
--

You meant SPARK-1416?

> Add saveAsSequenceFile to PySpark
> -
>
> Key: SPARK-2024
> URL: https://issues.apache.org/jira/browse/SPARK-2024
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Matei Zaharia
>
> After SPARK-1414 we will be able to read SequenceFiles from Python, but it 
> remains to write them.





[jira] [Issue Comment Deleted] (SPARK-1817) RDD zip erroneous when partitions do not divide RDD count

2014-06-04 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1817:
-

Comment: was deleted

(was: PR: https://github.com/apache/spark/pull/760)

> RDD zip erroneous when partitions do not divide RDD count
> -
>
> Key: SPARK-1817
> URL: https://issues.apache.org/jira/browse/SPARK-1817
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0, 1.0.0
>Reporter: Michael Malak
>Assignee: Kan Zhang
> Fix For: 1.1.0
>
>
> Example:
> scala> sc.parallelize(1L to 2L,4).zip(sc.parallelize(11 to 12,4)).collect
> res1: Array[(Long, Int)] = Array((2,11))
> But more generally, it's whenever the number of partitions does not evenly 
> divide the total number of elements in the RDD.
> See https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1817) RDD zip erroneous when partitions do not divide RDD count

2014-06-04 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017812#comment-14017812
 ] 

Kan Zhang commented on SPARK-1817:
--

There are 2 issues related to this bug. One is that we partition numeric ranges 
(e.g., Long and Double ranges) differently from other types of sequences (i.e., 
at different indexes). This causes elements to be dropped when zipping with 
numeric ranges, since we zip by partition and partitions for numeric ranges may 
have different sizes from those of other sequences (even if the total length and 
the number of partitions are the same). This is fixed in SPARK-1837. One caveat: 
partitioning Double ranges still doesn't work properly due to a Scala bug that 
breaks {{take}} and {{drop}} on Double ranges 
(https://issues.scala-lang.org/browse/SI-8518).

The other issue is that instead of dropping elements silently, we should throw an 
error during zipping when we find that the partition sizes of the 2 sequences are 
not the same. This is fixed by https://github.com/apache/spark/pull/944.
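
For illustration, here is a minimal sketch (plain Scala, not Spark's actual zip 
implementation; the name {{zipStrict}} is made up for this sketch) of the 
fail-fast behavior proposed in that PR: zip two partition iterators and throw as 
soon as one side runs out before the other, instead of silently dropping the 
extra elements.

{code}
// Minimal sketch only (not Spark's implementation): zip two partition iterators and
// fail fast when one partition has more elements than the other.
def zipStrict[A, B](left: Iterator[A], right: Iterator[B]): Iterator[(A, B)] =
  new Iterator[(A, B)] {
    def hasNext: Boolean = (left.hasNext, right.hasNext) match {
      case (true, true)   => true
      case (false, false) => false
      case _ => throw new IllegalArgumentException(
        "Can only zip partitions with the same number of elements")
    }
    def next(): (A, B) = (left.next(), right.next())
  }
{code}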

> RDD zip erroneous when partitions do not divide RDD count
> -
>
> Key: SPARK-1817
> URL: https://issues.apache.org/jira/browse/SPARK-1817
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0, 1.0.0
>Reporter: Michael Malak
>Assignee: Kan Zhang
> Fix For: 1.1.0
>
>
> Example:
> scala> sc.parallelize(1L to 2L,4).zip(sc.parallelize(11 to 12,4)).collect
> res1: Array[(Long, Int)] = Array((2,11))
> But more generally, it's whenever the number of partitions does not evenly 
> divide the total number of elements in the RDD.
> See https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1866) Closure cleaner does not null shadowed fields when outer scope is referenced

2014-05-29 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013264#comment-14013264
 ] 

Kan Zhang edited comment on SPARK-1866 at 5/30/14 6:21 AM:
---

Unfortunately this type of error will pop up whenever a closure references user 
objects (any objects other than nested closure objects) that are not 
serializable. Our current approach is we don't clone (or null) user objects 
since we can't be sure it is safe to do so (see commit 
f346e64637fa4f9bd95fcc966caa496bea5feca0). 

Spark shell synthesizes a class for each line. In this case, the class for the 
closure line imports {{instances}} as a field (since the parser thinks it is 
referenced by this line) and the corresponding line object is referenced by the 
closure via {{x}}. 

My take on this is advising users to avoid name collisions as a workaround.


was (Author: kzhang):
Unfortunately this type of error will pop up whenever a closure references user 
objects (any objects other than nested closure objects) that are not 
serializable. Our current approach is we don't clone (or null) user objects 
since we can't be sure it is safe to do so (see commit 
f346e64637fa4f9bd95fcc966caa496bea5feca0). 

Spark shell (REPL) synthesizes a class for each line. In this case, the class 
for the closure line imports {{instances}} as a field (since the parser thinks 
it is referenced by this line) and the corresponding line object is referenced 
by the closure via {{x}}. 

My take on this is advising users to avoid name collisions as a workaround.

> Closure cleaner does not null shadowed fields when outer scope is referenced
> 
>
> Key: SPARK-1866
> URL: https://issues.apache.org/jira/browse/SPARK-1866
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Assignee: Kan Zhang
>Priority: Critical
> Fix For: 1.1.0, 1.0.1
>
>
> Take the following example:
> {code}
> val x = 5
> val instances = new org.apache.hadoop.fs.Path("/") /* non-serializable */
> sc.parallelize(0 until 10).map { _ =>
>   val instances = 3
>   (instances, x)
> }.collect
> {code}
> This produces a "java.io.NotSerializableException: 
> org.apache.hadoop.fs.Path", despite the fact that the outer instances is not 
> actually used within the closure. If you change the name of the outer 
> variable instances to something else, the code executes correctly, indicating 
> that it is the fact that the two variables share a name that causes the issue.
> Additionally, if the outer scope is not used (i.e., we do not reference "x" 
> in the above example), the issue does not appear.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1866) Closure cleaner does not null shadowed fields when outer scope is referenced

2014-05-29 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013264#comment-14013264
 ] 

Kan Zhang edited comment on SPARK-1866 at 5/30/14 6:19 AM:
---

Unfortunately this type of error will pop up whenever a closure references user 
objects (any objects other than nested closure objects) that are not 
serializable. Our current approach is we don't clone (or null) user objects 
since we can't be sure it is safe to do so (see commit 
f346e64637fa4f9bd95fcc966caa496bea5feca0). 

Spark shell (REPL) synthesizes a class for each line. In this case, the class 
for the closure line imports {{instances}} as a field (since the parser thinks 
it is referenced by this line) and the corresponding line object is referenced 
by the closure via {{x}}. 

My take on this is advising users to avoid name collisions as a workaround.


was (Author: kzhang):
Unfortunately this type of error will pop up whenever a closure references user 
objects (any objects other than nested closure objects) that are not 
serializable. Our current approach is we don't clone (or nulling) user objects 
since we can't be sure it is safe to do so (see commit 
f346e64637fa4f9bd95fcc966caa496bea5feca0). 

Spark shell (REPL) synthesizes a class for each line. In this case, the class 
for the closure line imports {{instances}} as a field (since the parser thinks 
it is referenced by this line) and the corresponding line object is referenced 
by the closure via {{x}}. 

My take on this is to advise users to avoid name collisions as a workaround.

> Closure cleaner does not null shadowed fields when outer scope is referenced
> 
>
> Key: SPARK-1866
> URL: https://issues.apache.org/jira/browse/SPARK-1866
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Assignee: Kan Zhang
>Priority: Critical
> Fix For: 1.1.0, 1.0.1
>
>
> Take the following example:
> {code}
> val x = 5
> val instances = new org.apache.hadoop.fs.Path("/") /* non-serializable */
> sc.parallelize(0 until 10).map { _ =>
>   val instances = 3
>   (instances, x)
> }.collect
> {code}
> This produces a "java.io.NotSerializableException: 
> org.apache.hadoop.fs.Path", despite the fact that the outer instances is not 
> actually used within the closure. If you change the name of the outer 
> variable instances to something else, the code executes correctly, indicating 
> that it is the fact that the two variables share a name that causes the issue.
> Additionally, if the outer scope is not used (i.e., we do not reference "x" 
> in the above example), the issue does not appear.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1866) Closure cleaner does not null shadowed fields when outer scope is referenced

2014-05-29 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013264#comment-14013264
 ] 

Kan Zhang commented on SPARK-1866:
--

Unfortunately this type of error will pop up whenever a closure references user 
objects (any objects other than nested closure objects) that are not 
serializable. Our current approach is we don't clone (or nulling) user objects 
since we can't be sure it is safe to do so (see commit 
f346e64637fa4f9bd95fcc966caa496bea5feca0). 

Spark shell (REPL) synthesizes a class for each line. In this case, the class 
for the closure line imports {{instances}} as a field (since the parser thinks 
it is referenced by this line) and the corresponding line object is referenced 
by the closure via {{x}}. 

My take on this is to advise users to avoid name collisions as a workaround.
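
For illustration, here is a sketch of that rename workaround, based on the 
reproducer in the issue description below ({{hadoopPath}} is just an arbitrary 
new name chosen for this sketch, and {{sc}} is the usual shell SparkContext):

{code}
// Same reproducer, but the non-serializable Path is bound to a name that is not
// shadowed inside the closure, so the shell-synthesized line object no longer
// drags it into the serialized task.
val x = 5
val hadoopPath = new org.apache.hadoop.fs.Path("/")  // non-serializable, but unused by the closure
sc.parallelize(0 until 10).map { _ =>
  val instances = 3
  (instances, x)
}.collect()
{code}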

> Closure cleaner does not null shadowed fields when outer scope is referenced
> 
>
> Key: SPARK-1866
> URL: https://issues.apache.org/jira/browse/SPARK-1866
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Assignee: Kan Zhang
>Priority: Critical
> Fix For: 1.1.0, 1.0.1
>
>
> Take the following example:
> {code}
> val x = 5
> val instances = new org.apache.hadoop.fs.Path("/") /* non-serializable */
> sc.parallelize(0 until 10).map { _ =>
>   val instances = 3
>   (instances, x)
> }.collect
> {code}
> This produces a "java.io.NotSerializableException: 
> org.apache.hadoop.fs.Path", despite the fact that the outer instances is not 
> actually used within the closure. If you change the name of the outer 
> variable instances to something else, the code executes correctly, indicating 
> that it is the fact that the two variables share a name that causes the issue.
> Additionally, if the outer scope is not used (i.e., we do not reference "x" 
> in the above example), the issue does not appear.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1519) support minPartitions parameter of wholeTextFiles() in pyspark

2014-05-21 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1519:
-

Fix Version/s: 1.0.1
   1.1.0

> support minPartitions parameter of wholeTextFiles() in pyspark
> --
>
> Key: SPARK-1519
> URL: https://issues.apache.org/jira/browse/SPARK-1519
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Nan Zhu
>Assignee: Kan Zhang
> Fix For: 1.1.0, 1.0.1
>
>
> though the Scala implementation provides the minPartitions parameter in 
> wholeTextFiles, PySpark doesn't support it yet; 
> should be easy to add in context.py



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1519) support minPartitions parameter of wholeTextFiles() in pyspark

2014-05-21 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang resolved SPARK-1519.
--

Resolution: Fixed

> support minPartitions parameter of wholeTextFiles() in pyspark
> --
>
> Key: SPARK-1519
> URL: https://issues.apache.org/jira/browse/SPARK-1519
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Nan Zhu
>Assignee: Kan Zhang
> Fix For: 1.1.0, 1.0.1
>
>
> though the Scala implementation provides the minPartitions parameter in 
> wholeTextFiles, PySpark doesn't support it yet; 
> should be easy to add in context.py



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1822) SchemaRDD.count() should use the optimizer.

2014-05-20 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003968#comment-14003968
 ] 

Kan Zhang commented on SPARK-1822:
--

PR: https://github.com/apache/spark/pull/841

> SchemaRDD.count() should use the optimizer.
> ---
>
> Key: SPARK-1822
> URL: https://issues.apache.org/jira/browse/SPARK-1822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Kan Zhang
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1817) RDD zip erroneous when partitions do not divide RDD count

2014-05-14 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998153#comment-13998153
 ] 

Kan Zhang commented on SPARK-1817:
--

I opened SPARK-1837 as a specific fix for the error reported in the description.

> RDD zip erroneous when partitions do not divide RDD count
> -
>
> Key: SPARK-1817
> URL: https://issues.apache.org/jira/browse/SPARK-1817
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0, 1.0.0
>Reporter: Michael Malak
>Assignee: Kan Zhang
>Priority: Blocker
>
> Example:
> scala> sc.parallelize(1L to 2L,4).zip(sc.parallelize(11 to 12,4)).collect
> res1: Array[(Long, Int)] = Array((2,11))
> But more generally, it's whenever the number of partitions does not evenly 
> divide the total number of elements in the RDD.
> See https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-937) Executors that exit cleanly should not have KILLED status

2014-05-14 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13992631#comment-13992631
 ] 

Kan Zhang commented on SPARK-937:
-

Hi Aaron, are you still working on this one? If not, could you assign it to me? 
I have a PR for SPARK-1118 (closed as a duplicate of this JIRA) that I could 
re-submit for this one. If you are still working on it or plan to, feel free to 
pick whatever might be useful to you: https://github.com/apache/spark/pull/306

> Executors that exit cleanly should not have KILLED status
> -
>
> Key: SPARK-937
> URL: https://issues.apache.org/jira/browse/SPARK-937
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 0.7.3
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
>Priority: Minor
>
> This is an unintuitive and overloaded status message when Executors are 
> killed during normal termination of an application.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1817) RDD zip erroneous when partitions do not divide RDD count

2014-05-13 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1817:
-

Affects Version/s: 1.0.0

> RDD zip erroneous when partitions do not divide RDD count
> -
>
> Key: SPARK-1817
> URL: https://issues.apache.org/jira/browse/SPARK-1817
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0, 1.0.0
>Reporter: Michael Malak
>Assignee: Kan Zhang
>Priority: Blocker
>
> Example:
> scala> sc.parallelize(1L to 2L,4).zip(sc.parallelize(11 to 12,4)).collect
> res1: Array[(Long, Int)] = Array((2,11))
> But more generally, it's whenever the number of partitions does not evenly 
> divide the total number of elements in the RDD.
> See https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1817) RDD zip erroneous when partitions do not divide RDD count

2014-05-13 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1817:
-

Component/s: Spark Core

> RDD zip erroneous when partitions do not divide RDD count
> -
>
> Key: SPARK-1817
> URL: https://issues.apache.org/jira/browse/SPARK-1817
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Michael Malak
>Assignee: Kan Zhang
>Priority: Blocker
>
> Example:
> scala> sc.parallelize(1L to 2L,4).zip(sc.parallelize(11 to 12,4)).collect
> res1: Array[(Long, Int)] = Array((2,11))
> But more generally, it's whenever the number of partitions does not evenly 
> divide the total number of elements in the RDD.
> See https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1817) RDD zip erroneous when partitions do not divide RDD count

2014-05-13 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996818#comment-13996818
 ] 

Kan Zhang commented on SPARK-1817:
--

PR: https://github.com/apache/spark/pull/760

> RDD zip erroneous when partitions do not divide RDD count
> -
>
> Key: SPARK-1817
> URL: https://issues.apache.org/jira/browse/SPARK-1817
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 0.9.0
>Reporter: Michael Malak
>Assignee: Kan Zhang
>Priority: Blocker
>
> Example:
> scala> sc.parallelize(1L to 2L,4).zip(sc.parallelize(11 to 12,4)).collect
> res1: Array[(Long, Int)] = Array((2,11))
> But more generally, it's whenever the number of partitions does not evenly 
> divide the total number of elements in the RDD.
> See https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1161) Add saveAsObjectFile and SparkContext.objectFile in Python

2014-05-12 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996081#comment-13996081
 ] 

Kan Zhang commented on SPARK-1161:
--

PR: https://github.com/apache/spark/pull/755

> Add saveAsObjectFile and SparkContext.objectFile in Python
> --
>
> Key: SPARK-1161
> URL: https://issues.apache.org/jira/browse/SPARK-1161
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Matei Zaharia
>Assignee: Kan Zhang
>
> It can use pickling for serialization and a SequenceFile on disk similar to 
> the JVM versions of these.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1118) Executor state shows as KILLED even the application is finished normally

2014-05-12 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958329#comment-13958329
 ] 

Kan Zhang edited comment on SPARK-1118 at 5/9/14 6:55 PM:
--

PR: https://github.com/apache/spark/pull/306



was (Author: kzhang):
I took a look at running SparkPi on my single-node cluster (laptop). There seem 
to be 2 issues.

1. All the work was done in the first executor. When the job is done, the driver 
asks the executor to shut down. However, this clean exit was assigned the FAILED 
executor state by the Worker. I introduced an EXITED executor state for executors 
that voluntarily exit (covering both normal and abnormal exits, depending on the 
exit code).

2. When the Master is notified that the first executor exited, it launches a 
second one, which is not needed and subsequently gets killed when the App 
disassociates. We could change the scheduler to tell the Master the job is done 
so that the Master wouldn't start the second executor. However, there is a race 
condition between the App telling the Master the job is done and the Worker 
telling the Master the first executor exited. There is no guarantee the former 
will happen before the latter. Instead, I chose to check the exit code when an 
executor exits. If the exit code is 0, I assume the executor has been asked to 
shut down by the driver, and the Master will not schedule new executors. This 
avoids launching the second executor, and consequently no executor is killed in 
the Worker's log. However, it is still possible (although it didn't happen on my 
local cluster) that the first executor gets killed by the Master, if the Master 
detects the App disassociation event before the first executor exits. The order 
of these events can't be guaranteed since they come from different paths. If an 
executor does get killed, I favor leaving its state as KILLED, even though the 
App state may be FINISHED.

Here's the PR. Please let me know what else I can do.

https://github.com/apache/spark/pull/306
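
As an illustration only (this is not Spark's actual Master code, and the handler 
name is made up for this sketch), the decision described in point 2 boils down to 
branching on the exit code:

{code}
// Hypothetical handler, for illustration: exit code 0 means the driver asked the
// executor to shut down, so no replacement is scheduled; any other code is treated
// as a failure that may warrant a replacement executor.
def onExecutorExited(exitCode: Int): Unit = {
  if (exitCode == 0) {
    // Voluntary exit (EXITED): the application is done with this executor; do nothing.
  } else {
    // Abnormal exit (FAILED): the Master may schedule a replacement executor.
  }
}
{code}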


> Executor state shows as KILLED even the application is finished normally
> 
>
> Key: SPARK-1118
> URL: https://issues.apache.org/jira/browse/SPARK-1118
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Nan Zhu
>Assignee: Kan Zhang
>Priority: Minor
> Fix For: 1.0.0
>
>
> This seems weird, ExecutorState has no option of FINISHED, a terminated 
> executor can only be KILLED, FAILED, LOST



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1519) support minPartitions parameter of wholeTextFiles() in pyspark

2014-05-10 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13992980#comment-13992980
 ] 

Kan Zhang commented on SPARK-1519:
--

PR: https://github.com/apache/spark/pull/697

> support minPartitions parameter of wholeTextFiles() in pyspark
> --
>
> Key: SPARK-1519
> URL: https://issues.apache.org/jira/browse/SPARK-1519
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Nan Zhu
>
> though the Scala implementation provides the minPartitions parameter in 
> wholeTextFiles, PySpark doesn't support it yet; 
> should be easy to add in context.py



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1687) Support NamedTuples in RDDs

2014-05-05 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang reassigned SPARK-1687:


Assignee: Kan Zhang

> Support NamedTuples in RDDs
> ---
>
> Key: SPARK-1687
> URL: https://issues.apache.org/jira/browse/SPARK-1687
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.0.0
> Environment: Spark version 1.0.0-SNAPSHOT
> Python 2.7.5
>Reporter: Pat McDonough
>Assignee: Kan Zhang
>
> Add Support for NamedTuples in RDDs. Some sample code is below, followed by 
> the current error that comes from it.
> Based on a quick conversation with [~ahirreddy], 
> [Dill|https://github.com/uqfoundation/dill] might be a good solution here.
> {code}
> In [26]: from collections import namedtuple
> ...
> In [33]: Person = namedtuple('Person', 'id firstName lastName')
> In [34]: jon = Person(1, "Jon", "Doe")
> In [35]: jane = Person(2, "Jane", "Doe")
> In [36]: theDoes = sc.parallelize((jon, jane))
> In [37]: theDoes.collect()
> Out[37]: 
> [Person(id=1, firstName='Jon', lastName='Doe'),
>  Person(id=2, firstName='Jane', lastName='Doe')]
> In [38]: theDoes.count()
> PySpark worker failed with exception:
> PySpark worker failed with exception:
> Traceback (most recent call last):
>   File "/Users/pat/Projects/spark/python/pyspark/worker.py", line 77, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in 
> pipeline_func
> return func(split, prev_func(split, iterator))
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in 
> pipeline_func
> return func(split, prev_func(split, iterator))
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 283, in func
> def func(s, iterator): return f(iterator)
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in 
> 
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in 
> 
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 129, 
> in load_stream
> yield self._read_with_length(stream)
>   File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 146, 
> in _read_with_length
> return self.loads(obj)
> AttributeError: 'module' object has no attribute 'Person'
> Traceback (most recent call last):
>   File "/Users/pat/Projects/spark/python/pyspark/worker.py", line 77, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in 
> pipeline_func
> return func(split, prev_func(split, iterator))
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in 
> pipeline_func
> return func(split, prev_func(split, iterator))
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 283, in func
> def func(s, iterator): return f(iterator)
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in 
> 
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in 
> 
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 129, 
> in load_stream
> yield self._read_with_length(stream)
>   File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 146, 
> in _read_with_length
> return self.loads(obj)
> AttributeError: 'module' object has no attribute 'Person'
> 14/04/30 14:43:53 ERROR Executor: Exception in task ID 23
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/Users/pat/Projects/spark/python/pyspark/worker.py", line 77, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in 
> pipeline_func
> return func(split, prev_func(split, iterator))
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in 
> pipeline_func
> return func(split, prev_func(split, iterator))
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 283, in func
> def func(s, iterator): return f(iterator)
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in 
> 
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in 
> 
> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>   File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 129, 
> in load_stream
> yield self._read_with_length(stream)
>   File "/Use

[jira] [Commented] (SPARK-1690) RDD.saveAsTextFile throws scala.MatchError if RDD contains empty elements

2014-05-04 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989310#comment-13989310
 ] 

Kan Zhang commented on SPARK-1690:
--

PR: https://github.com/apache/spark/pull/644

> RDD.saveAsTextFile throws scala.MatchError if RDD contains empty elements
> -
>
> Key: SPARK-1690
> URL: https://issues.apache.org/jira/browse/SPARK-1690
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.0
> Environment: Linux/CentOS6, Spark 0.9.1, standalone mode against HDFS 
> from Hadoop 1.2.1
>Reporter: Glenn K. Lockwood
>Assignee: Kan Zhang
>Priority: Minor
>
> The following pyspark code fails with a scala.MatchError exception if 
> sample.txt contains any empty lines:
> file = sc.textFile('hdfs://gcn-3-45.ibnet0:54310/user/glock/sample.txt')
> file.saveAsTextFile('hdfs://gcn-3-45.ibnet0:54310/user/glock/sample.out')
> The resulting stack trace:
> 14/04/30 17:02:46 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/04/30 17:02:46 WARN scheduler.TaskSetManager: Loss was due to 
> scala.MatchError
> scala.MatchError: 0 (of class java.lang.Integer)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:129)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:119)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:112)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeToFile$1(PairRDDFunctions.scala:732)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:741)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:741)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
>   at org.apache.spark.scheduler.Task.run(Task.scala:53)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:722)
> This can be reproduced with a sample.txt containing
> """
> foo
> 
> bar
> """
> and disappears if sample.txt is
> """
> foo
> bar
> """



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1690) RDD.saveAsTextFile throws scala.MatchError if RDD contains empty elements

2014-05-04 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang reassigned SPARK-1690:


Assignee: Kan Zhang

> RDD.saveAsTextFile throws scala.MatchError if RDD contains empty elements
> -
>
> Key: SPARK-1690
> URL: https://issues.apache.org/jira/browse/SPARK-1690
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.0
> Environment: Linux/CentOS6, Spark 0.9.1, standalone mode against HDFS 
> from Hadoop 1.2.1
>Reporter: Glenn K. Lockwood
>Assignee: Kan Zhang
>Priority: Minor
>
> The following pyspark code fails with a scala.MatchError exception if 
> sample.txt contains any empty lines:
> file = sc.textFile('hdfs://gcn-3-45.ibnet0:54310/user/glock/sample.txt')
> file.saveAsTextFile('hdfs://gcn-3-45.ibnet0:54310/user/glock/sample.out')
> The resulting stack trace:
> 14/04/30 17:02:46 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/04/30 17:02:46 WARN scheduler.TaskSetManager: Loss was due to 
> scala.MatchError
> scala.MatchError: 0 (of class java.lang.Integer)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:129)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:119)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:112)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeToFile$1(PairRDDFunctions.scala:732)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:741)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:741)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
>   at org.apache.spark.scheduler.Task.run(Task.scala:53)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:722)
> This can be reproduced with a sample.txt containing
> """
> foo
> 
> bar
> """
> and disappears if sample.txt is
> """
> foo
> bar
> """



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1641) Spark submit warning tells the user to use spark-submit

2014-04-26 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982109#comment-13982109
 ] 

Kan Zhang commented on SPARK-1641:
--

Andrew, I think this is a duplicate of SPARK-1534. 

I haven't gotten around to working on it. Feel free to take it. 

> Spark submit warning tells the user to use spark-submit
> ---
>
> Key: SPARK-1641
> URL: https://issues.apache.org/jira/browse/SPARK-1641
> Project: Spark
>  Issue Type: Improvement
>Reporter: Andrew Or
>Priority: Minor
>
> $ bin/spark-submit ...
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> WARNING: This client is deprecated and will be removed in a future version of 
> Spark.
> Use ./bin/spark-submit with "--master yarn"
> This is printed in org.apache.spark.deploy.yarn.Client.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1604) Couldn't run spark-submit with yarn cluster mode when built with assemble-deps

2014-04-25 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1604:
-

Description: 
{code}
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
 ./bin/spark-submit 
./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
Exception in thread "main" java.lang.ClassNotFoundException: 
org.apache.spark.deploy.yarn.Client
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

  was:
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
 ./bin/spark-submit 
./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
Exception in thread "main" java.lang.ClassNotFoundException: 
org.apache.spark.deploy.yarn.Client
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


> Couldn't run spark-submit with yarn cluster mode when built with assemble-deps
> --
>
> Key: SPARK-1604
> URL: https://issues.apache.org/jira/browse/SPARK-1604
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>
> {code}
> SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
>  ./bin/spark-submit 
> ./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
> yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
> Exception in thread "main" java.lang.ClassNotFoundException: 
> org.apache.spark.deploy.yarn.Client
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1604) Couldn't run spark-submit with yarn cluster mode when built with assemble-ceps

2014-04-25 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1604:
-

Summary: Couldn't run spark-submit with yarn cluster mode when built with 
assemble-ceps  (was: Couldn't run spark-submit with yarn cluster mode when 
using deps jar)

> Couldn't run spark-submit with yarn cluster mode when built with assemble-ceps
> --
>
> Key: SPARK-1604
> URL: https://issues.apache.org/jira/browse/SPARK-1604
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>
> SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
>  ./bin/spark-submit 
> ./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
> yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
> Exception in thread "main" java.lang.ClassNotFoundException: 
> org.apache.spark.deploy.yarn.Client
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1604) Couldn't run spark-submit with yarn cluster mode when built with assemble-deps

2014-04-25 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1604:
-

Summary: Couldn't run spark-submit with yarn cluster mode when built with 
assemble-deps  (was: Couldn't run spark-submit with yarn cluster mode when 
built with assemble-ceps)

> Couldn't run spark-submit with yarn cluster mode when built with assemble-deps
> --
>
> Key: SPARK-1604
> URL: https://issues.apache.org/jira/browse/SPARK-1604
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>
> SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
>  ./bin/spark-submit 
> ./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
> yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
> Exception in thread "main" java.lang.ClassNotFoundException: 
> org.apache.spark.deploy.yarn.Client
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1604) Couldn't run spark-submit with yarn cluster mode when using deps jar

2014-04-25 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13980783#comment-13980783
 ] 

Kan Zhang commented on SPARK-1604:
--

My bad. Just tested: the issue remains without including the assemble-deps jar.

Agreed on the examples jar not including Spark.

> Couldn't run spark-submit with yarn cluster mode when using deps jar
> 
>
> Key: SPARK-1604
> URL: https://issues.apache.org/jira/browse/SPARK-1604
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>
> SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
>  ./bin/spark-submit 
> ./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
> yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
> Exception in thread "main" java.lang.ClassNotFoundException: 
> org.apache.spark.deploy.yarn.Client
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1571) UnresolvedException when running JavaSparkSQL example

2014-04-24 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13980034#comment-13980034
 ] 

Kan Zhang commented on SPARK-1571:
--

Thanks, it worked.

> UnresolvedException when running JavaSparkSQL example
> -
>
> Key: SPARK-1571
> URL: https://issues.apache.org/jira/browse/SPARK-1571
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>Assignee: Michael Armbrust
>Priority: Blocker
>
> When running JavaSparkSQL example using spark-submit in local mode (this 
> happens after fixing the class loading issue in SPARK-1570).
> 14/04/22 12:46:47 ERROR Executor: Exception in task ID 0
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: 'age
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:47)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.c2(Expression.scala:203)
>   at 
> org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual.eval(predicates.scala:142)
>   at 
> org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:84)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1604) Couldn't run spark-submit with yarn cluster mode when using deps jar

2014-04-24 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1604:
-

Priority: Major  (was: Blocker)

> Couldn't run spark-submit with yarn cluster mode when using deps jar
> 
>
> Key: SPARK-1604
> URL: https://issues.apache.org/jira/browse/SPARK-1604
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>
> SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
>  ./bin/spark-submit 
> ./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
> yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
> Exception in thread "main" java.lang.ClassNotFoundException: 
> org.apache.spark.deploy.yarn.Client
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1604) Couldn't run spark-submit with yarn cluster mode when using deps jar

2014-04-24 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13980003#comment-13980003
 ] 

Kan Zhang commented on SPARK-1604:
--

Sure, lowered it to Major.

> Couldn't run spark-submit with yarn cluster mode when using deps jar
> 
>
> Key: SPARK-1604
> URL: https://issues.apache.org/jira/browse/SPARK-1604
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>
> SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
>  ./bin/spark-submit 
> ./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
> yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
> Exception in thread "main" java.lang.ClassNotFoundException: 
> org.apache.spark.deploy.yarn.Client
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1604) Couldn't run spark-submit with yarn cluster mode when using deps jar

2014-04-24 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979994#comment-13979994
 ] 

Kan Zhang edited comment on SPARK-1604 at 4/24/14 5:38 PM:
---

Ah, that could be the reason. I was using sbt assemble-deps and then package to 
build. Just verified: when building the normal sbt assembly jar, the problem 
disappears. Could be an issue with the former build sequence. 

Moving this to BUILD.


was (Author: kzhang):
Ah, that could be the reason. I was using sbt assemble-deps and then package to 
build. Just verified, when building the normal sbt assembly jar, problem 
disappears. Could be a problem with the former build sequence. 

Moving this BUILD.

> Couldn't run spark-submit with yarn cluster mode when using deps jar
> 
>
> Key: SPARK-1604
> URL: https://issues.apache.org/jira/browse/SPARK-1604
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>Priority: Blocker
>
> SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
>  ./bin/spark-submit 
> ./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
> yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
> Exception in thread "main" java.lang.ClassNotFoundException: 
> org.apache.spark.deploy.yarn.Client
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1604) Couldn't run spark-submit with yarn cluster mode when using deps jar

2014-04-24 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1604:
-

Summary: Couldn't run spark-submit with yarn cluster mode when using deps 
jar  (was: YARN cluster mode broken)

> Couldn't run spark-submit with yarn cluster mode when using deps jar
> 
>
> Key: SPARK-1604
> URL: https://issues.apache.org/jira/browse/SPARK-1604
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>Priority: Blocker
>
> SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
>  ./bin/spark-submit 
> ./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
> yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
> Exception in thread "main" java.lang.ClassNotFoundException: 
> org.apache.spark.deploy.yarn.Client
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1604) YARN cluster mode broken

2014-04-24 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979994#comment-13979994
 ] 

Kan Zhang commented on SPARK-1604:
--

Ah, that could be the reason. I was using sbt assemble-deps and then package to 
build. Just verified, when building the normal sbt assembly jar, problem 
disappears. Could be a problem with the former build sequence. 

Moving this BUILD.

> YARN cluster mode broken
> 
>
> Key: SPARK-1604
> URL: https://issues.apache.org/jira/browse/SPARK-1604
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>Priority: Blocker
>
> SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
>  ./bin/spark-submit 
> ./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
> yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
> Exception in thread "main" java.lang.ClassNotFoundException: 
> org.apache.spark.deploy.yarn.Client
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1604) YARN cluster mode broken

2014-04-24 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1604:
-

Component/s: (was: YARN)
 Build

> YARN cluster mode broken
> 
>
> Key: SPARK-1604
> URL: https://issues.apache.org/jira/browse/SPARK-1604
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>Priority: Blocker
>
> SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
>  ./bin/spark-submit 
> ./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
> yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
> Exception in thread "main" java.lang.ClassNotFoundException: 
> org.apache.spark.deploy.yarn.Client
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1604) YARN cluster mode broken

2014-04-24 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979624#comment-13979624
 ] 

Kan Zhang commented on SPARK-1604:
--

I doubt it, since when I ran it in YARN client mode, it did work.

> YARN cluster mode broken
> 
>
> Key: SPARK-1604
> URL: https://issues.apache.org/jira/browse/SPARK-1604
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>Priority: Blocker
>
> SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
>  ./bin/spark-submit 
> ./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
> yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
> Exception in thread "main" java.lang.ClassNotFoundException: 
> org.apache.spark.deploy.yarn.Client
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1604) YARN cluster mode broken

2014-04-23 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1604:
-

Affects Version/s: 1.0.0

> YARN cluster mode broken
> 
>
> Key: SPARK-1604
> URL: https://issues.apache.org/jira/browse/SPARK-1604
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>
> SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
>  ./bin/spark-submit 
> ./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
> yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
> Exception in thread "main" java.lang.ClassNotFoundException: 
> org.apache.spark.deploy.yarn.Client
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1604) YARN cluster mode broken

2014-04-23 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1604:
-

Priority: Blocker  (was: Major)

> YARN cluster mode broken
> 
>
> Key: SPARK-1604
> URL: https://issues.apache.org/jira/browse/SPARK-1604
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>Priority: Blocker
>
> SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
>  ./bin/spark-submit 
> ./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
> yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
> Exception in thread "main" java.lang.ClassNotFoundException: 
> org.apache.spark.deploy.yarn.Client
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1604) YARN cluster mode broken

2014-04-23 Thread Kan Zhang (JIRA)
Kan Zhang created SPARK-1604:


 Summary: YARN cluster mode broken
 Key: SPARK-1604
 URL: https://issues.apache.org/jira/browse/SPARK-1604
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Kan Zhang


SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
 ./bin/spark-submit 
./examples/target/scala-2.10/spark-examples_2.10-1.0.0-SNAPSHOT.jar --master 
yarn --deploy-mode cluster --class org.apache.spark.examples.sql.JavaSparkSQL 
Exception in thread "main" java.lang.ClassNotFoundException: 
org.apache.spark.deploy.yarn.Client
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:234)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1570) Class loading issue when using Spark SQL Java API

2014-04-22 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang resolved SPARK-1570.
--

Resolution: Fixed

> Class loading issue when using Spark SQL Java API
> -
>
> Key: SPARK-1570
> URL: https://issues.apache.org/jira/browse/SPARK-1570
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>Priority: Blocker
> Fix For: 1.0.0
>
>
> ClassNotFoundException in Executor when running JavaSparkSQL example using 
> spark-submit in local mode.
> 14/04/22 12:26:20 ERROR Executor: Exception in task ID 0
> java.lang.ClassNotFoundException: 
> org.apache.spark.examples.sql.JavaSparkSQL.Person
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:190)
>   at 
> org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:90)
>   at 
> org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:88)
>   at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
>   at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1571) UnresolvedException when running JavaSparkSQL example

2014-04-22 Thread Kan Zhang (JIRA)
Kan Zhang created SPARK-1571:


 Summary: UnresolvedException when running JavaSparkSQL example
 Key: SPARK-1571
 URL: https://issues.apache.org/jira/browse/SPARK-1571
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Kan Zhang


When run JavaSparkSQL example using spark-submit in local mode (this happens 
after fixing the class loading issue in SPARK-1570).

14/04/22 12:46:47 ERROR Executor: Exception in task ID 0
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
dataType on unresolved object, tree: 'age
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:49)
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:47)
at 
org.apache.spark.sql.catalyst.expressions.Expression.c2(Expression.scala:203)
at 
org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual.eval(predicates.scala:142)
at 
org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:84)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1571) UnresolvedException when running JavaSparkSQL example

2014-04-22 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1571:
-

Description: 
When running JavaSparkSQL example using spark-submit in local mode (this 
happens after fixing the class loading issue in SPARK-1570).

14/04/22 12:46:47 ERROR Executor: Exception in task ID 0
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
dataType on unresolved object, tree: 'age
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:49)
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:47)
at 
org.apache.spark.sql.catalyst.expressions.Expression.c2(Expression.scala:203)
at 
org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual.eval(predicates.scala:142)
at 
org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:84)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)


  was:
When run JavaSparkSQL example using spark-submit in local mode (this happens 
after fixing the class loading issue in SPARK-1570).

14/04/22 12:46:47 ERROR Executor: Exception in task ID 0
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
dataType on unresolved object, tree: 'age
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:49)
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:47)
at 
org.apache.spark.sql.catalyst.expressions.Expression.c2(Expression.scala:203)
at 
org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual.eval(predicates.scala:142)
at 
org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:84)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)



> UnresolvedException when running JavaSparkSQL example
> -
>
> Key: SPARK-1571
> URL: https://issues.apache.org/jira/browse/SPARK-1571
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>
> When running JavaSparkSQL example using spark-submit in local mode (this 
> happens after fixing the class loading issue in SPARK-1570).
> 14/04/22 12:46:47 ERROR Executor: Exception in task ID 0
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: 'age
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:47)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.c2(Expression.scala:203)
>   at 
> org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual.eval(predicates.scala:142)
>   at 
> org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:84)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1570) Class loading issue when using Spark SQL Java API

2014-04-22 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1570:
-

Description: 
ClassNotFoundException in Executor when running JavaSparkSQL example using 
spark-submit in local mode.

14/04/22 12:26:20 ERROR Executor: Exception in task ID 0
java.lang.ClassNotFoundException: 
org.apache.spark.examples.sql.JavaSparkSQL.Person
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:90)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:88)
at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)


  was:
ClassNotFoundException in Executor when running JavaSparkSQL example using 
spark-submit.

14/04/22 12:26:20 ERROR Executor: Exception in task ID 0
java.lang.ClassNotFoundException: 
org.apache.spark.examples.sql.JavaSparkSQL.Person
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:90)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:88)
at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)



> Class loading issue when using Spark SQL Java API
> -
>
> Key: SPARK-1570
> URL: https://issues.apache.org/jira/browse/SPARK-1570
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>Priority: Blocker
> Fix For: 1.0.0
>
>
> ClassNotFoundException in Executor when running JavaSparkSQL example using 
> spark-submit in local mode.
> 14/04/22 12:26:20 ERROR Executor: Exception in task ID 0
> java.lang.ClassNotFoundException: 
> org.apache.spark.examples.sql.JavaSparkSQL.Person
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:190)
>   at 
> org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:90)
>   at 
> org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:88)
>   at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
>   at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1570) Class loading issue when using Spark SQL Java API

2014-04-22 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977279#comment-13977279
 ] 

Kan Zhang commented on SPARK-1570:
--

PR: https://github.com/apache/spark/pull/484

> Class loading issue when using Spark SQL Java API
> -
>
> Key: SPARK-1570
> URL: https://issues.apache.org/jira/browse/SPARK-1570
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>Priority: Blocker
> Fix For: 1.0.0
>
>
> ClassNotFoundException in Executor when running JavaSparkSQL example using 
> spark-submit.
> 14/04/22 12:26:20 ERROR Executor: Exception in task ID 0
> java.lang.ClassNotFoundException: 
> org.apache.spark.examples.sql.JavaSparkSQL.Person
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:190)
>   at 
> org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:90)
>   at 
> org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:88)
>   at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
>   at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1570) Class loading issue when using Spark SQL Java API

2014-04-22 Thread Kan Zhang (JIRA)
Kan Zhang created SPARK-1570:


 Summary: Class loading issue when using Spark SQL Java API
 Key: SPARK-1570
 URL: https://issues.apache.org/jira/browse/SPARK-1570
 Project: Spark
  Issue Type: Bug
  Components: Java API, SQL
Affects Versions: 1.0.0
Reporter: Kan Zhang
Assignee: Kan Zhang
Priority: Blocker
 Fix For: 1.0.0


ClassNotFoundException in Executor when running JavaSparkSQL example using 
spark-submit.

14/04/22 12:26:20 ERROR Executor: Exception in task ID 0
java.lang.ClassNotFoundException: 
org.apache.spark.examples.sql.JavaSparkSQL.Person
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:90)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:88)
at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1480) Choose classloader consistently inside of Spark codebase

2014-04-22 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977064#comment-13977064
 ] 

Kan Zhang commented on SPARK-1480:
--

[~pwendell] do you mind posting a link to the PR? Thx.

> Choose classloader consistently inside of Spark codebase
> 
>
> Key: SPARK-1480
> URL: https://issues.apache.org/jira/browse/SPARK-1480
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Blocker
> Fix For: 1.0.0
>
>
> The Spark codebase is not always consistent on which class loader it uses 
> when classlaoders are explicitly passed to things like serializers. This 
> caused SPARK-1403 and also causes a bug where when the driver has a modified 
> context class loader it is not translated correctly in local mode to the 
> (local) executor.
> In most cases what we want is the following behavior:
> 1. If there is a context classloader on the thread, use that.
> 2. Otherwise use the classloader that loaded Spark.
> We should just have a utility function for this and call that function 
> whenever we need to get a classloader.
> Note that SPARK-1403 is a workaround for this exact problem (it sets the 
> context class loader because downstream code assumes it is set). Once this 
> gets fixed in a more general way SPARK-1403 can be reverted.
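
For illustration, a minimal sketch of the utility function described in the 
issue above (the object and method names here are made up, not taken from the 
actual PR):

    object ClassLoaderUtil {
      // Prefer the thread's context class loader; otherwise fall back to the
      // loader that loaded this class (standing in for the loader that loaded
      // Spark).
      def preferredClassLoader: ClassLoader =
        Option(Thread.currentThread.getContextClassLoader)
          .getOrElse(getClass.getClassLoader)
    }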



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1534) spark-submit for yarn prints warnings even though calling as expected

2014-04-19 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang reassigned SPARK-1534:


Assignee: Kan Zhang

> spark-submit for yarn prints warnings even though calling as expected 
> --
>
> Key: SPARK-1534
> URL: https://issues.apache.org/jira/browse/SPARK-1534
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Thomas Graves
>Assignee: Kan Zhang
>
> I am calling spark-submit to submit application to spark on yarn (cluster 
> mode) and it is still printing warnings:
> $ ./bin/spark-submit  
> examples/target/scala-2.10/spark-examples_2.10-assembly-1.0.0-SNAPSHOT.jar  
> --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi 
> --arg yarn-cluster --properties-file ./spark-conf.properties 
> WARNING: This client is deprecated and will be removed in a future version of 
> Spark.
> Use ./bin/spark-submit with "--master yarn"
> --args is deprecated. Use --arg instead.
> The "--args is deprecated" message is coming out because SparkSubmit itself 
> needs to be updated to use --arg. 
> Similarly, I think the Client.scala class for yarn needs to have the "Use 
> ./bin/spark-submit with "--master yarn"" warning removed, since SparkSubmit 
> also calls it directly.
> I think the last one was supposed to warn users calling spark-class directly. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1475) Draining event logging queue before stopping event logger

2014-04-16 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967228#comment-13967228
 ] 

Kan Zhang edited comment on SPARK-1475 at 4/17/14 2:08 AM:
---

When event queue is not drained, users may observe similar issues as those 
reported in SPARK-1407 (when sc.stop() is not called). 

https://github.com/apache/spark/pull/366

The above PR fixes this issue. It does require applications to call sc.stop() 
to properly stop SparkListenerBus and event logger.


was (Author: kzhang):
When event queue is not drained, users may observe similar issues as those 
reported in SPARK-1407 (when sc.stop() is not called). 

https://github.com/apache/spark/pull/366

The above PR fixes this issue. It does require applications to call 
SparkContext.stop() to properly stop SparkListenerBus and event logger.

> Draining event logging queue before stopping event logger
> -
>
> Key: SPARK-1475
> URL: https://issues.apache.org/jira/browse/SPARK-1475
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>Priority: Blocker
> Fix For: 1.0.0
>
>
> When stopping SparkListenerBus, its event queue needs to be drained. And this 
> needs to happen before event logger is stopped. Otherwise, any event still 
> waiting to be processed in the queue may be lost and consequently event log 
> file may be incomplete. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1475) Draining event logging queue before stopping event logger

2014-04-16 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972225#comment-13972225
 ] 

Kan Zhang commented on SPARK-1475:
--

Here is a second PR that fixes the unit test introduced above.

https://github.com/apache/spark/pull/401

> Draining event logging queue before stopping event logger
> -
>
> Key: SPARK-1475
> URL: https://issues.apache.org/jira/browse/SPARK-1475
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>Priority: Blocker
> Fix For: 1.0.0
>
>
> When stopping SparkListenerBus, its event queue needs to be drained. And this 
> needs to happen before event logger is stopped. Otherwise, any event still 
> waiting to be processed in the queue may be lost and consequently event log 
> file may be incomplete. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1475) Draining event logging queue before stopping event logger

2014-04-15 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1475:
-

Affects Version/s: (was: 1.0.0)

> Draining event logging queue before stopping event logger
> -
>
> Key: SPARK-1475
> URL: https://issues.apache.org/jira/browse/SPARK-1475
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>Priority: Blocker
> Fix For: 1.0.0
>
>
> When stopping SparkListenerBus, its event queue needs to be drained. And this 
> needs to happen before event logger is stopped. Otherwise, any event still 
> waiting to be processed in the queue may be lost and consequently event log 
> file may be incomplete. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1475) Draining event logging queue before stopping event logger

2014-04-15 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1475:
-

Affects Version/s: 1.0.0

> Draining event logging queue before stopping event logger
> -
>
> Key: SPARK-1475
> URL: https://issues.apache.org/jira/browse/SPARK-1475
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>Priority: Blocker
> Fix For: 1.0.0
>
>
> When stopping SparkListenerBus, its event queue needs to be drained. And this 
> needs to happen before event logger is stopped. Otherwise, any event still 
> waiting to be processed in the queue may be lost and consequently event log 
> file may be incomplete. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1475) Draining event logging queue before stopping event logger

2014-04-11 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1475:
-

Description: When stopping SparkListenerBus, its event queue needs to be 
drained. And this needs to happen before event logger is stopped. Otherwise, 
any event still waiting to be processed in the queue may be lost and 
consequently event log file may be incomplete.   (was: When stopping 
SparkListenerBus, its event queue needs to be drained. And this needs to happen 
before event logger is stopped. Otherwise, any events still waiting to be 
processed in the queue will be lost and consequently event log file may be 
incomplete. )

> Draining event logging queue before stopping event logger
> -
>
> Key: SPARK-1475
> URL: https://issues.apache.org/jira/browse/SPARK-1475
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>Priority: Blocker
> Fix For: 1.0.0
>
>
> When stopping SparkListenerBus, its event queue needs to be drained. And this 
> needs to happen before event logger is stopped. Otherwise, any event still 
> waiting to be processed in the queue may be lost and consequently event log 
> file may be incomplete. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1475) Draining event logging queue before stopping event logger

2014-04-11 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1475:
-

Description: When stopping SparkListenerBus, its event queue needs to be 
drained. And this needs to happen before event logger is stopped. Otherwise, 
any events still waiting to be processed in the queue will be lost and 
consequently event log file may be incomplete.   (was: When SparkListenerBus 
thread is stopped, its event queue needs to be drained. And this needs to 
happen before event logger is stopped. Otherwise, any events still waiting to 
be processed in queue will be lost and consequently event log file may be 
incomplete. )

> Draining event logging queue before stopping event logger
> -
>
> Key: SPARK-1475
> URL: https://issues.apache.org/jira/browse/SPARK-1475
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>Priority: Blocker
> Fix For: 1.0.0
>
>
> When stopping SparkListenerBus, its event queue needs to be drained. And this 
> needs to happen before event logger is stopped. Otherwise, any events still 
> waiting to be processed in the queue will be lost and consequently event log 
> file may be incomplete. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1407) EventLogging to HDFS doesn't work properly on yarn

2014-04-11 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967234#comment-13967234
 ] 

Kan Zhang commented on SPARK-1407:
--

I opened SPARK-1475 to track the above PR.

> EventLogging to HDFS doesn't work properly on yarn
> --
>
> Key: SPARK-1407
> URL: https://issues.apache.org/jira/browse/SPARK-1407
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Blocker
>
> When running on spark on yarn and accessing an HDFS file (like in the 
> SparkHdfsLR example) while using the event logging configured to write logs 
> to HDFS, it throws an exception at the end of the application. 
> SPARK_JAVA_OPTS=-Dspark.eventLog.enabled=true 
> -Dspark.eventLog.dir=hdfs:///history/spark/
> 14/04/03 13:41:31 INFO yarn.ApplicationMaster$$anon$1: Invoking sc stop from 
> shutdown hook
> Exception in thread "Thread-41" java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:398)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1465)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.sync(DFSOutputStream.java:1450)
> at 
> org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:116)
> at 
> org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:137)
> at 
> org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:137)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.util.FileLogger.flush(FileLogger.scala:137)
> at 
> org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:69)
> at 
> org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:101)
> at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$13.apply(SparkListenerBus.scala:67)
> at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$13.apply(SparkListenerBus.scala:67)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:67)
> at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:31)
> at 
> org.apache.spark.scheduler.LiveListenerBus.post(LiveListenerBus.scala:78)
> at 
> org.apache.spark.SparkContext.postApplicationEnd(SparkContext.scala:1081)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:828)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:460)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1475) Draining event logging queue before stopping event logger

2014-04-11 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang resolved SPARK-1475.
--

Resolution: Fixed

> Draining event logging queue before stopping event logger
> -
>
> Key: SPARK-1475
> URL: https://issues.apache.org/jira/browse/SPARK-1475
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>Priority: Blocker
> Fix For: 1.0.0
>
>
> When SparkListenerBus thread is stopped, its event queue needs to be drained. 
> And this needs to happen before event logger is stopped. Otherwise, any 
> events still waiting to be processed in queue will be lost and consequently 
> event log file may be incomplete. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1475) Draining event logging queue before stopping event logger

2014-04-11 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967228#comment-13967228
 ] 

Kan Zhang commented on SPARK-1475:
--

When event queue is not drained, users may observe similar issues as those 
reported in SPARK-1407 (when sc.stop() is not called). 

https://github.com/apache/spark/pull/366

The above PR fixes this issue. It does require applications to call 
SparkContext.stop() to properly stop SparkListenerBus and event logger.

> Draining event logging queue before stopping event logger
> -
>
> Key: SPARK-1475
> URL: https://issues.apache.org/jira/browse/SPARK-1475
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>Priority: Blocker
> Fix For: 1.0.0
>
>
> When SparkListenerBus thread is stopped, its event queue needs to be drained. 
> And this needs to happen before event logger is stopped. Otherwise, any 
> events still waiting to be processed in queue will be lost and consequently 
> event log file may be incomplete. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1475) Draining event logging queue before stopping event logger

2014-04-11 Thread Kan Zhang (JIRA)
Kan Zhang created SPARK-1475:


 Summary: Draining event logging queue before stopping event logger
 Key: SPARK-1475
 URL: https://issues.apache.org/jira/browse/SPARK-1475
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Kan Zhang
Assignee: Kan Zhang
Priority: Blocker
 Fix For: 1.0.0


When SparkListenerBus thread is stopped, its event queue needs to be drained. 
And this needs to happen before event logger is stopped. Otherwise, any events 
still waiting to be processed in queue will be lost and consequently event log 
file may be incomplete. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1460) Set operations on SchemaRDDs are needlessly destructive of schema information.

2014-04-10 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang reassigned SPARK-1460:


Assignee: Kan Zhang

> Set operations on SchemaRDDs are needlessly destructive of schema information.
> --
>
> Key: SPARK-1460
> URL: https://issues.apache.org/jira/browse/SPARK-1460
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Kan Zhang
> Fix For: 1.1.0
>
>
> When you do a distinct of a subtract, you get back a normal RDD instead of a 
> schema RDD, even though the schema is unchanged.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1407) EventLogging to HDFS doesn't work properly on yarn

2014-04-09 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964867#comment-13964867
 ] 

Kan Zhang commented on SPARK-1407:
--

Here is one example of the exception I encountered. Note that the exact event 
(onJobEnd in this case) could be different.

Exception in thread "SparkListenerBus" java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:702)
at 
org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1832)
at 
org.apache.hadoop.hdfs.DFSOutputStream.hsync(DFSOutputStream.java:1815)
at 
org.apache.hadoop.hdfs.DFSOutputStream.hsync(DFSOutputStream.java:1798)
at 
org.apache.hadoop.fs.FSDataOutputStream.hsync(FSDataOutputStream.java:123)
at 
org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:138)
at 
org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:138)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.util.FileLogger.flush(FileLogger.scala:138)
at 
org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:64)
at 
org.apache.spark.scheduler.EventLoggingListener.onJobEnd(EventLoggingListener.scala:86)
at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$4.apply(SparkListenerBus.scala:49)
at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$4.apply(SparkListenerBus.scala:49)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:49)
at 
org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:31)
at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:61)


> EventLogging to HDFS doesn't work properly on yarn
> --
>
> Key: SPARK-1407
> URL: https://issues.apache.org/jira/browse/SPARK-1407
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Blocker
>
> When running on spark on yarn and accessing an HDFS file (like in the 
> SparkHdfsLR example) while using the event logging configured to write logs 
> to HDFS, it throws an exception at the end of the application. 
> SPARK_JAVA_OPTS=-Dspark.eventLog.enabled=true 
> -Dspark.eventLog.dir=hdfs:///history/spark/
> 14/04/03 13:41:31 INFO yarn.ApplicationMaster$$anon$1: Invoking sc stop from 
> shutdown hook
> Exception in thread "Thread-41" java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:398)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1465)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.sync(DFSOutputStream.java:1450)
> at 
> org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:116)
> at 
> org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:137)
> at 
> org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:137)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.util.FileLogger.flush(FileLogger.scala:137)
> at 
> org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:69)
> at 
> org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:101)
> at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$13.apply(SparkListenerBus.scala:67)
> at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$13.apply(SparkListenerBus.scala:67)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:67)
> at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:31)
> at 
> org.apache.spark.scheduler.LiveListenerBus.post(LiveListenerBus.scala:78)
> at 
> org.apache.spark.SparkContext.postApplicationEnd(SparkContext.scala:1081)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:828)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:460)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1407) EventLogging to HDFS doesn't work properly on yarn

2014-04-08 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13963809#comment-13963809
 ] 

Kan Zhang commented on SPARK-1407:
--

Made some changes to drain SparkListenerBus' event queue before flushing the 
FileLogger buffer and stopping it. This does require sc.stop() to be called 
for it to work, though.

https://github.com/apache/spark/pull/366
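
For illustration only, a minimal sketch of the drain-then-stop ordering (this 
is not the actual PR; the Event and EventLogger types below are made up):

    import java.util.concurrent.LinkedBlockingQueue

    case class Event(name: String)

    class EventLogger {
      def log(e: Event): Unit = println(e)
      def flush(): Unit = ()
      def close(): Unit = ()
    }

    class ListenerBusSketch(logger: EventLogger) {
      private val queue = new LinkedBlockingQueue[Event]()

      def post(e: Event): Unit = queue.put(e)

      // Drain every queued event into the logger before flushing and closing
      // it, so events still waiting in the queue are not lost.
      def stop(): Unit = {
        while (!queue.isEmpty) logger.log(queue.poll())
        logger.flush()
        logger.close()
      }
    }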

> EventLogging to HDFS doesn't work properly on yarn
> --
>
> Key: SPARK-1407
> URL: https://issues.apache.org/jira/browse/SPARK-1407
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Blocker
>
> When running on spark on yarn and accessing an HDFS file (like in the 
> SparkHdfsLR example) while using the event logging configured to write logs 
> to HDFS, it throws an exception at the end of the application. 
> SPARK_JAVA_OPTS=-Dspark.eventLog.enabled=true 
> -Dspark.eventLog.dir=hdfs:///history/spark/
> 14/04/03 13:41:31 INFO yarn.ApplicationMaster$$anon$1: Invoking sc stop from 
> shutdown hook
> Exception in thread "Thread-41" java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:398)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1465)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.sync(DFSOutputStream.java:1450)
> at 
> org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:116)
> at 
> org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:137)
> at 
> org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:137)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.util.FileLogger.flush(FileLogger.scala:137)
> at 
> org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:69)
> at 
> org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:101)
> at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$13.apply(SparkListenerBus.scala:67)
> at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$13.apply(SparkListenerBus.scala:67)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:67)
> at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:31)
> at 
> org.apache.spark.scheduler.LiveListenerBus.post(LiveListenerBus.scala:78)
> at 
> org.apache.spark.SparkContext.postApplicationEnd(SparkContext.scala:1081)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:828)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:460)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1407) EventLogging to HDFS doesn't work properly on yarn

2014-04-08 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13963688#comment-13963688
 ] 

Kan Zhang commented on SPARK-1407:
--

I encountered a similar issue. It is not limited to YARN, but shows up in 
Standalone mode as well. When System.exit is invoked without calling sc.stop(), 
Hadoop FileSystem's shutdown hook gets called, which closes the filesystem 
without flushing the client buffer, and the event log file gets truncated. 
Calling sc.stop() seems to work in my test, since it flushes the 
hadoopDataStream buffer before closing. However, this may not completely solve 
the problem, as events are written to the buffer asynchronously by the 
SparkListenerBus thread.
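
As an illustration of the workaround from the application side, a minimal 
sketch (the object name and job are made up), reusing the event-log settings 
from the issue description:

    import org.apache.spark.{SparkConf, SparkContext}

    object EventLogDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("EventLogDemo")
          .set("spark.eventLog.enabled", "true")
          .set("spark.eventLog.dir", "hdfs:///history/spark/")
        val sc = new SparkContext(conf)
        try {
          sc.parallelize(1 to 1000).count()
        } finally {
          // Stop the context explicitly so the event log is flushed before
          // Hadoop's FileSystem shutdown hook closes the filesystem.
          sc.stop()
        }
      }
    }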

> EventLogging to HDFS doesn't work properly on yarn
> --
>
> Key: SPARK-1407
> URL: https://issues.apache.org/jira/browse/SPARK-1407
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Blocker
>
> When running on spark on yarn and accessing an HDFS file (like in the 
> SparkHdfsLR example) while using the event logging configured to write logs 
> to HDFS, it throws an exception at the end of the application. 
> SPARK_JAVA_OPTS=-Dspark.eventLog.enabled=true 
> -Dspark.eventLog.dir=hdfs:///history/spark/
> 14/04/03 13:41:31 INFO yarn.ApplicationMaster$$anon$1: Invoking sc stop from 
> shutdown hook
> Exception in thread "Thread-41" java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:398)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1465)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.sync(DFSOutputStream.java:1450)
> at 
> org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:116)
> at 
> org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:137)
> at 
> org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:137)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.util.FileLogger.flush(FileLogger.scala:137)
> at 
> org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:69)
> at 
> org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:101)
> at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$13.apply(SparkListenerBus.scala:67)
> at 
> org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$13.apply(SparkListenerBus.scala:67)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:67)
> at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:31)
> at 
> org.apache.spark.scheduler.LiveListenerBus.post(LiveListenerBus.scala:78)
> at 
> org.apache.spark.SparkContext.postApplicationEnd(SparkContext.scala:1081)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:828)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:460)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1348) Spark UI's do not bind to localhost interface anymore

2014-04-03 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959466#comment-13959466
 ] 

Kan Zhang commented on SPARK-1348:
--

JettyUtils.startJettyServer() used to bind to all interfaces; however, 
SPARK-1060 changed it to bind only to a specific interface (preferably a 
non-loopback address).

If you want to revert to the previous behavior, here's the patch.

https://github.com/apache/spark/pull/318
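
To illustrate the difference between the two binding behaviors (illustration 
only, not Spark's actual Jetty setup; the helper name is made up):

    import java.net.InetSocketAddress

    object BindSketch {
      // None binds the server socket to all interfaces (the old behavior);
      // Some(host) binds it to that specific interface only.
      def bindAddress(host: Option[String], port: Int): InetSocketAddress =
        host match {
          case Some(h) => new InetSocketAddress(h, port)
          case None    => new InetSocketAddress(port)
        }
    }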

> Spark UI's do not bind to localhost interface anymore
> -
>
> Key: SPARK-1348
> URL: https://issues.apache.org/jira/browse/SPARK-1348
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Priority: Blocker
> Fix For: 1.0.0
>
>
> When running the shell or standalone master, it no longer binds to localhost. 
> I think this may have been caused by the security patch. We should figure out 
> what caused it and revert to the old behavior. Maybe we want to always bind 
> to `localhost` or just to bind to all interfaces.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1118) Executor state shows as KILLED even the application is finished normally

2014-04-02 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958329#comment-13958329
 ] 

Kan Zhang commented on SPARK-1118:
--

I took a look by running SparkPi on my single-node cluster (laptop). There seem 
to be two issues.

1. All the work was done in the first executor. When the job is done, the 
driver asks the executor to shut down. However, this clean exit was assigned 
the FAILED executor state by the Worker. I introduced an EXITED executor state 
for executors that exit voluntarily (covering both normal and abnormal exits, 
depending on the exit code).

2. When the Master is notified that the first executor exited, it launches a 
second one, which is not needed and subsequently gets killed when the App 
disassociates. We could change the scheduler to tell the Master the job is done 
so that the Master wouldn't start the second executor. However, there is a race 
condition between the App telling the Master the job is done and the Worker 
telling the Master the first executor exited; there is no guarantee the former 
will happen before the latter. Instead, I chose to check the exit code when an 
executor exits. If the exit code is 0, I assume the executor has been asked to 
shut down by the driver, and the Master will not schedule new executors. This 
avoids launching the second executor, so no executor is killed in the Worker's 
log. However, it is still possible (although it didn't happen on my local 
cluster) for the first executor to get killed by the Master, if the Master 
detects the App disassociation event before the first executor exited. The 
order of these events can't be guaranteed since they come from different paths. 
If an executor does get killed, I favor leaving its state as KILLED, even 
though the App state may be FINISHED.

Here's the PR. Please let me know what else I can do.

https://github.com/apache/spark/pull/306
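
For illustration only (not the actual patch), a sketch of the exit-status 
handling described above; the object and method names below are made up:

    object ExecutorState extends Enumeration {
      val RUNNING, EXITED, KILLED, FAILED, LOST = Value
    }

    object ExitHandling {
      // The Worker reports a voluntary exit as EXITED rather than FAILED.
      def stateForVoluntaryExit: ExecutorState.Value = ExecutorState.EXITED

      // The Master schedules a replacement executor only for abnormal exits;
      // exit code 0 means the driver asked the executor to shut down.
      def shouldRelaunch(exitCode: Int): Boolean = exitCode != 0
    }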


> Executor state shows as KILLED even the application is finished normally
> 
>
> Key: SPARK-1118
> URL: https://issues.apache.org/jira/browse/SPARK-1118
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Nan Zhu
> Fix For: 1.0.0
>
>
> This seems weird, ExecutorState has no option of FINISHED, a terminated 
> executor can only be KILLED, FAILED, LOST



--
This message was sent by Atlassian JIRA
(v6.2#6252)