[jira] [Commented] (SPARK-7422) Add argmax to Vector, SparseVector

2015-05-12 Thread George Dittmar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539314#comment-14539314
 ] 

George Dittmar commented on SPARK-7422:
---

Finishing tests for this JIRA; a PR is inbound soon.

 Add argmax to Vector, SparseVector
 --

 Key: SPARK-7422
 URL: https://issues.apache.org/jira/browse/SPARK-7422
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor
  Labels: starter

 DenseVector has an argmax method which is currently private to Spark.  It 
 would be nice to add that method to Vector and SparseVector.  Adding it to 
 SparseVector would require being careful about handling the inactive elements 
 correctly and efficiently.
 We should make argmax public and add unit tests.
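
Purely as an illustration of the subtle case mentioned above (not the implementation any eventual PR adopted), a sketch of an argmax that respects the inactive entries of a sparse vector might look like this; the index/value layout follows MLlib's SparseVector, and the tricky input is a vector whose largest active value is negative while some entries are implicitly 0.0:

{code}
// Sketch only: argmax over a sparse vector without densifying it.
def sparseArgmax(size: Int, indices: Array[Int], values: Array[Double]): Int = {
  require(size > 0, "argmax is undefined for an empty vector")
  if (values.isEmpty) return 0                      // every entry is an implicit zero
  var maxIdx = indices(0)
  var maxVal = values(0)
  var i = 1
  while (i < values.length) {
    if (values(i) > maxVal) { maxVal = values(i); maxIdx = indices(i) }
    i += 1
  }
  // If the largest active value is negative and some entries are inactive,
  // the true maximum is 0.0 at the first index missing from `indices`.
  if (maxVal < 0.0 && values.length < size) {
    var expected = 0
    var j = 0
    while (j < indices.length && indices(j) == expected) { j += 1; expected += 1 }
    expected
  } else {
    maxIdx
  }
}
{code}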



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7423) spark.ml Classifier predict should not convert vectors to dense format

2015-05-12 Thread George Dittmar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539311#comment-14539311
 ] 

George Dittmar commented on SPARK-7423:
---

Will have a PR for this soon. I just made the changes for the linked JIRA in 
another branch and am finishing up the tests now.

 spark.ml Classifier predict should not convert vectors to dense format
 --

 Key: SPARK-7423
 URL: https://issues.apache.org/jira/browse/SPARK-7423
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor
  Labels: starter

 spark.ml.classification.ClassificationModel and 
 ProbabilisticClassificationModel both use DenseVector.argmax to implement 
 prediction (computing the prediction from the rawPrediction or probability 
 Vectors).  It would be best to implement argmax for Vector and SparseVector 
 and use Vector.argmax, rather than converting Vectors to dense format.
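
As a hedged sketch of the change being proposed (it assumes Vector.argmax has been made public as SPARK-7422 suggests; the helper name predictFromRaw is purely illustrative):

{code}
import org.apache.spark.mllib.linalg.Vector

// Before: rawPrediction.toDense.argmax densifies sparse vectors needlessly.
// After: call the (proposed) Vector.argmax on whatever representation we already have.
def predictFromRaw(rawPrediction: Vector): Double = rawPrediction.argmax.toDouble
{code}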



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7562) Improve error reporting for expression data type mismatch

2015-05-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7562:
---
Description: 
There is currently no error reporting for expression data types in analysis (we 
rely on resolved for that, which doesn't provide great error messages for 
types). It would be great to have that in checkAnalysis.

Ideally, it should be the responsibility of each Expression itself to specify 
the types it requires, and report errors that way. We would need to define a 
simple interface for that so each Expression can implement. The default 
implementation can just use the information provided by 
ExpectsInputTypes.expectedChildTypes. 



  was:
There is currently no error reporting for expression data types in analysis. It 
would be great to have that in checkAnalysis.

Ideally, it should be the responsibility of each Expression itself to specify 
the types it requires, and report errors that way. We would need to define a 
simple interface for that so each Expression can implement. The default 
implementation can just use the information provided by 
ExpectsInputTypes.expectedChildTypes. 




 Improve error reporting for expression data type mismatch
 -

 Key: SPARK-7562
 URL: https://issues.apache.org/jira/browse/SPARK-7562
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin

 There is currently no error reporting for expression data types in analysis 
 (we rely on resolved for that, which doesn't provide great error messages 
 for types). It would be great to have that in checkAnalysis.
 Ideally, it should be the responsibility of each Expression itself to specify 
 the types it requires, and report errors that way. We would need to define a 
 simple interface for that so each Expression can implement. The default 
 implementation can just use the information provided by 
 ExpectsInputTypes.expectedChildTypes. 
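
One possible shape for such an interface, sketched here for illustration only (the names TypeCheckResult and checkInputDataTypes are assumptions, not the Catalyst API of this era):

{code}
import org.apache.spark.sql.types.DataType

sealed trait TypeCheckResult
case object TypeCheckSuccess extends TypeCheckResult
case class TypeCheckFailure(message: String) extends TypeCheckResult

trait CheckableExpression {
  def children: Seq[CheckableExpression]
  def dataType: DataType
  // e.g. populated from ExpectsInputTypes.expectedChildTypes
  def expectedChildTypes: Seq[DataType]

  // Default check: each child must match its declared expected type;
  // checkAnalysis can surface the returned message directly to the user.
  def checkInputDataTypes(): TypeCheckResult = {
    val mismatches = children.zip(expectedChildTypes).collect {
      case (child, expected) if child.dataType != expected =>
        s"expected $expected but got ${child.dataType}"
    }
    if (mismatches.isEmpty) TypeCheckSuccess
    else TypeCheckFailure(mismatches.mkString("; "))
  }
}
{code}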



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7562) Improve error reporting for expression data type mismatch

2015-05-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7562:
---
Description: 
There is currently no error reporting for expression data types in analysis (we 
rely on resolved for that, which doesn't provide great error messages for 
types). It would be great to have that in checkAnalysis.

Ideally, it should be the responsibility of each Expression itself to specify 
the types it requires, and report errors that way. We would need to define a 
simple interface for that so each Expression can implement. The default 
implementation can just use the information provided by 
ExpectsInputTypes.expectedChildTypes. 

cc [~marmbrus] what we discussed offline today.

  was:
There is currently no error reporting for expression data types in analysis (we 
rely on resolved for that, which doesn't provide great error messages for 
types). It would be great to have that in checkAnalysis.

Ideally, it should be the responsibility of each Expression itself to specify 
the types it requires, and report errors that way. We would need to define a 
simple interface for that so each Expression can implement. The default 
implementation can just use the information provided by 
ExpectsInputTypes.expectedChildTypes. 




 Improve error reporting for expression data type mismatch
 -

 Key: SPARK-7562
 URL: https://issues.apache.org/jira/browse/SPARK-7562
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin

 There is currently no error reporting for expression data types in analysis 
 (we rely on resolved for that, which doesn't provide great error messages 
 for types). It would be great to have that in checkAnalysis.
 Ideally, it should be the responsibility of each Expression itself to specify 
 the types it requires, and report errors that way. We would need to define a 
 simple interface for that so each Expression can implement. The default 
 implementation can just use the information provided by 
 ExpectsInputTypes.expectedChildTypes. 
 cc [~marmbrus] what we discussed offline today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7500) DAG visualization: cluster name bleeds beyond the cluster

2015-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539501#comment-14539501
 ] 

Apache Spark commented on SPARK-7500:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/6076

 DAG visualization: cluster name bleeds beyond the cluster
 -

 Key: SPARK-7500
 URL: https://issues.apache.org/jira/browse/SPARK-7500
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Minor
 Attachments: long names.png


 This happens only for long names. See screenshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7500) DAG visualization: cluster name bleeds beyond the cluster

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7500:
---

Assignee: Apache Spark  (was: Andrew Or)

 DAG visualization: cluster name bleeds beyond the cluster
 -

 Key: SPARK-7500
 URL: https://issues.apache.org/jira/browse/SPARK-7500
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Andrew Or
Assignee: Apache Spark
Priority: Minor
 Attachments: long names.png


 This happens only for long names. See screenshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7556) User guide update for feature transformer: Binarizer

2015-05-12 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539382#comment-14539382
 ] 

Liang-Chi Hsieh commented on SPARK-7556:


OK.

 User guide update for feature transformer: Binarizer
 

 Key: SPARK-7556
 URL: https://issues.apache.org/jira/browse/SPARK-7556
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Liang-Chi Hsieh

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2018) Big-Endian (IBM Power7) Spark Serialization issue

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2018:
---

Assignee: (was: Apache Spark)

 Big-Endian (IBM Power7)  Spark Serialization issue
 --

 Key: SPARK-2018
 URL: https://issues.apache.org/jira/browse/SPARK-2018
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.0
 Environment: hardware : IBM Power7
 OS:Linux version 2.6.32-358.el6.ppc64 
 (mockbu...@ppc-017.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red 
 Hat 4.4.7-3) (GCC) ) #1 SMP Tue Jan 29 11:43:27 EST 2013
 JDK: Java(TM) SE Runtime Environment (build pxp6470sr5-20130619_01(SR5))
 IBM J9 VM (build 2.6, JRE 1.7.0 Linux ppc64-64 Compressed References 
 20130617_152572 (JIT enabled, AOT enabled)
 Hadoop:Hadoop-0.2.3-CDH5.0
 Spark:Spark-1.0.0 or Spark-0.9.1
 spark-env.sh:
 export JAVA_HOME=/opt/ibm/java-ppc64-70/
 export SPARK_MASTER_IP=9.114.34.69
 export SPARK_WORKER_MEMORY=1m
 export SPARK_CLASSPATH=/home/test1/spark-1.0.0-bin-hadoop2/lib
 export  STANDALONE_SPARK_MASTER_HOST=9.114.34.69
 #export SPARK_JAVA_OPTS=' -Xdebug 
 -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n '
Reporter: Yanjie Gao

 We have an application running on Spark on a Power7 system, but we hit an 
 important serialization issue. The HdfsWordCount example reproduces the problem:
 ./bin/run-example org.apache.spark.examples.streaming.HdfsWordCount localdir
 We used Power7 (a Big-Endian arch) and Red Hat 6.4. Big-Endian is the main cause, 
 since the example ran successfully in another Power-based Little-Endian setup.
 Here is the exception stack and log:
 Spark Executor Command: /opt/ibm/java-ppc64-70//bin/java -cp 
 /home/test1/spark-1.0.0-bin-hadoop2/lib::/home/test1/src/spark-1.0.0-bin-hadoop2/conf:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/
  -XX:MaxPermSize=128m  -Xdebug 
 -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n -Xms512M 
 -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend 
 akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler 2 
 p7hvs7br16 4 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker 
 app-20140604023054-
 
 14/06/04 02:31:20 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 14/06/04 02:31:21 INFO spark.SecurityManager: Changing view acls to: 
 test1,yifeng
 14/06/04 02:31:21 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(test1, yifeng)
 14/06/04 02:31:22 INFO slf4j.Slf4jLogger: Slf4jLogger started
 14/06/04 02:31:22 INFO Remoting: Starting remoting
 14/06/04 02:31:22 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkExecutor@p7hvs7br16:39658]
 14/06/04 02:31:22 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://sparkExecutor@p7hvs7br16:39658]
 14/06/04 02:31:22 INFO executor.CoarseGrainedExecutorBackend: Connecting to 
 driver: akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler
 14/06/04 02:31:22 INFO worker.WorkerWatcher: Connecting to worker 
 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker
 14/06/04 02:31:23 INFO worker.WorkerWatcher: Successfully connected to 
 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker
 14/06/04 02:31:24 INFO executor.CoarseGrainedExecutorBackend: Successfully 
 registered with driver
 14/06/04 02:31:24 INFO spark.SecurityManager: Changing view acls to: 
 test1,yifeng
 14/06/04 02:31:24 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(test1, yifeng)
 14/06/04 02:31:24 INFO slf4j.Slf4jLogger: Slf4jLogger started
 14/06/04 02:31:24 INFO Remoting: Starting remoting
 14/06/04 02:31:24 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://spark@p7hvs7br16:58990]
 14/06/04 02:31:24 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://spark@p7hvs7br16:58990]
 14/06/04 02:31:24 INFO spark.SparkEnv: Connecting to MapOutputTracker: 
 akka.tcp://spark@9.186.105.141:60253/user/MapOutputTracker
 14/06/04 02:31:25 INFO spark.SparkEnv: Connecting to BlockManagerMaster: 
 akka.tcp://spark@9.186.105.141:60253/user/BlockManagerMaster
 14/06/04 02:31:25 INFO storage.DiskBlockManager: Created local directory at 
 /tmp/spark-local-20140604023125-3f61
 14/06/04 

[jira] [Commented] (SPARK-2018) Big-Endian (IBM Power7) Spark Serialization issue

2015-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539529#comment-14539529
 ] 

Apache Spark commented on SPARK-2018:
-

User 'tellison' has created a pull request for this issue:
https://github.com/apache/spark/pull/6077

 Big-Endian (IBM Power7)  Spark Serialization issue
 --

 Key: SPARK-2018
 URL: https://issues.apache.org/jira/browse/SPARK-2018
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.0
 Environment: hardware : IBM Power7
 OS:Linux version 2.6.32-358.el6.ppc64 
 (mockbu...@ppc-017.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red 
 Hat 4.4.7-3) (GCC) ) #1 SMP Tue Jan 29 11:43:27 EST 2013
 JDK: Java(TM) SE Runtime Environment (build pxp6470sr5-20130619_01(SR5))
 IBM J9 VM (build 2.6, JRE 1.7.0 Linux ppc64-64 Compressed References 
 20130617_152572 (JIT enabled, AOT enabled)
 Hadoop:Hadoop-0.2.3-CDH5.0
 Spark:Spark-1.0.0 or Spark-0.9.1
 spark-env.sh:
 export JAVA_HOME=/opt/ibm/java-ppc64-70/
 export SPARK_MASTER_IP=9.114.34.69
 export SPARK_WORKER_MEMORY=1m
 export SPARK_CLASSPATH=/home/test1/spark-1.0.0-bin-hadoop2/lib
 export  STANDALONE_SPARK_MASTER_HOST=9.114.34.69
 #export SPARK_JAVA_OPTS=' -Xdebug 
 -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n '
Reporter: Yanjie Gao

 We have an application running on Spark on a Power7 system, but we hit an 
 important serialization issue. The HdfsWordCount example reproduces the problem:
 ./bin/run-example org.apache.spark.examples.streaming.HdfsWordCount localdir
 We used Power7 (a Big-Endian arch) and Red Hat 6.4. Big-Endian is the main cause, 
 since the example ran successfully in another Power-based Little-Endian setup.
 Here is the exception stack and log:
 Spark Executor Command: /opt/ibm/java-ppc64-70//bin/java -cp 
 /home/test1/spark-1.0.0-bin-hadoop2/lib::/home/test1/src/spark-1.0.0-bin-hadoop2/conf:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/
  -XX:MaxPermSize=128m  -Xdebug 
 -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n -Xms512M 
 -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend 
 akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler 2 
 p7hvs7br16 4 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker 
 app-20140604023054-
 
 14/06/04 02:31:20 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 14/06/04 02:31:21 INFO spark.SecurityManager: Changing view acls to: 
 test1,yifeng
 14/06/04 02:31:21 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(test1, yifeng)
 14/06/04 02:31:22 INFO slf4j.Slf4jLogger: Slf4jLogger started
 14/06/04 02:31:22 INFO Remoting: Starting remoting
 14/06/04 02:31:22 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkExecutor@p7hvs7br16:39658]
 14/06/04 02:31:22 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://sparkExecutor@p7hvs7br16:39658]
 14/06/04 02:31:22 INFO executor.CoarseGrainedExecutorBackend: Connecting to 
 driver: akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler
 14/06/04 02:31:22 INFO worker.WorkerWatcher: Connecting to worker 
 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker
 14/06/04 02:31:23 INFO worker.WorkerWatcher: Successfully connected to 
 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker
 14/06/04 02:31:24 INFO executor.CoarseGrainedExecutorBackend: Successfully 
 registered with driver
 14/06/04 02:31:24 INFO spark.SecurityManager: Changing view acls to: 
 test1,yifeng
 14/06/04 02:31:24 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(test1, yifeng)
 14/06/04 02:31:24 INFO slf4j.Slf4jLogger: Slf4jLogger started
 14/06/04 02:31:24 INFO Remoting: Starting remoting
 14/06/04 02:31:24 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://spark@p7hvs7br16:58990]
 14/06/04 02:31:24 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://spark@p7hvs7br16:58990]
 14/06/04 02:31:24 INFO spark.SparkEnv: Connecting to MapOutputTracker: 
 akka.tcp://spark@9.186.105.141:60253/user/MapOutputTracker
 14/06/04 02:31:25 INFO spark.SparkEnv: Connecting to BlockManagerMaster: 
 akka.tcp://spark@9.186.105.141:60253/user/BlockManagerMaster
 14/06/04 

[jira] [Assigned] (SPARK-2018) Big-Endian (IBM Power7) Spark Serialization issue

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2018:
---

Assignee: Apache Spark

 Big-Endian (IBM Power7)  Spark Serialization issue
 --

 Key: SPARK-2018
 URL: https://issues.apache.org/jira/browse/SPARK-2018
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.0
 Environment: hardware : IBM Power7
 OS:Linux version 2.6.32-358.el6.ppc64 
 (mockbu...@ppc-017.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red 
 Hat 4.4.7-3) (GCC) ) #1 SMP Tue Jan 29 11:43:27 EST 2013
 JDK: Java(TM) SE Runtime Environment (build pxp6470sr5-20130619_01(SR5))
 IBM J9 VM (build 2.6, JRE 1.7.0 Linux ppc64-64 Compressed References 
 20130617_152572 (JIT enabled, AOT enabled)
 Hadoop:Hadoop-0.2.3-CDH5.0
 Spark:Spark-1.0.0 or Spark-0.9.1
 spark-env.sh:
 export JAVA_HOME=/opt/ibm/java-ppc64-70/
 export SPARK_MASTER_IP=9.114.34.69
 export SPARK_WORKER_MEMORY=1m
 export SPARK_CLASSPATH=/home/test1/spark-1.0.0-bin-hadoop2/lib
 export  STANDALONE_SPARK_MASTER_HOST=9.114.34.69
 #export SPARK_JAVA_OPTS=' -Xdebug 
 -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n '
Reporter: Yanjie Gao
Assignee: Apache Spark

 We have an application running on Spark on a Power7 system, but we hit an 
 important serialization issue. The HdfsWordCount example reproduces the problem:
 ./bin/run-example org.apache.spark.examples.streaming.HdfsWordCount localdir
 We used Power7 (a Big-Endian arch) and Red Hat 6.4. Big-Endian is the main cause, 
 since the example ran successfully in another Power-based Little-Endian setup.
 Here is the exception stack and log:
 Spark Executor Command: /opt/ibm/java-ppc64-70//bin/java -cp 
 /home/test1/spark-1.0.0-bin-hadoop2/lib::/home/test1/src/spark-1.0.0-bin-hadoop2/conf:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/
  -XX:MaxPermSize=128m  -Xdebug 
 -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n -Xms512M 
 -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend 
 akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler 2 
 p7hvs7br16 4 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker 
 app-20140604023054-
 
 14/06/04 02:31:20 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 14/06/04 02:31:21 INFO spark.SecurityManager: Changing view acls to: 
 test1,yifeng
 14/06/04 02:31:21 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(test1, yifeng)
 14/06/04 02:31:22 INFO slf4j.Slf4jLogger: Slf4jLogger started
 14/06/04 02:31:22 INFO Remoting: Starting remoting
 14/06/04 02:31:22 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkExecutor@p7hvs7br16:39658]
 14/06/04 02:31:22 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://sparkExecutor@p7hvs7br16:39658]
 14/06/04 02:31:22 INFO executor.CoarseGrainedExecutorBackend: Connecting to 
 driver: akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler
 14/06/04 02:31:22 INFO worker.WorkerWatcher: Connecting to worker 
 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker
 14/06/04 02:31:23 INFO worker.WorkerWatcher: Successfully connected to 
 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker
 14/06/04 02:31:24 INFO executor.CoarseGrainedExecutorBackend: Successfully 
 registered with driver
 14/06/04 02:31:24 INFO spark.SecurityManager: Changing view acls to: 
 test1,yifeng
 14/06/04 02:31:24 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(test1, yifeng)
 14/06/04 02:31:24 INFO slf4j.Slf4jLogger: Slf4jLogger started
 14/06/04 02:31:24 INFO Remoting: Starting remoting
 14/06/04 02:31:24 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://spark@p7hvs7br16:58990]
 14/06/04 02:31:24 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://spark@p7hvs7br16:58990]
 14/06/04 02:31:24 INFO spark.SparkEnv: Connecting to MapOutputTracker: 
 akka.tcp://spark@9.186.105.141:60253/user/MapOutputTracker
 14/06/04 02:31:25 INFO spark.SparkEnv: Connecting to BlockManagerMaster: 
 akka.tcp://spark@9.186.105.141:60253/user/BlockManagerMaster
 14/06/04 02:31:25 INFO storage.DiskBlockManager: Created local directory at 
 

[jira] [Assigned] (SPARK-7551) Don't split by dot if within backticks for DataFrame attribute resolution

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7551:
---

Assignee: Apache Spark  (was: Wenchen Fan)

 Don't split by dot if within backticks for DataFrame attribute resolution
 -

 Key: SPARK-7551
 URL: https://issues.apache.org/jira/browse/SPARK-7551
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark
Priority: Critical

 DataFrame's resolve:
 {code}
   protected[sql] def resolve(colName: String): NamedExpression = {
     queryExecution.analyzed.resolve(colName.split("\\."),
       sqlContext.analyzer.resolver).getOrElse {
       throw new AnalysisException(
         s"Cannot resolve column name $colName among (${schema.fieldNames.mkString(", ")})")
     }
   }
 {code}
 We should not split the parts quoted by backticks (`).
 For example, `ab.cd`.`efg` should be split into two parts ab.cd and efg. 
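
A minimal sketch of a backtick-aware split that produces the behaviour described above (illustrative only, not the parser the eventual PR added):

{code}
import scala.collection.mutable.ArrayBuffer

def splitAttributeName(name: String): Seq[String] = {
  val parts = ArrayBuffer.empty[String]
  val current = new StringBuilder
  var inBackticks = false
  name.foreach {
    case '`' => inBackticks = !inBackticks      // toggle quoting; drop the backtick itself
    case '.' if !inBackticks =>                 // only unquoted dots separate parts
      parts += current.toString(); current.clear()
    case c => current += c
  }
  parts += current.toString()
  parts
}

// splitAttributeName("`ab.cd`.`efg`") == Seq("ab.cd", "efg")
// splitAttributeName("a.b.c")         == Seq("a", "b", "c")
{code}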



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7551) Don't split by dot if within backticks for DataFrame attribute resolution

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7551:
---

Assignee: Wenchen Fan  (was: Apache Spark)

 Don't split by dot if within backticks for DataFrame attribute resolution
 -

 Key: SPARK-7551
 URL: https://issues.apache.org/jira/browse/SPARK-7551
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Wenchen Fan
Priority: Critical

 DataFrame's resolve:
 {code}
   protected[sql] def resolve(colName: String): NamedExpression = {
     queryExecution.analyzed.resolve(colName.split("\\."),
       sqlContext.analyzer.resolver).getOrElse {
       throw new AnalysisException(
         s"Cannot resolve column name $colName among (${schema.fieldNames.mkString(", ")})")
     }
   }
 {code}
 We should not split the parts quoted by backticks (`).
 For example, `ab.cd`.`efg` should be split into two parts ab.cd and efg. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7551) Don't split by dot if within backticks for DataFrame attribute resolution

2015-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539385#comment-14539385
 ] 

Apache Spark commented on SPARK-7551:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/6074

 Don't split by dot if within backticks for DataFrame attribute resolution
 -

 Key: SPARK-7551
 URL: https://issues.apache.org/jira/browse/SPARK-7551
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Wenchen Fan
Priority: Critical

 DataFrame's resolve:
 {code}
   protected[sql] def resolve(colName: String): NamedExpression = {
     queryExecution.analyzed.resolve(colName.split("\\."),
       sqlContext.analyzer.resolver).getOrElse {
       throw new AnalysisException(
         s"Cannot resolve column name $colName among (${schema.fieldNames.mkString(", ")})")
     }
   }
 {code}
 We should not split the parts quoted by backticks (`).
 For example, `ab.cd`.`efg` should be split into two parts ab.cd and efg. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7561) Install Junit Attachment Plugin on Jenkins

2015-05-12 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-7561:
--

 Summary: Install Junit Attachment Plugin on Jenkins
 Key: SPARK-7561
 URL: https://issues.apache.org/jira/browse/SPARK-7561
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: shane knapp


As part of SPARK-7560 I'd like to just attach the test output file to the 
Jenkins build. This is nicer than requiring someone have an SSH login to the 
master node.

Currently we gzip the logs, copy them to the master, and then delete them on the 
worker.
https://github.com/apache/spark/blob/master/dev/run-tests-jenkins#L132

Instead I think we can just gzip them and then have the attachment plugin add 
them to the build. But it would require installing this plug-in to see if we 
can get it working.

[~shaneknapp] not sure how willing you are to install plug-ins on Jenkins, but 
this one would be awesome if it's doable and we can get it working.

https://wiki.jenkins-ci.org/display/JENKINS/JUnit+Attachments+Plugin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7562) Improve error reporting for expression data type mismatch

2015-05-12 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-7562:
--

 Summary: Improve error reporting for expression data type mismatch
 Key: SPARK-7562
 URL: https://issues.apache.org/jira/browse/SPARK-7562
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin


There is currently no error reporting for expression data types in analysis. It 
would be great to have that in checkAnalysis.

Ideally, it should be the responsibility of each Expression itself to specify 
the types it requires, and report errors that way. We would need to define a 
simple interface for that so each Expression can implement. The default 
implementation can just use the information provided by 
ExpectsInputTypes.expectedChildTypes. 





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7548) Add explode expression

2015-05-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539460#comment-14539460
 ] 

Reynold Xin commented on SPARK-7548:


cc [~cloud_fan] if you have time to do this today, try to take it over. 

 Add explode expression
 --

 Key: SPARK-7548
 URL: https://issues.apache.org/jira/browse/SPARK-7548
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Michael Armbrust
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7534) Fix the Stage table when a stage is missing

2015-05-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-7534.

   Resolution: Pending Closed
Fix Version/s: 1.4.0
 Assignee: Shixiong Zhu

 Fix the Stage table when a stage is missing
 ---

 Key: SPARK-7534
 URL: https://issues.apache.org/jira/browse/SPARK-7534
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
Priority: Minor
 Fix For: 1.4.0


 Just improved the Stage table when a stage is missing.
 Please see the screenshots in https://github.com/apache/spark/pull/6061



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7467) DAG visualization: handle checkpoint correctly

2015-05-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-7467.

   Resolution: Fixed
Fix Version/s: 1.4.0

 DAG visualization: handle checkpoint correctly
 --

 Key: SPARK-7467
 URL: https://issues.apache.org/jira/browse/SPARK-7467
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Andrew Or
Assignee: Andrew Or
 Fix For: 1.4.0


 We need to wrap RDD#doCheckpoint in a scope. Otherwise CheckpointRDDs may 
 belong to other operators.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7558) Log test name when starting and finishing each test

2015-05-12 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-7558:
--

 Summary: Log test name when starting and finishing each test
 Key: SPARK-7558
 URL: https://issues.apache.org/jira/browse/SPARK-7558
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Reporter: Patrick Wendell
Assignee: Andrew Or


Right now it's really tough to interpret testing output because logs for 
different tests are interspersed in the same unit-tests.log file. This makes it 
particularly hard to diagnose flaky tests. This would be much easier if we 
logged the test name before and after every test (e.g. Starting test XX, 
Finished test XX). Then you could get right to the logs.

I think one way to do this might be to create a custom test fixture that logs 
the test class name and then mix that into every test suite /cc [~joshrosen] 
for his superb knowledge of Scalatest.
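
One way this could look, sketched under the assumption that ScalaTest's withFixture hook is used (the class name LoggingFunSuite is hypothetical, not an existing Spark class):

{code}
import org.scalatest.{FunSuite, Outcome}

// Sketch: a base suite that brackets every test's output with its name, so entries
// in unit-tests.log can be attributed to individual tests.
abstract class LoggingFunSuite extends FunSuite {
  protected override def withFixture(test: NoArgTest): Outcome = {
    val id = s"${getClass.getSimpleName}: '${test.name}'"
    try {
      println(s"===== Starting test $id =====")
      super.withFixture(test)                   // run the actual test body
    } finally {
      println(s"===== Finished test $id =====")
    }
  }
}
{code}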



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7559) Bucketizer should include the right most boundary in the last bucket.

2015-05-12 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-7559:


 Summary: Bucketizer should include the right most boundary in the 
last bucket.
 Key: SPARK-7559
 URL: https://issues.apache.org/jira/browse/SPARK-7559
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor


Now we use special treatment for +inf.  This could be simplified by including 
the largest split value in the last bucket. E.g., (x1, x2, x3) defines buckets 
[x1, x2) and [x2, x3]. This shouldn't affect user code much, and there are 
applications that need to include the right-most value. For example, we can 
bucketize ratings from 0 to 10 to bad, neutral, and good with splits 0, 4, 6, 
10.
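
A small sketch of the proposed semantics (illustrative only, not the Bucketizer implementation): every bucket is left-closed/right-open except the last, which also contains the right-most split value.

{code}
// splits (x1, ..., xn) define n-1 buckets; returns the bucket index for x.
def bucketIndex(splits: Array[Double], x: Double): Int = {
  require(splits.length >= 2, "at least two split points are needed")
  require(x >= splits.head && x <= splits.last, s"value $x is outside the splits range")
  if (x == splits.last) splits.length - 2       // right-most value falls in the last bucket
  else splits.indexWhere(_ > x) - 1             // otherwise: bucket before the first split > x
}

// Example from the description, with splits 0, 4, 6, 10 on ratings 0..10:
// bucketIndex(Array(0.0, 4.0, 6.0, 10.0), 3.0)  == 0   // "bad"
// bucketIndex(Array(0.0, 4.0, 6.0, 10.0), 5.0)  == 1   // "neutral"
// bucketIndex(Array(0.0, 4.0, 6.0, 10.0), 10.0) == 2   // "good" (boundary included)
{code}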



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7548) Add explode expression

2015-05-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539460#comment-14539460
 ] 

Reynold Xin edited comment on SPARK-7548 at 5/12/15 7:48 AM:
-

cc [~cloud_fan] if you have time to do this today, try to take it over. 
Basically creating an explode function in functions.scala and functions.py.



was (Author: rxin):
cc [~cloud_fan] if you have time to do this today, try to take it over. 

 Add explode expression
 --

 Key: SPARK-7548
 URL: https://issues.apache.org/jira/browse/SPARK-7548
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Michael Armbrust
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7560) Make flaky tests easier to debug

2015-05-12 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-7560:
--

 Summary: Make flaky tests easier to debug
 Key: SPARK-7560
 URL: https://issues.apache.org/jira/browse/SPARK-7560
 Project: Spark
  Issue Type: New Feature
  Components: Project Infra, Tests
Reporter: Patrick Wendell


Right now it's really hard for people to even get the logs from a flaky test. 
Once you get the logs, it's very difficult to figure out which logs are 
associated with which tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4128) Create instructions on fully building Spark in Intellij

2015-05-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539480#comment-14539480
 ] 

Sean Owen commented on SPARK-4128:
--

It's still there... 
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ
The previous text was just outdated.

 Create instructions on fully building Spark in Intellij
 ---

 Key: SPARK-4128
 URL: https://issues.apache.org/jira/browse/SPARK-4128
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.2.0


 With some of our more complicated modules, I'm not sure whether Intellij 
 correctly understands all source locations. Also, we might require specifying 
 some profiles for the build to work directly. We should document clearly how 
 to start with vanilla Spark master and get the entire thing building in 
 Intellij.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7563) OutputCommitCoordinator.stop() should only be executed in driver

2015-05-12 Thread Hailong Wen (JIRA)
Hailong Wen created SPARK-7563:
--

 Summary: OutputCommitCoordinator.stop() should only be executed in 
driver
 Key: SPARK-7563
 URL: https://issues.apache.org/jira/browse/SPARK-7563
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1
 Environment: Red Hat Enterprise Linux Server release 7.0 (Maipo)
Spark 1.3.1 Release
Reporter: Hailong Wen


I am from the IBM Platform Symphony team, and we are integrating Spark 1.3.1 with 
EGO (a resource management product).

In EGO we use a fine-grained dynamic allocation policy, and each Executor exits 
after its tasks are all done. When testing *spark-shell*, we find that when an 
executor of the first job exits, it stops the OutputCommitCoordinator, which 
results in all future jobs failing. Details are as follows:

We got the following error in the executor when submitting a job in *spark-shell* 
the second time (the first job submission was successful):
{noformat}
15/05/11 04:02:31 INFO spark.util.AkkaUtils: Connecting to 
OutputCommitCoordinator: 
akka.tcp://sparkDriver@whlspark01:50452/user/OutputCommitCoordinator
Exception in thread main akka.actor.ActorNotFound: Actor not found for: 
ActorSelection[Anchor(akka.tcp://sparkDriver@whlspark01:50452/), 
Path(/user/OutputCommitCoordinator)]
at 
akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
at 
akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at 
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
at 
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at 
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at 
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at 
scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at 
akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
at 
akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
at 
akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267)
at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:89)
at 
akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{noformat}

And on the driver side, we see a log message indicating that the 
OutputCommitCoordinator was stopped after the first submission:
{noformat}
15/05/11 04:01:23 INFO 
spark.scheduler.OutputCommitCoordinator$OutputCommitCoordinatorActor: 
OutputCommitCoordinator stopped!
{noformat}

We examined the code of OutputCommitCoordinator and found that an executor 
reuses the ref of the driver's OutputCommitCoordinatorActor. So when an executor 
exits, it eventually calls SparkEnv.stop():
{noformat}
  private[spark] def stop() {
isStopped = true
pythonWorkers.foreach { case (key, worker) => worker.stop() }
Option(httpFileServer).foreach(_.stop())
mapOutputTracker.stop()
shuffleManager.stop()
broadcastManager.stop()
blockManager.stop()
blockManager.master.stop()
metricsSystem.stop()
outputCommitCoordinator.stop()   <--- 
actorSystem.shutdown()
..
{noformat} 

and in OutputCommitCoordinator.stop():
{noformat}
  def stop(): Unit = synchronized {
coordinatorActor.foreach(_ ! StopCoordinator)
coordinatorActor = None
authorizedCommittersByStage.clear()
  }
{noformat}
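
A minimal sketch of the kind of guard described in the workaround below, assuming a hypothetical isDriver flag (this is not the actual patch): only the driver shuts the coordinator down, so an early-exiting executor cannot stop the actor that later jobs still depend on.

{code}
case object StopCoordinator

class OutputCommitCoordinatorSketch(isDriver: Boolean) {
  private var coordinatorActor: Option[akka.actor.ActorRef] = None

  def stop(): Unit = synchronized {
    if (isDriver) {                             // executors skip the shutdown entirely
      coordinatorActor.foreach(_ ! StopCoordinator)
      coordinatorActor = None
    }
  }
}
{code}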

We now work around this problem by adding an attribute isDriver in 
OutputCommitCoordinator and 

[jira] [Commented] (SPARK-7562) Improve error reporting for expression data type mismatch

2015-05-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539503#comment-14539503
 ] 

Reynold Xin commented on SPARK-7562:


This is related to https://issues.apache.org/jira/browse/SPARK-6444

and also there is one past attempt at this problem: 
https://github.com/apache/spark/pull/4685

Pull request #4685 only marks expressions as unresolved but doesn't report any 
error to users (e.g. we should explain why 1 + date is invalid).

cc [~kai-zeng]



 Improve error reporting for expression data type mismatch
 -

 Key: SPARK-7562
 URL: https://issues.apache.org/jira/browse/SPARK-7562
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin

 There is currently no error reporting for expression data types in analysis 
 (we rely on resolved for that, which doesn't provide great error messages 
 for types). It would be great to have that in checkAnalysis.
 Ideally, it should be the responsibility of each Expression itself to specify 
 the types it requires, and report errors that way. We would need to define a 
 simple interface for that so each Expression can implement. The default 
 implementation can just use the information provided by 
 ExpectsInputTypes.expectedChildTypes. 
 cc [~marmbrus] what we discussed offline today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7500) DAG visualization: cluster name bleeds beyond the cluster

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7500:
---

Assignee: Andrew Or  (was: Apache Spark)

 DAG visualization: cluster name bleeds beyond the cluster
 -

 Key: SPARK-7500
 URL: https://issues.apache.org/jira/browse/SPARK-7500
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Minor
 Attachments: long names.png


 This happens only for long names. See screenshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7485) Remove python artifacts from the assembly jar

2015-05-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-7485.

Resolution: Fixed

 Remove python artifacts from the assembly jar
 -

 Key: SPARK-7485
 URL: https://issues.apache.org/jira/browse/SPARK-7485
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.4.0
Reporter: Thomas Graves
Assignee: Marcelo Vanzin
 Fix For: 1.4.0


 We change it so that we distributed the python files via a zip file in 
 SPARK-6869.  With that we should remove the python files from the assembly 
 jar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-7485) Remove python artifacts from the assembly jar

2015-05-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reopened SPARK-7485:
--

 Remove python artifacts from the assembly jar
 -

 Key: SPARK-7485
 URL: https://issues.apache.org/jira/browse/SPARK-7485
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.4.0
Reporter: Thomas Graves
Assignee: Marcelo Vanzin
 Fix For: 1.4.0


 We change it so that we distributed the python files via a zip file in 
 SPARK-6869.  With that we should remove the python files from the assembly 
 jar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7485) Remove python artifacts from the assembly jar

2015-05-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-7485.

   Resolution: Pending Closed
Fix Version/s: 1.4.0
 Assignee: Marcelo Vanzin

 Remove python artifacts from the assembly jar
 -

 Key: SPARK-7485
 URL: https://issues.apache.org/jira/browse/SPARK-7485
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.4.0
Reporter: Thomas Graves
Assignee: Marcelo Vanzin
 Fix For: 1.4.0


 We change it so that we distributed the python files via a zip file in 
 SPARK-6869.  With that we should remove the python files from the assembly 
 jar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7559) Bucketizer should include the right most boundary in the last bucket.

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7559:
---

Assignee: Apache Spark  (was: Xiangrui Meng)

 Bucketizer should include the right most boundary in the last bucket.
 -

 Key: SPARK-7559
 URL: https://issues.apache.org/jira/browse/SPARK-7559
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Apache Spark
Priority: Minor

 Now we use special treatment for +inf.  This could be simplified by including 
 the largest split value in the last bucket. E.g., (x1, x2, x3) defines 
 buckets [x1, x2) and [x2, x3]. This shouldn't affect user code much, and 
 there are applications that need to include the right-most value. For 
 example, we can bucketize ratings from 0 to 10 to bad, neutral, and good with 
 splits 0, 4, 6, 10.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7559) Bucketizer should include the right most boundary in the last bucket.

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7559:
---

Assignee: Xiangrui Meng  (was: Apache Spark)

 Bucketizer should include the right most boundary in the last bucket.
 -

 Key: SPARK-7559
 URL: https://issues.apache.org/jira/browse/SPARK-7559
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor

 Now we use special treatment for +inf.  This could be simplified by including 
 the largest split value in the last bucket. E.g., (x1, x2, x3) defines 
 buckets [x1, x2) and [x2, x3]. This shouldn't affect user code much, and 
 there are applications that need to include the right-most value. For 
 example, we can bucketize ratings from 0 to 10 to bad, neutral, and good with 
 splits 0, 4, 6, 10.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7559) Bucketizer should include the right most boundary in the last bucket.

2015-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539465#comment-14539465
 ] 

Apache Spark commented on SPARK-7559:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/6075

 Bucketizer should include the right most boundary in the last bucket.
 -

 Key: SPARK-7559
 URL: https://issues.apache.org/jira/browse/SPARK-7559
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor

 Now we use special treatment for +inf.  This could be simplified by including 
 the largest split value in the last bucket. E.g., (x1, x2, x3) defines 
 buckets [x1, x2) and [x2, x3]. This shouldn't affect user code much, and 
 there are applications that need to include the right-most value. For 
 example, we can bucketize ratings from 0 to 10 to bad, neutral, and good with 
 splits 0, 4, 6, 10.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7558) Log test name when starting and finishing each test

2015-05-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7558:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-7560

 Log test name when starting and finishing each test
 ---

 Key: SPARK-7558
 URL: https://issues.apache.org/jira/browse/SPARK-7558
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Reporter: Patrick Wendell
Assignee: Andrew Or

 Right now it's really tough to interpret testing output because logs for 
 different tests are interspersed in the same unit-tests.log file. This makes 
 it particularly hard to diagnose flaky tests. This would be much easier if we 
 logged the test name before and after every test (e.g. Starting test XX, 
 Finished test XX). Then you could get right to the logs.
 I think one way to do this might be to create a custom test fixture that logs 
 the test class name and then mix that into every test suite /cc [~joshrosen] 
 for his superb knowledge of Scalatest.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4128) Create instructions on fully building Spark in Intellij

2015-05-12 Thread Christian Kadner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540217#comment-14540217
 ] 

Christian Kadner edited comment on SPARK-4128 at 5/12/15 5:04 PM:
--

Hi Sean,

while there is still a section covering the IntelliJ setup, what is missing are 
these steps (or an updated version of them), which have to be taken in order to 
get a successful Make of the project. I needed to do some version of them for 
1.3.0, 1.3.1, and 1.4.0.

part of Patrick's deleted paragraph - start
...
At the top of the leftmost pane, make sure the Project/Packages selector 
is set to Packages.
Right click on any package and click “Open Module Settings” - you will be 
able to modify any of the modules here.
A few of the modules need to be modified slightly from the default import.
Add sources to the following modules: Under “Sources” tab add a source 
on the right. 
spark-hive: add v0.13.1/src/main/scala
spark-hive-thriftserver v0.13.1/src/main/scala
spark-repl: scala-2.10/src/main/scala and scala-2.10/src/test/scala
For spark-yarn click “Add content root” and navigate in the filesystem 
to yarn/common directory of Spark
...
part of Patrick's deleted paragraph - end


I suggest adding an updated version of that to the wiki, since some of the 
modules are set up in a way that similar non-obvious manual steps are required 
to make them compile.


was (Author: ckadner):
Hi Sean,

while there is still a section covering the IntelliJ setup, what is missing are 
these steps, or an updated version of it, which I had to do for 1.3.0, 1.3.1, 
1.4.0 in order to get a successfully Make of the project.

part of Patrick's deleted paragraph - start
...
At the top of the leftmost pane, make sure the Project/Packages selector 
is set to Packages.
Right click on any package and click “Open Module Settings” - you will be 
able to modify any of the modules here.
A few of the modules need to be modified slightly from the default import.
Add sources to the following modules: Under “Sources” tab add a source 
on the right. 
spark-hive: add v0.13.1/src/main/scala
spark-hive-thriftserver v0.13.1/src/main/scala
spark-repl: scala-2.10/src/main/scala and scala-2.10/src/test/scala
For spark-yarn click “Add content root” and navigate in the filesystem 
to yarn/common directory of Spark
...
part of Patrick's deleted paragraph - end


I suggest to add an updated version of that to the wiki, since some of the 
Modules are setup in a way that similar non-obvious manual steps are required 
to make them compile.

 Create instructions on fully building Spark in Intellij
 ---

 Key: SPARK-4128
 URL: https://issues.apache.org/jira/browse/SPARK-4128
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.2.0


 With some of our more complicated modules, I'm not sure whether Intellij 
 correctly understands all source locations. Also, we might require specifying 
 some profiles for the build to work directly. We should document clearly how 
 to start with vanilla Spark master and get the entire thing building in 
 Intellij.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6749) Make metastore client robust to underlying socket connection loss

2015-05-12 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6749:

Assignee: (was: Yin Huai)

 Make metastore client robust to underlying socket connection loss
 -

 Key: SPARK-6749
 URL: https://issues.apache.org/jira/browse/SPARK-6749
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Priority: Critical

 Right now, if the metastore gets restarted, we have to restart the driver to get a 
 new connection to the metastore client because the underlying socket 
 connection is gone. We should make the metastore client robust to this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7566) HiveContext.analyzer cannot be overridden

2015-05-12 Thread Santiago M. Mola (JIRA)
Santiago M. Mola created SPARK-7566:
---

 Summary: HiveContext.analyzer cannot be overridden
 Key: SPARK-7566
 URL: https://issues.apache.org/jira/browse/SPARK-7566
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Santiago M. Mola


Trying to override HiveContext.analyzer will give the following compilation 
error:

{code}
Error:(51, 36) overriding lazy value analyzer in class HiveContext of type 
org.apache.spark.sql.catalyst.analysis.Analyzer{val extendedResolutionRules: 
List[org.apache.spark.sql.catalyst.rules.Rule[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]]};
 lazy value analyzer has incompatible type
  override protected[sql] lazy val analyzer: Analyzer = {
   ^
{code}

That is because the type changed inadvertently when the return type 
declaration was omitted.
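A small, self-contained illustration of the problem (plain Scala, not Spark code): when 
the return type is omitted, Scala infers a structural refinement, so a subclass can no 
longer override the member with the plain base type.

{code}
class Thing

class Base {
  // With no declared type, Scala infers the refinement Thing{val extra: Int}, not plain Thing.
  lazy val thing = new Thing { val extra = 1 }
}

class Sub extends Base {
  // Fails with "lazy value thing has incompatible type", just like the error above:
  // override lazy val thing: Thing = new Thing
}

// Declaring the type explicitly in Base (lazy val thing: Thing = ...) makes the override compile.
{code}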



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6876) DataFrame.na.replace value support for Python

2015-05-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-6876.

   Resolution: Pending Closed
Fix Version/s: 1.4.0

 DataFrame.na.replace value support for Python
 -

 Key: SPARK-6876
 URL: https://issues.apache.org/jira/browse/SPARK-6876
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Adrian Wang
 Fix For: 1.4.0


 Scala/Java support is in. We should provide the Python version, similar to 
 what Pandas supports.
 http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html
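 For reference, a small sketch of the existing Scala calls that the Python version would 
 mirror (the column names here are made up for illustration):
 {code}
 import org.apache.spark.sql.DataFrame

 // Replace values in a string column and in a numeric column (already supported in Scala).
 def cleanUp(df: DataFrame): DataFrame = {
   df.na.replace("name", Map("UNKNOWN" -> "unnamed"))  // one column, string values
     .na.replace(Seq("height"), Map(1.0 -> 2.0))       // several columns, numeric values
 }
 {code}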



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5182) Partitioning support for tables created by the data source API

2015-05-12 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-5182.
---
   Resolution: Pending Closed
Fix Version/s: 1.4.0

Issue resolved by pull request 5526
[https://github.com/apache/spark/pull/5526]

 Partitioning support for tables created by the data source API
 --

 Key: SPARK-5182
 URL: https://issues.apache.org/jira/browse/SPARK-5182
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7380) Python: Transformer/Estimator should be copyable

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7380:
---

Assignee: Apache Spark  (was: Joseph K. Bradley)

 Python: Transformer/Estimator should be copyable
 

 Key: SPARK-7380
 URL: https://issues.apache.org/jira/browse/SPARK-7380
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Joseph K. Bradley
Assignee: Apache Spark

 Same as [SPARK-5956]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7380) Python: Transformer/Estimator should be copyable

2015-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540398#comment-14540398
 ] 

Apache Spark commented on SPARK-7380:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/6088

 Python: Transformer/Estimator should be copyable
 

 Key: SPARK-7380
 URL: https://issues.apache.org/jira/browse/SPARK-7380
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 Same as [SPARK-5956]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7380) Python: Transformer/Estimator should be copyable

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7380:
---

Assignee: Joseph K. Bradley  (was: Apache Spark)

 Python: Transformer/Estimator should be copyable
 

 Key: SPARK-7380
 URL: https://issues.apache.org/jira/browse/SPARK-7380
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 Same as [SPARK-5956]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6749) Make metastore client robust to underlying socket connection loss

2015-05-12 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6749:

Assignee: Yin Huai

 Make metastore client robust to underlying socket connection loss
 -

 Key: SPARK-6749
 URL: https://issues.apache.org/jira/browse/SPARK-6749
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Critical

 Right now, if the metastore gets restarted, we have to restart the driver to get a 
 new connection to the metastore client because the underlying socket 
 connection is gone. We should make the metastore client robust to this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6749) Make metastore client robust to underlying socket connection loss

2015-05-12 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6749:

Priority: Critical  (was: Major)

 Make metastore client robust to underlying socket connection loss
 -

 Key: SPARK-6749
 URL: https://issues.apache.org/jira/browse/SPARK-6749
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Priority: Critical

 Right now, if the metastore gets restarted, we have to restart the driver to get a 
 new connection to the metastore client because the underlying socket 
 connection is gone. We should make the metastore client robust to this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6980) Akka timeout exceptions indicate which conf controls them

2015-05-12 Thread Harsh Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540335#comment-14540335
 ] 

Harsh Gupta commented on SPARK-6980:


[~bryanc], can you update us on the progress so that we can share the workload?

 Akka timeout exceptions indicate which conf controls them
 -

 Key: SPARK-6980
 URL: https://issues.apache.org/jira/browse/SPARK-6980
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Imran Rashid
Assignee: Harsh Gupta
Priority: Minor
  Labels: starter
 Attachments: Spark-6980-Test.scala


 If you hit one of the akka timeouts, you just get an exception like
 {code}
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
 {code}
 The exception doesn't indicate how to change the timeout, though there is 
 usually (always?) a corresponding setting in {{SparkConf}}.  It would be 
 nice if the exception included the relevant setting.
 I think this should be pretty easy to do: we just need to create something 
 like a {{NamedTimeout}}.  It would have its own {{await}} method that catches the 
 Akka timeout and throws its own exception.  We should change 
 {{RpcUtils.askTimeout}} and {{RpcUtils.lookupTimeout}} to always give a 
 {{NamedTimeout}}, so we can be sure that anytime we have a timeout, we get a 
 better exception.
 Given the latest refactoring to the rpc layer, this needs to be done in both 
 {{AkkaUtils}} and {{AkkaRpcEndpoint}}.
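 A rough sketch of the idea (the names and shape below are assumptions, not a committed 
 API): the timeout remembers which configuration key controls it and mentions that key 
 when awaiting fails.
 {code}
 import java.util.concurrent.TimeoutException
 import scala.concurrent.{Await, Awaitable}
 import scala.concurrent.duration.FiniteDuration

 // Illustrative only: a timeout that knows which SparkConf key produced it.
 case class NamedTimeout(duration: FiniteDuration, confKey: String) {
   def awaitResult[T](awaitable: Awaitable[T]): T =
     try {
       Await.result(awaitable, duration)
     } catch {
       case _: TimeoutException =>
         throw new TimeoutException(
           s"Futures timed out after [$duration]. This timeout is controlled by $confKey")
     }
 }

 // Usage sketch: RpcUtils.askTimeout would return a NamedTimeout carrying the
 // relevant configuration key instead of a bare duration.
 {code}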



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4128) Create instructions on fully building Spark in Intellij

2015-05-12 Thread Christian Kadner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540217#comment-14540217
 ] 

Christian Kadner commented on SPARK-4128:
-

Hi Sean,

while there is still a section covering the IntelliJ setup, what is missing are 
these steps, or an updated version of them, which I had to do for 1.3.0, 1.3.1, 
and 1.4.0 in order to get a successful Make of the project.

part of Patrick's deleted paragraph - start
...
At the top of the leftmost pane, make sure the Project/Packages selector 
is set to Packages.
Right click on any package and click “Open Module Settings” - you will be 
able to modify any of the modules here.
A few of the modules need to be modified slightly from the default import.
Add sources to the following modules: Under “Sources” tab add a source 
on the right. 
spark-hive: add v0.13.1/src/main/scala
spark-hive-thriftserver v0.13.1/src/main/scala
spark-repl: scala-2.10/src/main/scala and scala-2.10/src/test/scala
For spark-yarn click “Add content root” and navigate in the filesystem 
to yarn/common directory of Spark
part of Patrick's deleted paragraph - end


I suggest adding an updated version of that to the wiki, since some of the 
modules are set up in a way that similar non-obvious manual steps are required 
to make them compile.

 Create instructions on fully building Spark in Intellij
 ---

 Key: SPARK-4128
 URL: https://issues.apache.org/jira/browse/SPARK-4128
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.2.0


 With some of our more complicated modules, I'm not sure whether Intellij 
 correctly understands all source locations. Also, we might require specifying 
 some profiles for the build to work directly. We should document clearly how 
 to start with vanilla Spark master and get the entire thing building in 
 Intellij.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4128) Create instructions on fully building Spark in Intellij

2015-05-12 Thread Christian Kadner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540217#comment-14540217
 ] 

Christian Kadner edited comment on SPARK-4128 at 5/12/15 5:02 PM:
--

Hi Sean,

while there is still a section covering the IntelliJ setup, what is missing are 
these steps, or an updated version of them, which I had to do for 1.3.0, 1.3.1, 
and 1.4.0 in order to get a successful Make of the project.

part of Patrick's deleted paragraph - start
...
At the top of the leftmost pane, make sure the Project/Packages selector 
is set to Packages.
Right click on any package and click “Open Module Settings” - you will be 
able to modify any of the modules here.
A few of the modules need to be modified slightly from the default import.
Add sources to the following modules: Under “Sources” tab add a source 
on the right. 
spark-hive: add v0.13.1/src/main/scala
spark-hive-thriftserver v0.13.1/src/main/scala
spark-repl: scala-2.10/src/main/scala and scala-2.10/src/test/scala
For spark-yarn click “Add content root” and navigate in the filesystem 
to yarn/common directory of Spark
...
part of Patrick's deleted paragraph - end


I suggest adding an updated version of that to the wiki, since some of the 
modules are set up in a way that similar non-obvious manual steps are required 
to make them compile.


was (Author: ckadner):
Hi Sean,

while there is still a section covering the IntelliJ setup, what is missing are 
these steps, or an updated version of it, which I had to do for 1.3.0, 1.3.1, 
1.4.0 in order to get a successfully Make of the project.

part of Patrick's deleted paragraph - start
...
At the top of the leftmost pane, make sure the Project/Packages selector 
is set to Packages.
Right click on any package and click “Open Module Settings” - you will be 
able to modify any of the modules here.
A few of the modules need to be modified slightly from the default import.
Add sources to the following modules: Under “Sources” tab add a source 
on the right. 
spark-hive: add v0.13.1/src/main/scala
spark-hive-thriftserver v0.13.1/src/main/scala
spark-repl: scala-2.10/src/main/scala and scala-2.10/src/test/scala
For spark-yarn click “Add content root” and navigate in the filesystem 
to yarn/common directory of Spark
part of Patrick's deleted paragraph - end


I suggest to add an updated version of that to the wiki, since some of the 
Modules are setup in a way that similar non-obvious manual steps are required 
to make them compile.

 Create instructions on fully building Spark in Intellij
 ---

 Key: SPARK-4128
 URL: https://issues.apache.org/jira/browse/SPARK-4128
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.2.0


 With some of our more complicated modules, I'm not sure whether Intellij 
 correctly understands all source locations. Also, we might require specifying 
 some profiles for the build to work directly. We should document clearly how 
 to start with vanilla Spark master and get the entire thing building in 
 Intellij.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7525) Could not read data from write ahead log record when Receiver failed and WAL is stored in Tachyon

2015-05-12 Thread Dibyendu Bhattacharya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540325#comment-14540325
 ] 

Dibyendu Bhattacharya commented on SPARK-7525:
--

I guess this is something to do with the lack of Tachyon Append Support. 

java.lang.IllegalStateException: File exists and there is no append support!
at 
org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33)
at 
org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.org$apache$spark$streaming$util$FileBasedWriteAheadLogWriter$$stream$lzycompute(FileBasedWriteAheadLogWriter.scala:33)
at 
org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.org$apache$spark$streaming$util$FileBasedWriteAheadLogWriter$$stream(FileBasedWriteAheadLogWriter.scala:33)
at 
org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.init(FileBasedWriteAheadLogWriter.scala:41)
at 
org.apache.spark.streaming.util.FileBasedWriteAheadLog.getLogWriter(FileBasedWriteAheadLog.scala:194)
at 
org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:81)
at 
org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:44)
at 
org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler$$anonfun$5.apply(ReceivedBlockHandler.scala:178)
at 
org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler$$anonfun$5.apply(ReceivedBlockHandler.scala:178)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

 Could not read data from write ahead log record when Receiver failed and WAL 
 is stored in Tachyon
 -

 Key: SPARK-7525
 URL: https://issues.apache.org/jira/browse/SPARK-7525
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
 Environment: AWS , Spark Streaming 1.4 with Tachyon 0.6.4
Reporter: Dibyendu Bhattacharya

 I was testing the fault-tolerance aspect of Spark Streaming when the checkpoint 
 directory is stored in Tachyon. Spark Streaming is able to recover from 
 driver failure, but when the receiver failed, Spark Streaming was not able to read the 
 WAL files written by the failed receiver. Below is the exception when the receiver 
 fails.
 INFO : org.apache.spark.scheduler.DAGScheduler - Executor lost: 2 (epoch 1)
 INFO : org.apache.spark.storage.BlockManagerMasterEndpoint - Trying to remove 
 executor 2 from BlockManagerMaster.
 INFO : org.apache.spark.storage.BlockManagerMasterEndpoint - Removing block 
 manager BlockManagerId(2, 10.252.5.54, 45789)
 INFO : org.apache.spark.storage.BlockManagerMaster - Removed 2 successfully 
 in removeExecutor
 INFO : org.apache.spark.streaming.scheduler.ReceiverTracker - Registered 
 receiver for stream 2 from 10.252.5.62:47255
 WARN : org.apache.spark.scheduler.TaskSetManager - Lost task 2.1 in stage 
 103.0 (TID 421, 10.252.5.62): org.apache.spark.SparkException: Could not read 
 data from write ahead log record 
 FileBasedWriteAheadLogSegment(tachyon-ft://10.252.5.113:19998/tachyon/checkpoint/receivedData/2/log-1431341091711-1431341151711,645603894,10891919)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:144)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:168)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:168)
   at scala.Option.getOrElse(Option.scala:120)
   at 
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:168)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 

[jira] [Assigned] (SPARK-6258) Python MLlib API missing items: Clustering

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6258:
---

Assignee: Apache Spark

 Python MLlib API missing items: Clustering
 --

 Key: SPARK-6258
 URL: https://issues.apache.org/jira/browse/SPARK-6258
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Apache Spark

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 KMeans
 * setEpsilon
 * setInitializationSteps
 KMeansModel
 * computeCost
 * k
 GaussianMixture
 * setInitialModel
 GaussianMixtureModel
 * k
 Completely missing items which should be fixed in separate JIRAs (which have 
 been created and linked to the umbrella JIRA)
 * LDA
 * PowerIterationClustering
 * StreamingKMeans
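 For reference, a sketch of the Scala calls behind some of the KMeans/KMeansModel items 
 listed above (illustrative only; the input RDD is assumed):
 {code}
 import org.apache.spark.mllib.clustering.KMeans
 import org.apache.spark.mllib.linalg.Vector
 import org.apache.spark.rdd.RDD

 def clusterAndScore(data: RDD[Vector]): Double = {
   val model = new KMeans()
     .setK(3)
     .setEpsilon(1e-4)           // setEpsilon, missing from the Python wrapper
     .setInitializationSteps(5)  // setInitializationSteps, missing from the Python wrapper
     .run(data)
   println(s"k = ${model.k}")    // KMeansModel.k, missing from the Python wrapper
   model.computeCost(data)       // KMeansModel.computeCost, missing from the Python wrapper
 }
 {code}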



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6258) Python MLlib API missing items: Clustering

2015-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540268#comment-14540268
 ] 

Apache Spark commented on SPARK-6258:
-

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/6087

 Python MLlib API missing items: Clustering
 --

 Key: SPARK-6258
 URL: https://issues.apache.org/jira/browse/SPARK-6258
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 KMeans
 * setEpsilon
 * setInitializationSteps
 KMeansModel
 * computeCost
 * k
 GaussianMixture
 * setInitialModel
 GaussianMixtureModel
 * k
 Completely missing items which should be fixed in separate JIRAs (which have 
 been created and linked to the umbrella JIRA)
 * LDA
 * PowerIterationClustering
 * StreamingKMeans



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6258) Python MLlib API missing items: Clustering

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6258:
---

Assignee: (was: Apache Spark)

 Python MLlib API missing items: Clustering
 --

 Key: SPARK-6258
 URL: https://issues.apache.org/jira/browse/SPARK-6258
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 KMeans
 * setEpsilon
 * setInitializationSteps
 KMeansModel
 * computeCost
 * k
 GaussianMixture
 * setInitialModel
 GaussianMixtureModel
 * k
 Completely missing items which should be fixed in separate JIRAs (which have 
 been created and linked to the umbrella JIRA)
 * LDA
 * PowerIterationClustering
 * StreamingKMeans



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4128) Create instructions on fully building Spark in Intellij

2015-05-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540243#comment-14540243
 ] 

Sean Owen commented on SPARK-4128:
--

Some of this isn't correct, like the YARN bit. Some of this isn't applicable to 
all users, like those that don't need Hive. That's why they were removed as 
required setup.

 Create instructions on fully building Spark in Intellij
 ---

 Key: SPARK-4128
 URL: https://issues.apache.org/jira/browse/SPARK-4128
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.2.0


 With some of our more complicated modules, I'm not sure whether Intellij 
 correctly understands all source locations. Also, we might require specifying 
 some profiles for the build to work directly. We should document clearly how 
 to start with vanilla Spark master and get the entire thing building in 
 Intellij.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4128) Create instructions on fully building Spark in Intellij

2015-05-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540410#comment-14540410
 ] 

Sean Owen commented on SPARK-4128:
--

I don't think I had to do anything special to get Hive working (it's enabled 
for me). Are you certain that it doesn't recognize the source folder? The 
source should be in the place the build says it is, and IJ understands that. 
That said, there have been all kinds of wild glitches over time. If it is really 
required from a clean checkout / new project, well, yeah, that can be doc'ed, but 
I also want to fix it!

Yeah, the Scala 2.11/2.10 support is handled outside of any of the build scripts. 
It should work either way if you run the script to switch between them, but it 
certainly needs a reimport.

 Create instructions on fully building Spark in Intellij
 ---

 Key: SPARK-4128
 URL: https://issues.apache.org/jira/browse/SPARK-4128
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.2.0


 With some of our more complicated modules, I'm not sure whether Intellij 
 correctly understands all source locations. Also, we might require specifying 
 some profiles for the build to work directly. We should document clearly how 
 to start with vanilla Spark master and get the entire thing building in 
 Intellij.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7567) Migrating Parquet data source to FSBasedRelation

2015-05-12 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-7567:
-

 Summary: Migrating Parquet data source to FSBasedRelation
 Key: SPARK-7567
 URL: https://issues.apache.org/jira/browse/SPARK-7567
 Project: Spark
  Issue Type: Bug
Reporter: Cheng Lian
Priority: Blocker






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7569) Improve error for binary expressions

2015-05-12 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-7569:
---

 Summary: Improve error for binary expressions
 Key: SPARK-7569
 URL: https://issues.apache.org/jira/browse/SPARK-7569
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Critical


This is not a great error:
{code}
scala> Seq((1,1)).toDF("a", "b").select(lit(1) + new java.sql.Date(1)) 
org.apache.spark.sql.AnalysisException: invalid expression (1 + 0) between 
Literal 1, IntegerType and Literal 0, DateType;
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7567) Migrating Parquet data source to FSBasedRelation

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7567:
---

Assignee: Cheng Lian  (was: Apache Spark)

 Migrating Parquet data source to FSBasedRelation
 

 Key: SPARK-7567
 URL: https://issues.apache.org/jira/browse/SPARK-7567
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7568) ml.LogisticRegression doesn't output the right prediction

2015-05-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7568:
-
Description: 
`bin/spark-submit 
examples/src/main/python/ml/simple_text_classification_pipeline.py`

{code}
Row(id=4, text=u'spark i j k', words=[u'spark', u'i', u'j', u'k'], 
features=SparseVector(262144, {105: 1.0, 106: 1.0, 107: 1.0, 62173: 1.0}), 
rawPrediction=DenseVector([0.1629, -0.1629]), probability=DenseVector([0.5406, 
0.4594]), prediction=0.0)
Row(id=5, text=u'l m n', words=[u'l', u'm', u'n'], 
features=SparseVector(262144, {108: 1.0, 109: 1.0, 110: 1.0}), 
rawPrediction=DenseVector([2.6407, -2.6407]), probability=DenseVector([0.9334, 
0.0666]), prediction=0.0)
Row(id=6, text=u'mapreduce spark', words=[u'mapreduce', u'spark'], 
features=SparseVector(262144, {62173: 1.0, 140738: 1.0}), 
rawPrediction=DenseVector([1.2651, -1.2651]), probability=DenseVector([0.7799, 
0.2201]), prediction=0.0)
Row(id=7, text=u'apache hadoop', words=[u'apache', u'hadoop'], 
features=SparseVector(262144, {128334: 1.0, 134181: 1.0}), 
rawPrediction=DenseVector([3.7429, -3.7429]), probability=DenseVector([0.9769, 
0.0231]), prediction=0.0)
{code}

In Scala

{code}
$ bin/run-example ml.SimpleTextClassificationPipeline

(4, spark i j k) --> prob=[0.5406433544851436,0.45935664551485655], 
prediction=0.0
(5, l m n) --> prob=[0.9334382627383263,0.06656173726167364], prediction=0.0
(6, mapreduce spark) --> prob=[0.7799076868203896,0.22009231317961045], 
prediction=0.0
(7, apache hadoop) --> prob=[0.9768636139518304,0.023136386048169616], 
prediction=0.0
{code}

All predictions are 0, while some should be one based on the probability. It 
seems to be an issue with regularization.

  was:
`bin/spark-submit 
examples/src/main/python/ml/simple_text_classification_pipeline.py`

{code}
Row(id=4, text=u'spark i j k', words=[u'spark', u'i', u'j', u'k'], 
features=SparseVector(262144, {105: 1.0, 106: 1.0, 107: 1.0, 62173: 1.0}), 
rawPrediction=DenseVector([0.1629, -0.1629]), probability=DenseVector([0.5406, 
0.4594]), prediction=0.0)
Row(id=5, text=u'l m n', words=[u'l', u'm', u'n'], 
features=SparseVector(262144, {108: 1.0, 109: 1.0, 110: 1.0}), 
rawPrediction=DenseVector([2.6407, -2.6407]), probability=DenseVector([0.9334, 
0.0666]), prediction=0.0)
Row(id=6, text=u'mapreduce spark', words=[u'mapreduce', u'spark'], 
features=SparseVector(262144, {62173: 1.0, 140738: 1.0}), 
rawPrediction=DenseVector([1.2651, -1.2651]), probability=DenseVector([0.7799, 
0.2201]), prediction=0.0)
Row(id=7, text=u'apache hadoop', words=[u'apache', u'hadoop'], 
features=SparseVector(262144, {128334: 1.0, 134181: 1.0}), 
rawPrediction=DenseVector([3.7429, -3.7429]), probability=DenseVector([0.9769, 
0.0231]), prediction=0.0)
{code}

All predictions are 0, while some should be one based on the probability. It 
seems to be an issue with regularization.


 ml.LogisticRegression doesn't output the right prediction
 -

 Key: SPARK-7568
 URL: https://issues.apache.org/jira/browse/SPARK-7568
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: DB Tsai
Priority: Blocker

 `bin/spark-submit 
 examples/src/main/python/ml/simple_text_classification_pipeline.py`
 {code}
 Row(id=4, text=u'spark i j k', words=[u'spark', u'i', u'j', u'k'], 
 features=SparseVector(262144, {105: 1.0, 106: 1.0, 107: 1.0, 62173: 1.0}), 
 rawPrediction=DenseVector([0.1629, -0.1629]), 
 probability=DenseVector([0.5406, 0.4594]), prediction=0.0)
 Row(id=5, text=u'l m n', words=[u'l', u'm', u'n'], 
 features=SparseVector(262144, {108: 1.0, 109: 1.0, 110: 1.0}), 
 rawPrediction=DenseVector([2.6407, -2.6407]), 
 probability=DenseVector([0.9334, 0.0666]), prediction=0.0)
 Row(id=6, text=u'mapreduce spark', words=[u'mapreduce', u'spark'], 
 features=SparseVector(262144, {62173: 1.0, 140738: 1.0}), 
 rawPrediction=DenseVector([1.2651, -1.2651]), 
 probability=DenseVector([0.7799, 0.2201]), prediction=0.0)
 Row(id=7, text=u'apache hadoop', words=[u'apache', u'hadoop'], 
 features=SparseVector(262144, {128334: 1.0, 134181: 1.0}), 
 rawPrediction=DenseVector([3.7429, -3.7429]), 
 probability=DenseVector([0.9769, 0.0231]), prediction=0.0)
 {code}
 In Scala
 {code}
 $ bin/run-example ml.SimpleTextClassificationPipeline
 (4, spark i j k) --> prob=[0.5406433544851436,0.45935664551485655], 
 prediction=0.0
 (5, l m n) --> prob=[0.9334382627383263,0.06656173726167364], prediction=0.0
 (6, mapreduce spark) --> prob=[0.7799076868203896,0.22009231317961045], 
 prediction=0.0
 (7, apache hadoop) --> prob=[0.9768636139518304,0.023136386048169616], 
 prediction=0.0
 {code}
 All predictions are 0, while some should be one based on the probability. It 
 seems to be an issue with regularization.



--
This message 

[jira] [Commented] (SPARK-7561) Install Junit Attachment Plugin on Jenkins

2015-05-12 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540498#comment-14540498
 ] 

shane knapp commented on SPARK-7561:


it's installed, but i will need to restart jenkins one morning to activate the 
plugin. 

 Install Junit Attachment Plugin on Jenkins
 --

 Key: SPARK-7561
 URL: https://issues.apache.org/jira/browse/SPARK-7561
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: shane knapp

 As part of SPARK-7560 I'd like to just attach the test output file to the 
 Jenkins build. This is nicer than requiring someone have an SSH login to the 
 master node.
 Currently we gzip the logs, copy it to the master, and then delete them on 
 the worker.
 https://github.com/apache/spark/blob/master/dev/run-tests-jenkins#L132
 Instead I think we can just gzip them and then have the attachment plugin add 
 them to the build. But it would require installing this plug-in to see if we 
 can get it working.
 [~shaneknapp] not sure how willing you are to install plug-ins on Jenkins, 
 but this one would be awesome if it's doable and we can get it working.
 https://wiki.jenkins-ci.org/display/JENKINS/JUnit+Attachments+Plugin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4128) Create instructions on fully building Spark in Intellij

2015-05-12 Thread Christian Kadner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540502#comment-14540502
 ] 

Christian Kadner edited comment on SPARK-4128 at 5/12/15 7:11 PM:
--

Yes, I encountered these compile problems after a fresh import of the Spark 
1.3.0 and 1.3.1 project from download (.tgz) and 1.4 when loaded from a Git 
repository.

For Scala 2.10/2.11 support, I suppose either one should be chosen by default 
without having to run a script. Btw, that should be doc'd as well ;-)


was (Author: ckadner):
Yes, I encountered these compile problems after a fresh import of the Spark 1.4 
project both when downloaded (tar/zip) and when loaded from a Git repository.

For Scala 2.10/2.11 support, I suppose either one should be chosen by default 
without having to run a script. Btw, that should be doc'd as well ;-)

 Create instructions on fully building Spark in Intellij
 ---

 Key: SPARK-4128
 URL: https://issues.apache.org/jira/browse/SPARK-4128
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.2.0


 With some of our more complicated modules, I'm not sure whether Intellij 
 correctly understands all source locations. Also, we might require specifying 
 some profiles for the build to work directly. We should document clearly how 
 to start with vanilla Spark master and get the entire thing building in 
 Intellij.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7571) Rename `Math` to `math` in MLlib's Scala code

2015-05-12 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-7571:


 Summary: Rename `Math` to `math` in MLlib's Scala code
 Key: SPARK-7571
 URL: https://issues.apache.org/jira/browse/SPARK-7571
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Trivial


scala.Math has been deprecated since Scala 2.8.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7422) Add argmax to Vector, SparseVector

2015-05-12 Thread George Dittmar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540574#comment-14540574
 ] 

George Dittmar commented on SPARK-7422:
---

Yep will do. Do you want me to hold off on the PR for the other jira until this 
one gets merged in or can I just put them in at the same time but separate?

 Add argmax to Vector, SparseVector
 --

 Key: SPARK-7422
 URL: https://issues.apache.org/jira/browse/SPARK-7422
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor
  Labels: starter

 DenseVector has an argmax method which is currently private to Spark.  It 
 would be nice to add that method to Vector and SparseVector.  Adding it to 
 SparseVector would require being careful about handling the inactive elements 
 correctly and efficiently.
 We should make argmax public and add unit tests.
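 A possible shape for the SparseVector case (a sketch, not necessarily the implementation 
 that will be merged): scan the stored values once, and let the implicit zeros win only 
 when every active value is negative.
 {code}
 import org.apache.spark.mllib.linalg.SparseVector

 // Sketch: argmax over a SparseVector, treating unstored entries as 0.0.
 def sparseArgmax(v: SparseVector): Int = {
   require(v.size > 0, "argmax is undefined for an empty vector")
   if (v.indices.isEmpty) return 0  // every entry is zero
   // Best among the active (stored) entries.
   var bestPos = 0
   var i = 1
   while (i < v.values.length) {
     if (v.values(i) > v.values(bestPos)) bestPos = i
     i += 1
   }
   if (v.values(bestPos) >= 0.0 || v.indices.length == v.size) {
     v.indices(bestPos)
   } else {
     // Every active value is negative and some entries are inactive, so the
     // first index absent from v.indices is the argmax (its value is 0.0).
     (0 until v.size).find(j => java.util.Arrays.binarySearch(v.indices, j) < 0).get
   }
 }
 {code}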



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4128) Create instructions on fully building Spark in Intellij

2015-05-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540550#comment-14540550
 ] 

Sean Owen commented on SPARK-4128:
--

OK, propose the text you want to add back and I'll put that in the wiki.
You don't have to run a script to do anything; 2.10 is the default.

 Create instructions on fully building Spark in Intellij
 ---

 Key: SPARK-4128
 URL: https://issues.apache.org/jira/browse/SPARK-4128
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.2.0


 With some of our more complicated modules, I'm not sure whether Intellij 
 correctly understands all source locations. Also, we might require specifying 
 some profiles for the build to work directly. We should document clearly how 
 to start with vanilla Spark master and get the entire thing building in 
 Intellij.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7572) Move Param and Params to ml.param in PySpark

2015-05-12 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-7572:


 Summary: Move Param and Params to ml.param in PySpark
 Key: SPARK-7572
 URL: https://issues.apache.org/jira/browse/SPARK-7572
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


to match Scala namespaces.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7572) Move Param and Params to ml.param in PySpark

2015-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540617#comment-14540617
 ] 

Apache Spark commented on SPARK-7572:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/6094

 Move Param and Params to ml.param in PySpark
 

 Key: SPARK-7572
 URL: https://issues.apache.org/jira/browse/SPARK-7572
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 to match Scala namespaces.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7572) Move Param and Params to ml.param in PySpark

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7572:
---

Assignee: Xiangrui Meng  (was: Apache Spark)

 Move Param and Params to ml.param in PySpark
 

 Key: SPARK-7572
 URL: https://issues.apache.org/jira/browse/SPARK-7572
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 to match Scala namespaces.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7572) Move Param and Params to ml.param in PySpark

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7572:
---

Assignee: Apache Spark  (was: Xiangrui Meng)

 Move Param and Params to ml.param in PySpark
 

 Key: SPARK-7572
 URL: https://issues.apache.org/jira/browse/SPARK-7572
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Apache Spark

 to match Scala namespaces.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7552) Close files correctly when iteration is finished in WAL recovery

2015-05-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7552:
-
Labels:   (was: backport-needed)

 Close files correctly when iteration is finished in WAL recovery
 

 Key: SPARK-7552
 URL: https://issues.apache.org/jira/browse/SPARK-7552
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.1, 1.4.0
Reporter: Saisai Shao
Assignee: Saisai Shao
 Fix For: 1.3.2, 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7552) Close files correctly when iteration is finished in WAL recovery

2015-05-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7552.
--
   Resolution: Pending Closed
Fix Version/s: 1.3.2
 Assignee: Saisai Shao

 Close files correctly when iteration is finished in WAL recovery
 

 Key: SPARK-7552
 URL: https://issues.apache.org/jira/browse/SPARK-7552
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.1, 1.4.0
Reporter: Saisai Shao
Assignee: Saisai Shao
 Fix For: 1.3.2, 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4128) Create instructions on fully building Spark in Intellij

2015-05-12 Thread Christian Kadner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540397#comment-14540397
 ] 

Christian Kadner commented on SPARK-4128:
-

Not every user may care about each of the modules, and yes, these instructions 
may need to be revised.

Yet I strongly think there should be some general text, maybe under Other 
Tips, that explains the need to manually update the Module settings to mark 
additional folders as Source folders (after selecting the right combination of 
Profiles and doing a Generate Sources).

For spark-hive this seems to still be true.

Patrick had written this comment in one of his emails, which is helpful for 
understanding why that needs to be done.

 In some cases in the maven build we now have pluggable source
 directories based on profiles using the maven build helper plug-in.
 This is necessary to support cross building against different Hive
 versions, and there will be additional instances of this due to
 supporting scala 2.11 and 2.10.

 In these cases, you may need to add source locations explicitly to
 intellij if you want the entire project to compile there.

 Unfortunately as long as we support cross-building like this, it will
 be an issue. Intellij's maven support does not correctly detect our
 use of the maven-build-plugin to add source directories.

Besides fixing the module settings for spark-hive, I had to change the 
flume-sink module settings to mark the 
target\scala-2.10\src_managed\main\compiled_avro folder as an additional Source 
Folder.



 Create instructions on fully building Spark in Intellij
 ---

 Key: SPARK-4128
 URL: https://issues.apache.org/jira/browse/SPARK-4128
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.2.0


 With some of our more complicated modules, I'm not sure whether Intellij 
 correctly understands all source locations. Also, we might require specifying 
 some profiles for the build to work directly. We should document clearly how 
 to start with vanilla Spark master and get the entire thing building in 
 Intellij.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5707) Enabling spark.sql.codegen throws ClassNotFound exception

2015-05-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5707:
---
Assignee: Ram Sriharsha

 Enabling spark.sql.codegen throws ClassNotFound exception
 -

 Key: SPARK-5707
 URL: https://issues.apache.org/jira/browse/SPARK-5707
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.1
 Environment: yarn-client mode, spark.sql.codegen=true
Reporter: Yi Yao
Assignee: Ram Sriharsha
Priority: Blocker

 Exception thrown:
 {noformat}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in 
 stage 133.0 failed 4 times, most recent failure: Lost task 13.3 in stage 
 133.0 (TID 3066, cdh52-node2): java.io.IOException: 
 com.esotericsoftware.kryo.KryoException: Unable to find class: 
 __wrapper$1$81257352e1c844aebf09cb84fe9e7459.__wrapper$1$81257352e1c844aebf09cb84fe9e7459$SpecificRow$1
 Serialization trace:
 hashTable (org.apache.spark.sql.execution.joins.UniqueKeyHashedRelation)
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
 at 
 org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
 at 
 org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
 at 
 org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
 at 
 org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
 at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
 at 
 org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:62)
 at 
 org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:61)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at 

[jira] [Created] (SPARK-7570) Ignore _temporary folders during partition discovery

2015-05-12 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-7570:
-

 Summary: Ignore _temporary folders during partition discovery
 Key: SPARK-7570
 URL: https://issues.apache.org/jira/browse/SPARK-7570
 Project: Spark
  Issue Type: Improvement
Reporter: Cheng Lian
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7425) spark.ml Predictor should support other numeric types for label

2015-05-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540585#comment-14540585
 ] 

Joseph K. Bradley commented on SPARK-7425:
--

Should we not just support all NumericType sub-types?

 spark.ml Predictor should support other numeric types for label
 ---

 Key: SPARK-7425
 URL: https://issues.apache.org/jira/browse/SPARK-7425
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor
  Labels: starter

 Currently, the Predictor abstraction expects the input labelCol type to be 
 DoubleType, but we should support other numeric types.  This will involve 
 updating the PredictorParams.validateAndTransformSchema method.
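 A minimal sketch of the relaxed check (an assumed helper, not the actual PredictorParams 
 code), accepting any NumericType for the label column instead of requiring DoubleType:
 {code}
 import org.apache.spark.sql.types.{NumericType, StructType}

 // Sketch: validate that the label column is numeric rather than strictly DoubleType.
 def validateLabelColumn(schema: StructType, labelCol: String): Unit = {
   val dt = schema(labelCol).dataType
   require(dt.isInstanceOf[NumericType],
     s"Label column $labelCol must be of a numeric type but was $dt")
 }
 {code}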



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5446) Parquet column pruning should work for Map and Struct

2015-05-12 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540609#comment-14540609
 ] 

Michael Armbrust commented on SPARK-5446:
-

Can you post the query execution for all four versions of the query?

 Parquet column pruning should work for Map and Struct
 -

 Key: SPARK-5446
 URL: https://issues.apache.org/jira/browse/SPARK-5446
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0, 1.3.0
Reporter: Jianshi Huang

 Consider the following query:
 {code:sql}
 select stddev_pop(variables.var1) stddev
 from model
 group by model_name
 {code}
 Where variables is a Struct containing many fields; similarly, it can be a Map 
 with many key-value pairs.
 During execution, SparkSQL will shuffle the whole map or struct column 
 instead of extracting the value first. The performance is very poor.
 The optimized version could use a subquery:
 {code:sql}
 select stddev_pop(var) stddev
 from (select variables.var1 as var, model_name from model) m
 group by model_name
 {code}
 Where we extract the field/key-value on the mapper side only, so the data 
 being shuffled is small.
 A benchmark for a table with 600 variables shows a drastic improvement in 
 runtime:
 || || Parquet (using Map) || Parquet (using Struct) ||
 | Stddev (unoptimized) | 12890s | 583s |
 | Stddev (optimized) | 84s | 61s |
 Parquet already supports reading a single field/key-value at the storage 
 level, but Spark SQL currently has no optimization for it. This would be a 
 very useful optimization for tables whose Map or Struct columns have many 
 fields. 
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7568) ml.LogisticRegression doesn't output the right prediction

2015-05-12 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-7568:


 Summary: ml.LogisticRegression doesn't output the right prediction
 Key: SPARK-7568
 URL: https://issues.apache.org/jira/browse/SPARK-7568
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: DB Tsai
Priority: Blocker


`bin/spark-submit 
examples/src/main/python/ml/simple_text_classification_pipeline.py`

{code}
Row(id=4, text=u'spark i j k', words=[u'spark', u'i', u'j', u'k'], 
features=SparseVector(262144, {105: 1.0, 106: 1.0, 107: 1.0, 62173: 1.0}), 
rawPrediction=DenseVector([0.1629, -0.1629]), probability=DenseVector([0.5406, 
0.4594]), prediction=0.0)
Row(id=5, text=u'l m n', words=[u'l', u'm', u'n'], 
features=SparseVector(262144, {108: 1.0, 109: 1.0, 110: 1.0}), 
rawPrediction=DenseVector([2.6407, -2.6407]), probability=DenseVector([0.9334, 
0.0666]), prediction=0.0)
Row(id=6, text=u'mapreduce spark', words=[u'mapreduce', u'spark'], 
features=SparseVector(262144, {62173: 1.0, 140738: 1.0}), 
rawPrediction=DenseVector([1.2651, -1.2651]), probability=DenseVector([0.7799, 
0.2201]), prediction=0.0)
Row(id=7, text=u'apache hadoop', words=[u'apache', u'hadoop'], 
features=SparseVector(262144, {128334: 1.0, 134181: 1.0}), 
rawPrediction=DenseVector([3.7429, -3.7429]), probability=DenseVector([0.9769, 
0.0231]), prediction=0.0)
{code}

All predictions are 0, while some should be one based on the probability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7567) Migrating Parquet data source to FSBasedRelation

2015-05-12 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-7567:
--
  Component/s: SQL
 Target Version/s: 1.4.0
Affects Version/s: 1.4.0
 Assignee: Cheng Lian

 Migrating Parquet data source to FSBasedRelation
 

 Key: SPARK-7567
 URL: https://issues.apache.org/jira/browse/SPARK-7567
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7569) Improve error for binary expressions

2015-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540440#comment-14540440
 ] 

Apache Spark commented on SPARK-7569:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/6089

 Improve error for binary expressions
 

 Key: SPARK-7569
 URL: https://issues.apache.org/jira/browse/SPARK-7569
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Critical

 This is not a great error:
 {code}
  scala> Seq((1,1)).toDF("a", "b").select(lit(1) + new java.sql.Date(1))
 org.apache.spark.sql.AnalysisException: invalid expression (1 + 0) between 
 Literal 1, IntegerType and Literal 0, DateType;
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7570) Ignore _temporary folders during partition discovery

2015-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540462#comment-14540462
 ] 

Apache Spark commented on SPARK-7570:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/6091

 Ignore _temporary folders during partition discovery
 

 Key: SPARK-7570
 URL: https://issues.apache.org/jira/browse/SPARK-7570
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1, 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Critical

 When speculation is turned on, directories named {{_temporary}} may be left 
 in data directories after saving a DataFrame. These directories should be 
 ignored. Currently they simply fail partition discovery.
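
 A minimal sketch of the filtering idea (illustrative only, not the actual 
 patch in the linked PR; the helper name and layout are assumptions): skip any 
 directory whose name starts with an underscore or a dot before handing paths 
 to the partition-value parser.
 {code}
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

 // List only real data entries under a partitioned table root, dropping
 // Hadoop metadata entries such as _temporary directories and _SUCCESS files.
 def listDataPaths(fs: FileSystem, root: Path): Seq[FileStatus] = {
   fs.listStatus(root).toSeq.filterNot { status =>
     val name = status.getPath.getName
     name.startsWith("_") || name.startsWith(".")
   }
 }

 // Hypothetical usage:
 // val fs = FileSystem.get(new Configuration())
 // listDataPaths(fs, new Path("hdfs://ns/warehouse/my_table")).foreach(println)
 {code}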



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7570) Ignore _temporary folders during partition discovery

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7570:
---

Assignee: Apache Spark  (was: Cheng Lian)

 Ignore _temporary folders during partition discovery
 

 Key: SPARK-7570
 URL: https://issues.apache.org/jira/browse/SPARK-7570
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1, 1.4.0
Reporter: Cheng Lian
Assignee: Apache Spark
Priority: Critical

 When speculation is turned on, directories named {{_temporary}} may be left 
 in data directories after saving a DataFrame. These directories should be 
 ignored. Currently they simply fail partition discovery.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7570) Ignore _temporary folders during partition discovery

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7570:
---

Assignee: Cheng Lian  (was: Apache Spark)

 Ignore _temporary folders during partition discovery
 

 Key: SPARK-7570
 URL: https://issues.apache.org/jira/browse/SPARK-7570
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1, 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Critical

 When speculation is turned on, directories named {{_temporary}} may be left 
 in data directories after saving a DataFrame. These directories should be 
 ignored. Currently they simply fail partition discovery.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7425) spark.ml Predictor should support other numeric types for label

2015-05-12 Thread Glenn Weidner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540558#comment-14540558
 ] 

Glenn Weidner commented on SPARK-7425:
--

Working on adding support at the second TODO in 
ml.Predictor.validateAndTransformSchema for the following spark.sql.types:
DecimalType, FloatType, IntegerType, LongType, ShortType.

 spark.ml Predictor should support other numeric types for label
 ---

 Key: SPARK-7425
 URL: https://issues.apache.org/jira/browse/SPARK-7425
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor
  Labels: starter

 Currently, the Predictor abstraction expects the input labelCol type to be 
 DoubleType, but we should support other numeric types.  This will involve 
 updating the PredictorParams.validateAndTransformSchema method.
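
 A minimal sketch of the validation change (illustrative names, not the actual 
 PredictorParams code), covering the numeric types mentioned in the comment 
 above:
 {code}
 import org.apache.spark.sql.types._

 // Returns true if the label column's type can be treated as a label,
 // i.e. safely cast to Double downstream.
 def isSupportedLabelType(dt: DataType): Boolean = dt match {
   case DoubleType | FloatType | IntegerType | LongType | ShortType => true
   case _: DecimalType => true
   case _ => false
 }

 def validateLabelCol(schema: StructType, labelCol: String): Unit = {
   val dt = schema(labelCol).dataType
   require(isSupportedLabelType(dt),
     s"Label column $labelCol must be numeric, but was $dt")
 }
 {code}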



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7557) User guide update for feature transformer: HashingTF, Tokenizer

2015-05-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7557:
-
Summary: User guide update for feature transformer: HashingTF, Tokenizer  
(was: User guide update for feature transformer: HashingTF)

 User guide update for feature transformer: HashingTF, Tokenizer
 ---

 Key: SPARK-7557
 URL: https://issues.apache.org/jira/browse/SPARK-7557
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2018) Big-Endian (IBM Power7) Spark Serialization issue

2015-05-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2018.
--
   Resolution: Pending Closed
Fix Version/s: 1.4.0
   1.3.2

Issue resolved by pull request 6077
[https://github.com/apache/spark/pull/6077]

 Big-Endian (IBM Power7)  Spark Serialization issue
 --

 Key: SPARK-2018
 URL: https://issues.apache.org/jira/browse/SPARK-2018
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.0
 Environment: hardware : IBM Power7
 OS:Linux version 2.6.32-358.el6.ppc64 
 (mockbu...@ppc-017.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red 
 Hat 4.4.7-3) (GCC) ) #1 SMP Tue Jan 29 11:43:27 EST 2013
 JDK: Java(TM) SE Runtime Environment (build pxp6470sr5-20130619_01(SR5))
 IBM J9 VM (build 2.6, JRE 1.7.0 Linux ppc64-64 Compressed References 
 20130617_152572 (JIT enabled, AOT enabled)
 Hadoop:Hadoop-0.2.3-CDH5.0
 Spark:Spark-1.0.0 or Spark-0.9.1
 spark-env.sh:
 export JAVA_HOME=/opt/ibm/java-ppc64-70/
 export SPARK_MASTER_IP=9.114.34.69
 export SPARK_WORKER_MEMORY=1m
 export SPARK_CLASSPATH=/home/test1/spark-1.0.0-bin-hadoop2/lib
 export  STANDALONE_SPARK_MASTER_HOST=9.114.34.69
 #export SPARK_JAVA_OPTS=' -Xdebug 
 -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n '
Reporter: Yanjie Gao
 Fix For: 1.3.2, 1.4.0


 We have an application running on Spark on a Power7 system, but we hit an 
 important serialization issue. The HdfsWordCount example reproduces the 
 problem:
 ./bin/run-example org.apache.spark.examples.streaming.HdfsWordCount localdir
 We used Power7 (a Big-Endian architecture) and Red Hat 6.4. Big-Endianness is 
 the main cause, since the example ran successfully in another Power-based 
 Little-Endian setup.
 Here is the exception stack and log:
 Spark Executor Command: /opt/ibm/java-ppc64-70//bin/java -cp 
 /home/test1/spark-1.0.0-bin-hadoop2/lib::/home/test1/src/spark-1.0.0-bin-hadoop2/conf:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/
  -XX:MaxPermSize=128m  -Xdebug 
 -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n -Xms512M 
 -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend 
 akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler 2 
 p7hvs7br16 4 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker 
 app-20140604023054-
 
 14/06/04 02:31:20 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 14/06/04 02:31:21 INFO spark.SecurityManager: Changing view acls to: 
 test1,yifeng
 14/06/04 02:31:21 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(test1, yifeng)
 14/06/04 02:31:22 INFO slf4j.Slf4jLogger: Slf4jLogger started
 14/06/04 02:31:22 INFO Remoting: Starting remoting
 14/06/04 02:31:22 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkExecutor@p7hvs7br16:39658]
 14/06/04 02:31:22 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://sparkExecutor@p7hvs7br16:39658]
 14/06/04 02:31:22 INFO executor.CoarseGrainedExecutorBackend: Connecting to 
 driver: akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler
 14/06/04 02:31:22 INFO worker.WorkerWatcher: Connecting to worker 
 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker
 14/06/04 02:31:23 INFO worker.WorkerWatcher: Successfully connected to 
 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker
 14/06/04 02:31:24 INFO executor.CoarseGrainedExecutorBackend: Successfully 
 registered with driver
 14/06/04 02:31:24 INFO spark.SecurityManager: Changing view acls to: 
 test1,yifeng
 14/06/04 02:31:24 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(test1, yifeng)
 14/06/04 02:31:24 INFO slf4j.Slf4jLogger: Slf4jLogger started
 14/06/04 02:31:24 INFO Remoting: Starting remoting
 14/06/04 02:31:24 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://spark@p7hvs7br16:58990]
 14/06/04 02:31:24 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://spark@p7hvs7br16:58990]
 14/06/04 02:31:24 INFO spark.SparkEnv: Connecting to MapOutputTracker: 
 akka.tcp://spark@9.186.105.141:60253/user/MapOutputTracker
 14/06/04 02:31:25 INFO spark.SparkEnv: Connecting to BlockManagerMaster: 
 

[jira] [Assigned] (SPARK-7557) User guide update for feature transformer: HashingTF, Tokenizer

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7557:
---

Assignee: Apache Spark  (was: Joseph K. Bradley)

 User guide update for feature transformer: HashingTF, Tokenizer
 ---

 Key: SPARK-7557
 URL: https://issues.apache.org/jira/browse/SPARK-7557
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Apache Spark

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7555) User guide update for ElasticNet

2015-05-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540602#comment-14540602
 ] 

Joseph K. Bradley commented on SPARK-7555:
--

Note: I created a new subsection for links to spark.ml-specific guides in this 
JIRA's PR: [SPARK-7557].  For new algorithms like ElasticNet, we can add 
similar new subsections/links as needed.

 User guide update for ElasticNet
 

 Key: SPARK-7555
 URL: https://issues.apache.org/jira/browse/SPARK-7555
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Joseph K. Bradley
Assignee: DB Tsai

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7557) User guide update for feature transformer: HashingTF, Tokenizer

2015-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540601#comment-14540601
 ] 

Apache Spark commented on SPARK-7557:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/6093

 User guide update for feature transformer: HashingTF, Tokenizer
 ---

 Key: SPARK-7557
 URL: https://issues.apache.org/jira/browse/SPARK-7557
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7570) Ignore _temporary folders during partition discovery

2015-05-12 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-7570:
--
  Component/s: SQL
  Description: When speculation is turned on, directories named 
{{_temporary}} may be left in data directories after saving a DataFrame. These 
directories should be ignored. Currently they simply fail partition discovery.
 Target Version/s: 1.4.0
Affects Version/s: 1.4.0
   1.3.1
 Assignee: Cheng Lian

 Ignore _temporary folders during partition discovery
 

 Key: SPARK-7570
 URL: https://issues.apache.org/jira/browse/SPARK-7570
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1, 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Critical

 When speculation is turned on, directories named {{_temporary}} may be left 
 in data directories after saving a DataFrame. These directories should be 
 ignored. Currently they simply fail partition discovery.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6980) Akka timeout exceptions indicate which conf controls them

2015-05-12 Thread Harsh Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540335#comment-14540335
 ] 

Harsh Gupta edited comment on SPARK-6980 at 5/12/15 6:48 PM:
-

[~bryanc] [~irashid] can you update with your progress so that we can share the 
workload? I created my own PR later but realised most of the work has already 
been done by Bryan in his PR commits. Is there any way I can merge his PR and 
work in parallel with Bryan?


was (Author: harshg):
[~bryanc] can you update with the progress so that we can share the work load ?

 Akka timeout exceptions indicate which conf controls them
 -

 Key: SPARK-6980
 URL: https://issues.apache.org/jira/browse/SPARK-6980
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Imran Rashid
Assignee: Harsh Gupta
Priority: Minor
  Labels: starter
 Attachments: Spark-6980-Test.scala


 If you hit one of the akka timeouts, you just get an exception like
 {code}
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
 {code}
 The exception doesn't indicate how to change the timeout, though there is 
 usually (always?) a corresponding setting in {{SparkConf}}.  It would be 
 nice if the exception included the relevant setting.
 I think this should be pretty easy to do -- we just need to create something 
 like a {{NamedTimeout}}.  It would have its own {{await}} method that catches 
 the Akka timeout and throws its own exception.  We should change 
 {{RpcUtils.askTimeout}} and {{RpcUtils.lookupTimeout}} to always return a 
 {{NamedTimeout}}, so we can be sure that any time we hit a timeout, we get a 
 better exception.
 Given the latest refactoring to the rpc layer, this needs to be done in both 
 {{AkkaUtils}} and {{AkkaRpcEndpoint}}.
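
 A rough sketch of the {{NamedTimeout}} idea (illustrative names and API, not 
 the final implementation): carry the SparkConf key alongside the duration so 
 the rethrown exception can point at the setting that controls it.
 {code}
 import java.util.concurrent.TimeoutException
 import scala.concurrent.{Await, Awaitable}
 import scala.concurrent.duration.FiniteDuration

 // Pairs a timeout duration with the configuration key it came from, so any
 // TimeoutException can tell the user which setting to increase.
 case class NamedTimeout(duration: FiniteDuration, confKey: String) {
   def awaitResult[T](awaitable: Awaitable[T]): T =
     try {
       Await.result(awaitable, duration)
     } catch {
       case e: TimeoutException =>
         val better = new TimeoutException(
           s"Futures timed out after [$duration]. " +
             s"This timeout is controlled by $confKey")
         better.initCause(e)
         throw better
     }
 }

 // Hypothetical usage:
 //   import scala.concurrent.duration._
 //   val askTimeout = NamedTimeout(30.seconds, "spark.akka.askTimeout")
 //   askTimeout.awaitResult(someFuture)
 {code}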



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7276) withColumn is very slow on dataframe with large number of columns

2015-05-12 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-7276.
-
   Resolution: Pending Closed
Fix Version/s: 1.4.0

Issue resolved by pull request 5831
[https://github.com/apache/spark/pull/5831]

 withColumn is very slow on dataframe with large number of columns
 -

 Key: SPARK-7276
 URL: https://issues.apache.org/jira/browse/SPARK-7276
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.3.1
Reporter: Alexandre CLEMENT
Assignee: Wenchen Fan
 Fix For: 1.4.0


 The code snippet demonstrates the problem.
 {code}
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql._
 import org.apache.spark.sql.types._
 val sparkConf = new SparkConf().setAppName("Spark Test")
   .setMaster(System.getProperty("spark.master", "local[4]"))
 val sc = new SparkContext(sparkConf)
 val sqlContext = new SQLContext(sc)
 import sqlContext.implicits._
 val custs = Seq(
   Row(1, "Bob", 21, 80.5),
   Row(2, "Bobby", 21, 80.5),
   Row(3, "Jean", 21, 80.5),
   Row(4, "Fatime", 21, 80.5)
 )
 // Field order matches the Row layout above: id, name, numeric column, target.
 val fields = List(
   StructField("id", IntegerType, true),
   StructField("b", StringType, true),
   StructField("a", IntegerType, true),
   StructField("target", DoubleType, false))
 val schema = StructType(fields)
 val rdd = sc.parallelize(custs)
 var df = sqlContext.createDataFrame(rdd, schema)
 // Each iteration gets slower as the number of columns grows.
 for (i <- 1 to 200) {
   val now = System.currentTimeMillis
   df = df.withColumn("a_new_col_" + i, df("a") + i)
   println(s"$i -> " + (System.currentTimeMillis - now))
 }
 df.show()
 {code}
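
 As a workaround sketch (my addition, not part of the original report; assumes 
 the same {{df}} as above): build the derived columns up front and add them in 
 a single select, so the plan is constructed and analyzed once instead of once 
 per withColumn call.
 {code}
 // Derive all 200 columns first, then project them in one pass.
 val newCols = (1 to 200).map(i => (df("a") + i).as("a_new_col_" + i))
 val wide = df.select(df.columns.map(df(_)) ++ newCols: _*)
 wide.show()
 {code}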



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7531) Install GPG on Jenkins machines

2015-05-12 Thread shane knapp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp resolved SPARK-7531.

Resolution: Pending Closed

it was already installed on all hosts, we're g2g

 Install GPG on Jenkins machines
 ---

 Key: SPARK-7531
 URL: https://issues.apache.org/jira/browse/SPARK-7531
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: shane knapp

 This one is also required for us to cut regular snapshot releases from 
 Jenkins.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7422) Add argmax to Vector, SparseVector

2015-05-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540509#comment-14540509
 ] 

Joseph K. Bradley commented on SPARK-7422:
--

Great!  Just to confirm: Can you please do separate PRs for this JIRA and the 
related one you're working on?

 Add argmax to Vector, SparseVector
 --

 Key: SPARK-7422
 URL: https://issues.apache.org/jira/browse/SPARK-7422
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor
  Labels: starter

 DenseVector has an argmax method which is currently private to Spark.  It 
 would be nice to add that method to Vector and SparseVector.  Adding it to 
 SparseVector would require being careful about handling the inactive elements 
 correctly and efficiently.
 We should make argmax public and add unit tests.
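
 A minimal sketch of the kind of logic a SparseVector argmax needs (standalone 
 illustration, not the eventual API; ties between explicit and implicit zeros 
 are left unresolved here): the maximum over the active values is only the 
 answer when it is non-negative or when the vector has no inactive entries.
 {code}
 // size: full vector length; indices/values: the active (stored) entries,
 // with indices sorted ascending as in MLlib sparse vectors.
 def sparseArgmax(size: Int, indices: Array[Int], values: Array[Double]): Int = {
   if (size == 0) return -1          // convention for an empty vector
   if (values.isEmpty) return 0      // all entries are implicit zeros
   var maxIdx = indices(0)
   var maxVal = values(0)
   var i = 1
   while (i < values.length) {
     if (values(i) > maxVal) { maxVal = values(i); maxIdx = indices(i) }
     i += 1
   }
   // If every active value is negative and some entry is inactive, an
   // implicit 0.0 beats them all: return the first index missing from indices.
   if (maxVal < 0.0 && indices.length < size) {
     var expected = 0
     var j = 0
     while (j < indices.length && indices(j) == expected) { expected += 1; j += 1 }
     maxIdx = expected
   }
   maxIdx
 }
 {code}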



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2018) Big-Endian (IBM Power7) Spark Serialization issue

2015-05-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2018:
-
Assignee: Tim Ellison

 Big-Endian (IBM Power7)  Spark Serialization issue
 --

 Key: SPARK-2018
 URL: https://issues.apache.org/jira/browse/SPARK-2018
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.0
 Environment: hardware : IBM Power7
 OS:Linux version 2.6.32-358.el6.ppc64 
 (mockbu...@ppc-017.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red 
 Hat 4.4.7-3) (GCC) ) #1 SMP Tue Jan 29 11:43:27 EST 2013
 JDK: Java(TM) SE Runtime Environment (build pxp6470sr5-20130619_01(SR5))
 IBM J9 VM (build 2.6, JRE 1.7.0 Linux ppc64-64 Compressed References 
 20130617_152572 (JIT enabled, AOT enabled)
 Hadoop:Hadoop-0.2.3-CDH5.0
 Spark:Spark-1.0.0 or Spark-0.9.1
 spark-env.sh:
 export JAVA_HOME=/opt/ibm/java-ppc64-70/
 export SPARK_MASTER_IP=9.114.34.69
 export SPARK_WORKER_MEMORY=1m
 export SPARK_CLASSPATH=/home/test1/spark-1.0.0-bin-hadoop2/lib
 export  STANDALONE_SPARK_MASTER_HOST=9.114.34.69
 #export SPARK_JAVA_OPTS=' -Xdebug 
 -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n '
Reporter: Yanjie Gao
Assignee: Tim Ellison
 Fix For: 1.3.2, 1.4.0


 We have an application running on Spark on a Power7 system, but we hit an 
 important serialization issue. The HdfsWordCount example reproduces the 
 problem:
 ./bin/run-example org.apache.spark.examples.streaming.HdfsWordCount localdir
 We used Power7 (a Big-Endian architecture) and Red Hat 6.4. Big-Endianness is 
 the main cause, since the example ran successfully in another Power-based 
 Little-Endian setup.
 Here is the exception stack and log:
 Spark Executor Command: /opt/ibm/java-ppc64-70//bin/java -cp 
 /home/test1/spark-1.0.0-bin-hadoop2/lib::/home/test1/src/spark-1.0.0-bin-hadoop2/conf:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/
  -XX:MaxPermSize=128m  -Xdebug 
 -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n -Xms512M 
 -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend 
 akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler 2 
 p7hvs7br16 4 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker 
 app-20140604023054-
 
 14/06/04 02:31:20 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 14/06/04 02:31:21 INFO spark.SecurityManager: Changing view acls to: 
 test1,yifeng
 14/06/04 02:31:21 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(test1, yifeng)
 14/06/04 02:31:22 INFO slf4j.Slf4jLogger: Slf4jLogger started
 14/06/04 02:31:22 INFO Remoting: Starting remoting
 14/06/04 02:31:22 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkExecutor@p7hvs7br16:39658]
 14/06/04 02:31:22 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://sparkExecutor@p7hvs7br16:39658]
 14/06/04 02:31:22 INFO executor.CoarseGrainedExecutorBackend: Connecting to 
 driver: akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler
 14/06/04 02:31:22 INFO worker.WorkerWatcher: Connecting to worker 
 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker
 14/06/04 02:31:23 INFO worker.WorkerWatcher: Successfully connected to 
 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker
 14/06/04 02:31:24 INFO executor.CoarseGrainedExecutorBackend: Successfully 
 registered with driver
 14/06/04 02:31:24 INFO spark.SecurityManager: Changing view acls to: 
 test1,yifeng
 14/06/04 02:31:24 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(test1, yifeng)
 14/06/04 02:31:24 INFO slf4j.Slf4jLogger: Slf4jLogger started
 14/06/04 02:31:24 INFO Remoting: Starting remoting
 14/06/04 02:31:24 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://spark@p7hvs7br16:58990]
 14/06/04 02:31:24 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://spark@p7hvs7br16:58990]
 14/06/04 02:31:24 INFO spark.SparkEnv: Connecting to MapOutputTracker: 
 akka.tcp://spark@9.186.105.141:60253/user/MapOutputTracker
 14/06/04 02:31:25 INFO spark.SparkEnv: Connecting to BlockManagerMaster: 
 akka.tcp://spark@9.186.105.141:60253/user/BlockManagerMaster
 14/06/04 02:31:25 INFO storage.DiskBlockManager: Created local directory at 
 

[jira] [Commented] (SPARK-7556) User guide update for feature transformer: Binarizer

2015-05-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540607#comment-14540607
 ] 

Joseph K. Bradley commented on SPARK-7556:
--

Note: I created a new subsection for links to spark.ml-specific guides in this 
JIRA's PR: [SPARK-7557].  Binarizer can go within the new subsection.  I'll try 
to get that PR merged ASAP.  Thanks!

 User guide update for feature transformer: Binarizer
 

 Key: SPARK-7556
 URL: https://issues.apache.org/jira/browse/SPARK-7556
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Liang-Chi Hsieh

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4128) Create instructions on fully building Spark in Intellij

2015-05-12 Thread Christian Kadner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540397#comment-14540397
 ] 

Christian Kadner edited comment on SPARK-4128 at 5/12/15 6:23 PM:
--

Not every user may care about each of the modules, and yes, these instructions 
may need to be revised.

Yet I strongly think there should be some general text, maybe under Other 
Tips, that explains the need to manually update the Module settings to mark 
additional folders as Source folders (after selecting the right combination of 
Profiles and doing a Generate Sources).

For spark-hive this seems to still be true.

Patrick had written this comment in one of his emails, which is helpful to 
understand why that needs to be done.

 In some cases in the maven build we now have pluggable source
 directories based on profiles using the maven build helper plug-in.
 This is necessary to support cross building against different Hive
 versions, and there will be additional instances of this due to
 supporting scala 2.11 and 2.10.

 In these cases, you may need to add source locations explicitly to
 intellij if you want the entire project to compile there.

 Unfortunately as long as we support cross-building like this, it will
 be an issue. Intellij's maven support does not correctly detect our
 use of the maven-build-plugin to add source directories.

Besides fixing the module settings for spark-hive, I had to change the 
flume-sink module settings to mark the 
target\scala-2.10\src_managed\main\compiled_avro folder as an additional Source 
Folder.




was (Author: ckadner):
Not every user may care about each of the modules, and yes, these instructions 
may need to be revised.

Yet I strongly think there should be some general text, maybe under Other 
Tips, that explains the need to manually update the Module settings to mark 
additional folders as Source folders (after selecting the right combination of 
Profiles and doing a Generate Sources 

For spark-hive this seems to still be true.

Patrick had written this comment in one of his emails, which are helpful to 
understand why that needs to be done.

 In some cases in the maven build we now have pluggable source
 directories based on profiles using the maven build helper plug-in.
 This is necessary to support cross building against different Hive
 versions, and there will be additional instances of this due to
 supporting scala 2.11 and 2.10.

 In these cases, you may need to add source locations explicitly to
 intellij if you want the entire project to compile there.

 Unfortunately as long as we support cross-building like this, it will
 be an issue. Intellij's maven support does not correctly detect our
 use of the maven-build-plugin to add source directories.

Besides fixing the module settings for spark-hive, I had to change the 
flume-sink module settings to mark 
target\scala-2.10\src_managed\main\compiled_avro folder as additional Source 
Folder.



 Create instructions on fully building Spark in Intellij
 ---

 Key: SPARK-4128
 URL: https://issues.apache.org/jira/browse/SPARK-4128
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.2.0


 With some of our more complicated modules, I'm not sure whether Intellij 
 correctly understands all source locations. Also, we might require specifying 
 some profiles for the build to work directly. We should document clearly how 
 to start with vanilla Spark master and get the entire thing building in 
 Intellij.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7569) Improve error for binary expressions

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7569:
---

Assignee: Apache Spark  (was: Michael Armbrust)

 Improve error for binary expressions
 

 Key: SPARK-7569
 URL: https://issues.apache.org/jira/browse/SPARK-7569
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Apache Spark
Priority: Critical

 This is not a great error:
 {code}
 scala Seq((1,1)).toDF(a, b).select(lit(1) + new java.sql.Date(1)) 
 org.apache.spark.sql.AnalysisException: invalid expression (1 + 0) between 
 Literal 1, IntegerType and Literal 0, DateType;
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7569) Improve error for binary expressions

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7569:
---

Assignee: Michael Armbrust  (was: Apache Spark)

 Improve error for binary expressions
 

 Key: SPARK-7569
 URL: https://issues.apache.org/jira/browse/SPARK-7569
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Critical

 This is not a great error:
 {code}
  scala> Seq((1,1)).toDF("a", "b").select(lit(1) + new java.sql.Date(1))
 org.apache.spark.sql.AnalysisException: invalid expression (1 + 0) between 
 Literal 1, IntegerType and Literal 0, DateType;
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7567) Migrating Parquet data source to FSBasedRelation

2015-05-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540450#comment-14540450
 ] 

Apache Spark commented on SPARK-7567:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/6090

 Migrating Parquet data source to FSBasedRelation
 

 Key: SPARK-7567
 URL: https://issues.apache.org/jira/browse/SPARK-7567
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7567) Migrating Parquet data source to FSBasedRelation

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7567:
---

Assignee: Apache Spark  (was: Cheng Lian)

 Migrating Parquet data source to FSBasedRelation
 

 Key: SPARK-7567
 URL: https://issues.apache.org/jira/browse/SPARK-7567
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Assignee: Apache Spark
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7571) Rename `Math` to `math` in MLlib's Scala code

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7571:
---

Assignee: Apache Spark  (was: Xiangrui Meng)

 Rename `Math` to `math` in MLlib's Scala code
 -

 Key: SPARK-7571
 URL: https://issues.apache.org/jira/browse/SPARK-7571
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Apache Spark
Priority: Trivial

 scala.Math has been deprecated since Scala 2.8.
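
 For illustration (the deprecated scala.Math object vs. the scala.math package 
 object that replaces it):
 {code}
 val before = Math.exp(-1.0)   // old style touched by this ticket
 val after  = math.exp(-1.0)   // preferred: scala.math
 {code}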



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7487) Python API for ml.regression

2015-05-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7487:
-
Assignee: Burak Yavuz

 Python API for ml.regression
 

 Key: SPARK-7487
 URL: https://issues.apache.org/jira/browse/SPARK-7487
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Burak Yavuz
Assignee: Burak Yavuz
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7557) User guide update for feature transformer: HashingTF, Tokenizer

2015-05-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7557:
---

Assignee: Joseph K. Bradley  (was: Apache Spark)

 User guide update for feature transformer: HashingTF, Tokenizer
 ---

 Key: SPARK-7557
 URL: https://issues.apache.org/jira/browse/SPARK-7557
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7568) ml.LogisticRegression doesn't output the right prediction

2015-05-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7568:
-
Description: 
`bin/spark-submit 
examples/src/main/python/ml/simple_text_classification_pipeline.py`

{code}
Row(id=4, text=u'spark i j k', words=[u'spark', u'i', u'j', u'k'], 
features=SparseVector(262144, {105: 1.0, 106: 1.0, 107: 1.0, 62173: 1.0}), 
rawPrediction=DenseVector([0.1629, -0.1629]), probability=DenseVector([0.5406, 
0.4594]), prediction=0.0)
Row(id=5, text=u'l m n', words=[u'l', u'm', u'n'], 
features=SparseVector(262144, {108: 1.0, 109: 1.0, 110: 1.0}), 
rawPrediction=DenseVector([2.6407, -2.6407]), probability=DenseVector([0.9334, 
0.0666]), prediction=0.0)
Row(id=6, text=u'mapreduce spark', words=[u'mapreduce', u'spark'], 
features=SparseVector(262144, {62173: 1.0, 140738: 1.0}), 
rawPrediction=DenseVector([1.2651, -1.2651]), probability=DenseVector([0.7799, 
0.2201]), prediction=0.0)
Row(id=7, text=u'apache hadoop', words=[u'apache', u'hadoop'], 
features=SparseVector(262144, {128334: 1.0, 134181: 1.0}), 
rawPrediction=DenseVector([3.7429, -3.7429]), probability=DenseVector([0.9769, 
0.0231]), prediction=0.0)
{code}

All predictions are 0, while some should be one based on the probability. It 
seems to be an issue with regularization.

  was:
`bin/spark-submit 
examples/src/main/python/ml/simple_text_classification_pipeline.py`

{code}
Row(id=4, text=u'spark i j k', words=[u'spark', u'i', u'j', u'k'], 
features=SparseVector(262144, {105: 1.0, 106: 1.0, 107: 1.0, 62173: 1.0}), 
rawPrediction=DenseVector([0.1629, -0.1629]), probability=DenseVector([0.5406, 
0.4594]), prediction=0.0)
Row(id=5, text=u'l m n', words=[u'l', u'm', u'n'], 
features=SparseVector(262144, {108: 1.0, 109: 1.0, 110: 1.0}), 
rawPrediction=DenseVector([2.6407, -2.6407]), probability=DenseVector([0.9334, 
0.0666]), prediction=0.0)
Row(id=6, text=u'mapreduce spark', words=[u'mapreduce', u'spark'], 
features=SparseVector(262144, {62173: 1.0, 140738: 1.0}), 
rawPrediction=DenseVector([1.2651, -1.2651]), probability=DenseVector([0.7799, 
0.2201]), prediction=0.0)
Row(id=7, text=u'apache hadoop', words=[u'apache', u'hadoop'], 
features=SparseVector(262144, {128334: 1.0, 134181: 1.0}), 
rawPrediction=DenseVector([3.7429, -3.7429]), probability=DenseVector([0.9769, 
0.0231]), prediction=0.0)
{code}

All predictions are 0, while some should be one based on the probability.


 ml.LogisticRegression doesn't output the right prediction
 -

 Key: SPARK-7568
 URL: https://issues.apache.org/jira/browse/SPARK-7568
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: DB Tsai
Priority: Blocker

 `bin/spark-submit 
 examples/src/main/python/ml/simple_text_classification_pipeline.py`
 {code}
 Row(id=4, text=u'spark i j k', words=[u'spark', u'i', u'j', u'k'], 
 features=SparseVector(262144, {105: 1.0, 106: 1.0, 107: 1.0, 62173: 1.0}), 
 rawPrediction=DenseVector([0.1629, -0.1629]), 
 probability=DenseVector([0.5406, 0.4594]), prediction=0.0)
 Row(id=5, text=u'l m n', words=[u'l', u'm', u'n'], 
 features=SparseVector(262144, {108: 1.0, 109: 1.0, 110: 1.0}), 
 rawPrediction=DenseVector([2.6407, -2.6407]), 
 probability=DenseVector([0.9334, 0.0666]), prediction=0.0)
 Row(id=6, text=u'mapreduce spark', words=[u'mapreduce', u'spark'], 
 features=SparseVector(262144, {62173: 1.0, 140738: 1.0}), 
 rawPrediction=DenseVector([1.2651, -1.2651]), 
 probability=DenseVector([0.7799, 0.2201]), prediction=0.0)
 Row(id=7, text=u'apache hadoop', words=[u'apache', u'hadoop'], 
 features=SparseVector(262144, {128334: 1.0, 134181: 1.0}), 
 rawPrediction=DenseVector([3.7429, -3.7429]), 
 probability=DenseVector([0.9769, 0.0231]), prediction=0.0)
 {code}
 All predictions are 0, while some should be one based on the probability. It 
 seems to be an issue with regularization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


