[jira] [Commented] (SPARK-7422) Add argmax to Vector, SparseVector
[ https://issues.apache.org/jira/browse/SPARK-7422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14539314#comment-14539314 ] George Dittmar commented on SPARK-7422: --- Finishing tests for this JIRA with PR inbound soon. Add argmax to Vector, SparseVector -- Key: SPARK-7422 URL: https://issues.apache.org/jira/browse/SPARK-7422 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Priority: Minor Labels: starter DenseVector has an argmax method which is currently private to Spark. It would be nice to add that method to Vector and SparseVector. Adding it to SparseVector would require being careful about handling the inactive elements correctly and efficiently. We should make argmax public and add unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
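To illustrate the inactive-element concern, here is a minimal, self-contained sketch of a sparse-aware argmax. It uses a toy ToySparseVector case class rather than the actual MLlib SparseVector, so the names and representation are assumptions, not the committed implementation:

{code}
// Hedged standalone sketch, not the MLlib implementation.
case class ToySparseVector(size: Int, indices: Array[Int], values: Array[Double]) {
  /** Index of the largest element, treating inactive entries as implicit zeros. */
  def argmax: Int = {
    if (size == 0) return -1
    var maxIdx = -1
    var maxVal = Double.NegativeInfinity
    var k = 0
    while (k < values.length) {
      if (values(k) > maxVal) { maxVal = values(k); maxIdx = indices(k) }
      k += 1
    }
    // If every active value is negative and some entries are inactive, an implicit
    // zero wins; return the first inactive index.
    if (maxVal < 0.0 && values.length < size) {
      val active = indices.toSet
      var candidate = 0
      while (active.contains(candidate)) candidate += 1
      maxIdx = candidate
    }
    maxIdx
  }
}

// ToySparseVector(5, Array(1, 3), Array(-2.0, -7.0)).argmax == 0  (implicit zero at index 0)
{code}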
[jira] [Commented] (SPARK-7423) spark.ml Classifier predict should not convert vectors to dense format
[ https://issues.apache.org/jira/browse/SPARK-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14539311#comment-14539311 ] George Dittmar commented on SPARK-7423: --- Will have a pr for this soon. Just made the changes to the linked JIRA in another branch and finishing up tests now. spark.ml Classifier predict should not convert vectors to dense format -- Key: SPARK-7423 URL: https://issues.apache.org/jira/browse/SPARK-7423 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Minor Labels: starter spark.ml.classification.ClassificationModel and ProbabilisticClassificationModel both use DenseVector.argmax to implement prediction (computing the prediction from the rawPrediction or probability Vectors). It would be best to implement argmax for Vector and SparseVector and use Vector.argmax, rather than converting Vectors to dense format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
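For illustration, a hedged sketch of what the predict path could look like once Vector exposes argmax; the trait and method names below are assumptions for the sketch, not the actual spark.ml API:

{code}
// Illustrative only; real spark.ml models differ in shape and naming.
trait VectorLike { def argmax: Int }

abstract class ClassificationModelSketch {
  /** Raw scores per class, which may come back sparse or dense. */
  protected def predictRaw(features: VectorLike): VectorLike

  // Before: predictRaw(features).toDense.argmax forced a dense copy of the scores.
  // After: argmax works directly on whatever Vector subtype predictRaw returns.
  def predict(features: VectorLike): Double = predictRaw(features).argmax.toDouble
}
{code}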
[jira] [Updated] (SPARK-7562) Improve error reporting for expression data type mismatch
[ https://issues.apache.org/jira/browse/SPARK-7562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7562: --- Description: There is currently no error reporting for expression data types in analysis (we rely on resolved for that, which doesn't provide great error messages for types). It would be great to have that in checkAnalysis. Ideally, it should be the responsibility of each Expression itself to specify the types it requires, and report errors that way. We would need to define a simple interface for that so each Expression can implement. The default implementation can just use the information provided by ExpectsInputTypes.expectedChildTypes. was: There is currently no error reporting for expression data types in analysis. It would be great to have that in checkAnalysis. Ideally, it should be the responsibility of each Expression itself to specify the types it requires, and report errors that way. We would need to define a simple interface for that so each Expression can implement. The default implementation can just use the information provided by ExpectsInputTypes.expectedChildTypes. Improve error reporting for expression data type mismatch - Key: SPARK-7562 URL: https://issues.apache.org/jira/browse/SPARK-7562 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin There is currently no error reporting for expression data types in analysis (we rely on resolved for that, which doesn't provide great error messages for types). It would be great to have that in checkAnalysis. Ideally, it should be the responsibility of each Expression itself to specify the types it requires, and report errors that way. We would need to define a simple interface for that so each Expression can implement. The default implementation can just use the information provided by ExpectsInputTypes.expectedChildTypes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7562) Improve error reporting for expression data type mismatch
[ https://issues.apache.org/jira/browse/SPARK-7562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7562: --- Description: There is currently no error reporting for expression data types in analysis (we rely on resolved for that, which doesn't provide great error messages for types). It would be great to have that in checkAnalysis. Ideally, it should be the responsibility of each Expression itself to specify the types it requires, and report errors that way. We would need to define a simple interface for that so each Expression can implement. The default implementation can just use the information provided by ExpectsInputTypes.expectedChildTypes. cc [~marmbrus] what we discussed offline today. was: There is currently no error reporting for expression data types in analysis (we rely on resolved for that, which doesn't provide great error messages for types). It would be great to have that in checkAnalysis. Ideally, it should be the responsibility of each Expression itself to specify the types it requires, and report errors that way. We would need to define a simple interface for that so each Expression can implement. The default implementation can just use the information provided by ExpectsInputTypes.expectedChildTypes. Improve error reporting for expression data type mismatch - Key: SPARK-7562 URL: https://issues.apache.org/jira/browse/SPARK-7562 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin There is currently no error reporting for expression data types in analysis (we rely on resolved for that, which doesn't provide great error messages for types). It would be great to have that in checkAnalysis. Ideally, it should be the responsibility of each Expression itself to specify the types it requires, and report errors that way. We would need to define a simple interface for that so each Expression can implement. The default implementation can just use the information provided by ExpectsInputTypes.expectedChildTypes. cc [~marmbrus] what we discussed offline today. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
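The kind of per-Expression interface the description proposes could look roughly like the following. This is a self-contained sketch with assumed names (TypeCheckResult, checkInputTypes, string-valued types), not the interface Spark actually adopted:

{code}
// Hypothetical sketch of per-expression type checking, not the actual Catalyst API.
sealed trait TypeCheckResult
case object TypeCheckSuccess extends TypeCheckResult
case class TypeCheckFailure(message: String) extends TypeCheckResult

trait ExpressionLike {
  def children: Seq[ExpressionLike]
  def dataType: String                // stand-in for Catalyst's DataType
  def expectedChildTypes: Seq[String] // default source of expectations

  /** Each expression reports its own mismatch instead of silently staying unresolved. */
  def checkInputTypes(): TypeCheckResult = {
    val actual = children.map(_.dataType)
    if (actual == expectedChildTypes) TypeCheckSuccess
    else TypeCheckFailure(
      s"expected (${expectedChildTypes.mkString(", ")}) but got (${actual.mkString(", ")})")
  }
}
{code}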
[jira] [Commented] (SPARK-7500) DAG visualization: cluster name bleeds beyond the cluster
[ https://issues.apache.org/jira/browse/SPARK-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14539501#comment-14539501 ] Apache Spark commented on SPARK-7500: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/6076 DAG visualization: cluster name bleeds beyond the cluster - Key: SPARK-7500 URL: https://issues.apache.org/jira/browse/SPARK-7500 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Minor Attachments: long names.png This happens only for long names. See screenshot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7500) DAG visualization: cluster name bleeds beyond the cluster
[ https://issues.apache.org/jira/browse/SPARK-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7500: --- Assignee: Apache Spark (was: Andrew Or) DAG visualization: cluster name bleeds beyond the cluster - Key: SPARK-7500 URL: https://issues.apache.org/jira/browse/SPARK-7500 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Apache Spark Priority: Minor Attachments: long names.png This happens only for long names. See screenshot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7556) User guide update for feature transformer: Binarizer
[ https://issues.apache.org/jira/browse/SPARK-7556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14539382#comment-14539382 ] Liang-Chi Hsieh commented on SPARK-7556: OK. User guide update for feature transformer: Binarizer Key: SPARK-7556 URL: https://issues.apache.org/jira/browse/SPARK-7556 Project: Spark Issue Type: Documentation Components: ML Reporter: Joseph K. Bradley Assignee: Liang-Chi Hsieh Copied from [SPARK-7443]: {quote} Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. * The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. * We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-2018) Big-Endian (IBM Power7) Spark Serialization issue
[ https://issues.apache.org/jira/browse/SPARK-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2018: --- Assignee: (was: Apache Spark) Big-Endian (IBM Power7) Spark Serialization issue -- Key: SPARK-2018 URL: https://issues.apache.org/jira/browse/SPARK-2018 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Environment: hardware : IBM Power7 OS:Linux version 2.6.32-358.el6.ppc64 (mockbu...@ppc-017.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Tue Jan 29 11:43:27 EST 2013 JDK: Java(TM) SE Runtime Environment (build pxp6470sr5-20130619_01(SR5)) IBM J9 VM (build 2.6, JRE 1.7.0 Linux ppc64-64 Compressed References 20130617_152572 (JIT enabled, AOT enabled) Hadoop:Hadoop-0.2.3-CDH5.0 Spark:Spark-1.0.0 or Spark-0.9.1 spark-env.sh: export JAVA_HOME=/opt/ibm/java-ppc64-70/ export SPARK_MASTER_IP=9.114.34.69 export SPARK_WORKER_MEMORY=1m export SPARK_CLASSPATH=/home/test1/spark-1.0.0-bin-hadoop2/lib export STANDALONE_SPARK_MASTER_HOST=9.114.34.69 #export SPARK_JAVA_OPTS=' -Xdebug -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n ' Reporter: Yanjie Gao We have an application run on Spark on Power7 System . But we meet an important issue about serialization. The example HdfsWordCount can meet the problem. ./bin/run-example org.apache.spark.examples.streaming.HdfsWordCount localdir We used Power7 (Big-Endian arch) and Redhat 6.4. Big-Endian is the main cause since the example ran successfully in another Power-based Little Endian setup. here is the exception stack and log: Spark Executor Command: /opt/ibm/java-ppc64-70//bin/java -cp /home/test1/spark-1.0.0-bin-hadoop2/lib::/home/test1/src/spark-1.0.0-bin-hadoop2/conf:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/ -XX:MaxPermSize=128m -Xdebug -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler 2 p7hvs7br16 4 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker app-20140604023054- 14/06/04 02:31:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 14/06/04 02:31:21 INFO spark.SecurityManager: Changing view acls to: test1,yifeng 14/06/04 02:31:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test1, yifeng) 14/06/04 02:31:22 INFO slf4j.Slf4jLogger: Slf4jLogger started 14/06/04 02:31:22 INFO Remoting: Starting remoting 14/06/04 02:31:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@p7hvs7br16:39658] 14/06/04 02:31:22 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutor@p7hvs7br16:39658] 14/06/04 02:31:22 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler 14/06/04 02:31:22 INFO worker.WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker 14/06/04 02:31:23 INFO worker.WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker 14/06/04 02:31:24 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver 14/06/04 02:31:24 INFO spark.SecurityManager: Changing view acls to: test1,yifeng 14/06/04 02:31:24 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test1, yifeng) 14/06/04 02:31:24 INFO slf4j.Slf4jLogger: Slf4jLogger started 14/06/04 02:31:24 INFO Remoting: Starting remoting 14/06/04 02:31:24 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@p7hvs7br16:58990] 14/06/04 02:31:24 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@p7hvs7br16:58990] 14/06/04 02:31:24 INFO spark.SparkEnv: Connecting to MapOutputTracker: akka.tcp://spark@9.186.105.141:60253/user/MapOutputTracker 14/06/04 02:31:25 INFO spark.SparkEnv: Connecting to BlockManagerMaster: akka.tcp://spark@9.186.105.141:60253/user/BlockManagerMaster 14/06/04 02:31:25 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20140604023125-3f61 14/06/04
[jira] [Commented] (SPARK-2018) Big-Endian (IBM Power7) Spark Serialization issue
[ https://issues.apache.org/jira/browse/SPARK-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14539529#comment-14539529 ] Apache Spark commented on SPARK-2018: - User 'tellison' has created a pull request for this issue: https://github.com/apache/spark/pull/6077 Big-Endian (IBM Power7) Spark Serialization issue -- Key: SPARK-2018 URL: https://issues.apache.org/jira/browse/SPARK-2018 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Environment: hardware : IBM Power7 OS:Linux version 2.6.32-358.el6.ppc64 (mockbu...@ppc-017.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Tue Jan 29 11:43:27 EST 2013 JDK: Java(TM) SE Runtime Environment (build pxp6470sr5-20130619_01(SR5)) IBM J9 VM (build 2.6, JRE 1.7.0 Linux ppc64-64 Compressed References 20130617_152572 (JIT enabled, AOT enabled) Hadoop:Hadoop-0.2.3-CDH5.0 Spark:Spark-1.0.0 or Spark-0.9.1 spark-env.sh: export JAVA_HOME=/opt/ibm/java-ppc64-70/ export SPARK_MASTER_IP=9.114.34.69 export SPARK_WORKER_MEMORY=1m export SPARK_CLASSPATH=/home/test1/spark-1.0.0-bin-hadoop2/lib export STANDALONE_SPARK_MASTER_HOST=9.114.34.69 #export SPARK_JAVA_OPTS=' -Xdebug -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n ' Reporter: Yanjie Gao We have an application run on Spark on Power7 System . But we meet an important issue about serialization. The example HdfsWordCount can meet the problem. ./bin/run-example org.apache.spark.examples.streaming.HdfsWordCount localdir We used Power7 (Big-Endian arch) and Redhat 6.4. Big-Endian is the main cause since the example ran successfully in another Power-based Little Endian setup. here is the exception stack and log: Spark Executor Command: /opt/ibm/java-ppc64-70//bin/java -cp /home/test1/spark-1.0.0-bin-hadoop2/lib::/home/test1/src/spark-1.0.0-bin-hadoop2/conf:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/ -XX:MaxPermSize=128m -Xdebug -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler 2 p7hvs7br16 4 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker app-20140604023054- 14/06/04 02:31:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 14/06/04 02:31:21 INFO spark.SecurityManager: Changing view acls to: test1,yifeng 14/06/04 02:31:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test1, yifeng) 14/06/04 02:31:22 INFO slf4j.Slf4jLogger: Slf4jLogger started 14/06/04 02:31:22 INFO Remoting: Starting remoting 14/06/04 02:31:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@p7hvs7br16:39658] 14/06/04 02:31:22 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutor@p7hvs7br16:39658] 14/06/04 02:31:22 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler 14/06/04 02:31:22 INFO worker.WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker 14/06/04 02:31:23 INFO worker.WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker 14/06/04 02:31:24 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver 14/06/04 02:31:24 INFO spark.SecurityManager: Changing view acls to: test1,yifeng 14/06/04 02:31:24 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test1, yifeng) 14/06/04 02:31:24 INFO slf4j.Slf4jLogger: Slf4jLogger started 14/06/04 02:31:24 INFO Remoting: Starting remoting 14/06/04 02:31:24 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@p7hvs7br16:58990] 14/06/04 02:31:24 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@p7hvs7br16:58990] 14/06/04 02:31:24 INFO spark.SparkEnv: Connecting to MapOutputTracker: akka.tcp://spark@9.186.105.141:60253/user/MapOutputTracker 14/06/04 02:31:25 INFO spark.SparkEnv: Connecting to BlockManagerMaster: akka.tcp://spark@9.186.105.141:60253/user/BlockManagerMaster 14/06/04
[jira] [Assigned] (SPARK-2018) Big-Endian (IBM Power7) Spark Serialization issue
[ https://issues.apache.org/jira/browse/SPARK-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2018: --- Assignee: Apache Spark Big-Endian (IBM Power7) Spark Serialization issue -- Key: SPARK-2018 URL: https://issues.apache.org/jira/browse/SPARK-2018 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Environment: hardware : IBM Power7 OS:Linux version 2.6.32-358.el6.ppc64 (mockbu...@ppc-017.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Tue Jan 29 11:43:27 EST 2013 JDK: Java(TM) SE Runtime Environment (build pxp6470sr5-20130619_01(SR5)) IBM J9 VM (build 2.6, JRE 1.7.0 Linux ppc64-64 Compressed References 20130617_152572 (JIT enabled, AOT enabled) Hadoop:Hadoop-0.2.3-CDH5.0 Spark:Spark-1.0.0 or Spark-0.9.1 spark-env.sh: export JAVA_HOME=/opt/ibm/java-ppc64-70/ export SPARK_MASTER_IP=9.114.34.69 export SPARK_WORKER_MEMORY=1m export SPARK_CLASSPATH=/home/test1/spark-1.0.0-bin-hadoop2/lib export STANDALONE_SPARK_MASTER_HOST=9.114.34.69 #export SPARK_JAVA_OPTS=' -Xdebug -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n ' Reporter: Yanjie Gao Assignee: Apache Spark We have an application run on Spark on Power7 System . But we meet an important issue about serialization. The example HdfsWordCount can meet the problem. ./bin/run-example org.apache.spark.examples.streaming.HdfsWordCount localdir We used Power7 (Big-Endian arch) and Redhat 6.4. Big-Endian is the main cause since the example ran successfully in another Power-based Little Endian setup. here is the exception stack and log: Spark Executor Command: /opt/ibm/java-ppc64-70//bin/java -cp /home/test1/spark-1.0.0-bin-hadoop2/lib::/home/test1/src/spark-1.0.0-bin-hadoop2/conf:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/ -XX:MaxPermSize=128m -Xdebug -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler 2 p7hvs7br16 4 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker app-20140604023054- 14/06/04 02:31:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 14/06/04 02:31:21 INFO spark.SecurityManager: Changing view acls to: test1,yifeng 14/06/04 02:31:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test1, yifeng) 14/06/04 02:31:22 INFO slf4j.Slf4jLogger: Slf4jLogger started 14/06/04 02:31:22 INFO Remoting: Starting remoting 14/06/04 02:31:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@p7hvs7br16:39658] 14/06/04 02:31:22 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutor@p7hvs7br16:39658] 14/06/04 02:31:22 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler 14/06/04 02:31:22 INFO worker.WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker 14/06/04 02:31:23 INFO worker.WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker 14/06/04 02:31:24 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver 14/06/04 02:31:24 INFO spark.SecurityManager: Changing view acls to: test1,yifeng 14/06/04 02:31:24 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test1, yifeng) 14/06/04 02:31:24 INFO slf4j.Slf4jLogger: Slf4jLogger started 14/06/04 02:31:24 INFO Remoting: Starting remoting 14/06/04 02:31:24 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@p7hvs7br16:58990] 14/06/04 02:31:24 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@p7hvs7br16:58990] 14/06/04 02:31:24 INFO spark.SparkEnv: Connecting to MapOutputTracker: akka.tcp://spark@9.186.105.141:60253/user/MapOutputTracker 14/06/04 02:31:25 INFO spark.SparkEnv: Connecting to BlockManagerMaster: akka.tcp://spark@9.186.105.141:60253/user/BlockManagerMaster 14/06/04 02:31:25 INFO storage.DiskBlockManager: Created local directory at
[jira] [Assigned] (SPARK-7551) Don't split by dot if within backticks for DataFrame attribute resolution
[ https://issues.apache.org/jira/browse/SPARK-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7551: --- Assignee: Apache Spark (was: Wenchen Fan) Don't split by dot if within backticks for DataFrame attribute resolution - Key: SPARK-7551 URL: https://issues.apache.org/jira/browse/SPARK-7551 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Priority: Critical DataFrame's resolve: {code} protected[sql] def resolve(colName: String): NamedExpression = { queryExecution.analyzed.resolve(colName.split("\\."), sqlContext.analyzer.resolver).getOrElse { throw new AnalysisException( s"""Cannot resolve column name "$colName" among (${schema.fieldNames.mkString(", ")})""") } } {code} We should not split the parts quoted by backticks (`). For example, `ab.cd`.`efg` should be split into two parts "ab.cd" and "efg". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7551) Don't split by dot if within backticks for DataFrame attribute resolution
[ https://issues.apache.org/jira/browse/SPARK-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7551: --- Assignee: Wenchen Fan (was: Apache Spark) Don't split by dot if within backticks for DataFrame attribute resolution - Key: SPARK-7551 URL: https://issues.apache.org/jira/browse/SPARK-7551 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Wenchen Fan Priority: Critical DataFrame's resolve: {code} protected[sql] def resolve(colName: String): NamedExpression = { queryExecution.analyzed.resolve(colName.split("\\."), sqlContext.analyzer.resolver).getOrElse { throw new AnalysisException( s"""Cannot resolve column name "$colName" among (${schema.fieldNames.mkString(", ")})""") } } {code} We should not split the parts quoted by backticks (`). For example, `ab.cd`.`efg` should be split into two parts "ab.cd" and "efg". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7551) Don't split by dot if within backticks for DataFrame attribute resolution
[ https://issues.apache.org/jira/browse/SPARK-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14539385#comment-14539385 ] Apache Spark commented on SPARK-7551: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/6074 Don't split by dot if within backticks for DataFrame attribute resolution - Key: SPARK-7551 URL: https://issues.apache.org/jira/browse/SPARK-7551 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Wenchen Fan Priority: Critical DataFrame's resolve: {code} protected[sql] def resolve(colName: String): NamedExpression = { queryExecution.analyzed.resolve(colName.split("\\."), sqlContext.analyzer.resolver).getOrElse { throw new AnalysisException( s"""Cannot resolve column name "$colName" among (${schema.fieldNames.mkString(", ")})""") } } {code} We should not split the parts quoted by backticks (`). For example, `ab.cd`.`efg` should be split into two parts "ab.cd" and "efg". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
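A hedged sketch of backtick-aware name splitting, written as a plain helper function rather than the resolver Spark actually ships; the function name is an assumption. The goal is that `ab.cd`.`efg` yields the two parts "ab.cd" and "efg":

{code}
// Illustrative helper; not the parser used by DataFrame.resolve.
def splitAttributeName(name: String): Seq[String] = {
  val parts = scala.collection.mutable.ArrayBuffer.empty[String]
  val current = new StringBuilder
  var inBackticks = false
  for (c <- name) c match {
    case '`' => inBackticks = !inBackticks               // toggle quoting, drop the backtick itself
    case '.' if !inBackticks => parts += current.toString; current.clear()
    case other => current += other
  }
  parts += current.toString
  parts.toSeq
}

// splitAttributeName("`ab.cd`.`efg`") == Seq("ab.cd", "efg")
// splitAttributeName("a.b.c")         == Seq("a", "b", "c")
{code}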
[jira] [Created] (SPARK-7561) Install Junit Attachment Plugin on Jenkins
Patrick Wendell created SPARK-7561: -- Summary: Install JUnit Attachment Plugin on Jenkins Key: SPARK-7561 URL: https://issues.apache.org/jira/browse/SPARK-7561 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Patrick Wendell Assignee: shane knapp As part of SPARK-7560 I'd like to just attach the test output file to the Jenkins build. This is nicer than requiring someone to have an SSH login to the master node. Currently we gzip the logs, copy them to the master, and then delete them on the worker. https://github.com/apache/spark/blob/master/dev/run-tests-jenkins#L132 Instead I think we can just gzip them and then have the attachment plugin add them to the build. But it would require installing this plug-in to see if we can get it working. [~shaneknapp] not sure how willing you are to install plug-ins on Jenkins, but this one would be awesome if it's doable and we can get it working. https://wiki.jenkins-ci.org/display/JENKINS/JUnit+Attachments+Plugin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7562) Improve error reporting for expression data type mismatch
Reynold Xin created SPARK-7562: -- Summary: Improve error reporting for expression data type mismatch Key: SPARK-7562 URL: https://issues.apache.org/jira/browse/SPARK-7562 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin There is currently no error reporting for expression data types in analysis. It would be great to have that in checkAnalysis. Ideally, it should be the responsibility of each Expression itself to specify the types it requires, and report errors that way. We would need to define a simple interface for that so each Expression can implement. The default implementation can just use the information provided by ExpectsInputTypes.expectedChildTypes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7548) Add explode expression
[ https://issues.apache.org/jira/browse/SPARK-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14539460#comment-14539460 ] Reynold Xin commented on SPARK-7548: cc [~cloud_fan] if you have time to do this today, try to take it over. Add explode expression -- Key: SPARK-7548 URL: https://issues.apache.org/jira/browse/SPARK-7548 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7534) Fix the Stage table when a stage is missing
[ https://issues.apache.org/jira/browse/SPARK-7534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-7534. Resolution: Pending Closed Fix Version/s: 1.4.0 Assignee: Shixiong Zhu Fix the Stage table when a stage is missing --- Key: SPARK-7534 URL: https://issues.apache.org/jira/browse/SPARK-7534 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor Fix For: 1.4.0 Just improved the Stage table when a stage is missing. Please see the screenshots in https://github.com/apache/spark/pull/6061 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7467) DAG visualization: handle checkpoint correctly
[ https://issues.apache.org/jira/browse/SPARK-7467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-7467. Resolution: Fixed Fix Version/s: 1.4.0 DAG visualization: handle checkpoint correctly -- Key: SPARK-7467 URL: https://issues.apache.org/jira/browse/SPARK-7467 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.4.0 We need to wrap RDD#doCheckpoint in a scope. Otherwise CheckpointRDDs may belong to other operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7558) Log test name when starting and finishing each test
Patrick Wendell created SPARK-7558: -- Summary: Log test name when starting and finishing each test Key: SPARK-7558 URL: https://issues.apache.org/jira/browse/SPARK-7558 Project: Spark Issue Type: Improvement Components: Tests Reporter: Patrick Wendell Assignee: Andrew Or Right now it's really tough to interpret testing output because logs for different tests are interspersed in the same unit-tests.log file. This makes it particularly hard to diagnose flaky tests. This would be much easier if we logged the test name before and after every test (e.g. Starting test XX, Finished test XX). Then you could get right to the logs. I think one way to do this might be to create a custom test fixture that logs the test class name and then mix that into every test suite /cc [~joshrosen] for his superb knowledge of Scalatest. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
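A minimal ScalaTest sketch of the kind of fixture described above; the trait name and message format are illustrative, not Spark's actual test utilities, and a real suite would write to the unit-tests.log logger rather than stdout:

{code}
import org.scalatest.{FunSuite, Outcome}

// Sketch of a mix-in that brackets every test with its name in the log.
trait LogTestName extends FunSuite {
  protected override def withFixture(test: NoArgTest): Outcome = {
    val name = s"${getClass.getSimpleName}: '${test.name}'"
    println(s"===== TEST START: $name =====")
    try super.withFixture(test)
    finally println(s"===== TEST END: $name =====")
  }
}

class ExampleSuite extends FunSuite with LogTestName {
  test("adds numbers") { assert(1 + 1 == 2) }
}
{code}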
[jira] [Created] (SPARK-7559) Bucketizer should include the right most boundary in the last bucket.
Xiangrui Meng created SPARK-7559: Summary: Bucketizer should include the right most boundary in the last bucket. Key: SPARK-7559 URL: https://issues.apache.org/jira/browse/SPARK-7559 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor Now we use special treatment for +inf. This could be simplified by including the largest split value in the last bucket. E.g., (x1, x2, x3) defines buckets [x1, x2) and [x2, x3]. This shouldn't affect user code much, and there are applications that need to include the right-most value. For example, we can bucketize ratings from 0 to 10 to bad, neutral, and good with splits 0, 4, 6, 10. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
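A hedged sketch of the proposed semantics, written as a plain binary-search helper (not the actual Bucketizer code): splits (x1, ..., xn) define buckets [x1, x2), ..., [x(n-1), xn], with the last bucket closed on the right so the largest split value is included:

{code}
// Illustrative bucket lookup with an inclusive right-most boundary.
def findBucket(splits: Array[Double], value: Double): Int = {
  require(splits.length >= 2, "need at least two split points")
  if (value == splits.last) {
    splits.length - 2                   // the largest split falls into the last bucket
  } else if (value < splits.head || value > splits.last) {
    throw new IllegalArgumentException(s"value $value is outside the splits range")
  } else {
    var lo = 0
    var hi = splits.length - 2
    while (lo < hi) {                   // invariant: the answer lies in [lo, hi]
      val mid = (lo + hi + 1) / 2
      if (value >= splits(mid)) lo = mid else hi = mid - 1
    }
    lo
  }
}

// With splits Array(0.0, 4.0, 6.0, 10.0): 3.9 -> bucket 0, 5.0 -> bucket 1, 10.0 -> bucket 2.
{code}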
[jira] [Comment Edited] (SPARK-7548) Add explode expression
[ https://issues.apache.org/jira/browse/SPARK-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14539460#comment-14539460 ] Reynold Xin edited comment on SPARK-7548 at 5/12/15 7:48 AM: - cc [~cloud_fan] if you have time to do this today, try to take it over. Basically creating an explode function in functions.scala and functions.py. was (Author: rxin): cc [~cloud_fan] if you have time to do this today, try to take it over. Add explode expression -- Key: SPARK-7548 URL: https://issues.apache.org/jira/browse/SPARK-7548 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7560) Make flaky tests easier to debug
Patrick Wendell created SPARK-7560: -- Summary: Make flaky tests easier to debug Key: SPARK-7560 URL: https://issues.apache.org/jira/browse/SPARK-7560 Project: Spark Issue Type: New Feature Components: Project Infra, Tests Reporter: Patrick Wendell Right now it's really hard for people to even get the logs from a flaky test. Once you get the logs, it's very difficult to figure out what logs are associated with what tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4128) Create instructions on fully building Spark in Intellij
[ https://issues.apache.org/jira/browse/SPARK-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14539480#comment-14539480 ] Sean Owen commented on SPARK-4128: -- It's still there... https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ The previous text was just outdated. Create instructions on fully building Spark in Intellij --- Key: SPARK-4128 URL: https://issues.apache.org/jira/browse/SPARK-4128 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.2.0 With some of our more complicated modules, I'm not sure whether Intellij correctly understands all source locations. Also, we might require specifying some profiles for the build to work directly. We should document clearly how to start with vanilla Spark master and get the entire thing building in Intellij. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7563) OutputCommitCoordinator.stop() should only be executed in driver
Hailong Wen created SPARK-7563: -- Summary: OutputCommitCoordinator.stop() should only be executed in driver Key: SPARK-7563 URL: https://issues.apache.org/jira/browse/SPARK-7563 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Environment: Red Hat Enterprise Linux Server release 7.0 (Maipo) Spark 1.3.1 Release Reporter: Hailong Wen I am from IBM Platform Symphony team and we are integrating Spark 1.3.1 with EGO (a resource management product). In EGO we uses fine-grained dynamic allocation policy, and each Executor will exit after its tasks are all done. When testing *spark-shell*, we find that when executor of first job exit, it will stop OutputCommitCoordinator, which result in all future jobs failing. Details are as follows: We got the following error in executor when submitting job in *spark-shell* the second time (the first job submission is successful): {noformat} 15/05/11 04:02:31 INFO spark.util.AkkaUtils: Connecting to OutputCommitCoordinator: akka.tcp://sparkDriver@whlspark01:50452/user/OutputCommitCoordinator Exception in thread main akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@whlspark01:50452/), Path(/user/OutputCommitCoordinator)] at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65) at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59) at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58) at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74) at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110) at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73) at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267) at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:89) at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {noformat} And in driver side, we see a log message telling that the OutputCommitCoordinator is stopped after the first submission: {noformat} 15/05/11 04:01:23 INFO 
spark.scheduler.OutputCommitCoordinator$OutputCommitCoordinatorActor: OutputCommitCoordinator stopped! {noformat} We examine the code of OutputCommitCoordinator and find that the executor reuses the ref of the driver's OutputCommitCoordinatorActor. So when an executor exits, it will eventually call SparkEnv.stop(): {noformat} private[spark] def stop() { isStopped = true pythonWorkers.foreach { case (key, worker) => worker.stop() } Option(httpFileServer).foreach(_.stop()) mapOutputTracker.stop() shuffleManager.stop() broadcastManager.stop() blockManager.stop() blockManager.master.stop() metricsSystem.stop() outputCommitCoordinator.stop() <--- actorSystem.shutdown() .. {noformat} and in OutputCommitCoordinator.stop(): {noformat} def stop(): Unit = synchronized { coordinatorActor.foreach(_ ! StopCoordinator) coordinatorActor = None authorizedCommittersByStage.clear() } {noformat} We now work around this problem by adding an isDriver attribute in OutputCommitCoordinator and
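The reported workaround, sketched in hedged form: guard the coordinator shutdown behind an isDriver flag so that executors, which share the driver's actor ref, do not stop it. The class shape and field names below are assumptions, not the actual Spark patch:

{code}
// Hypothetical sketch of the guard; the real fix in Spark may differ.
class OutputCommitCoordinatorSketch(isDriver: Boolean) {
  // Stand-in for the Akka actor ref that executors share with the driver.
  private var coordinatorActor: Option[AnyRef] = None

  def stop(): Unit = synchronized {
    // Only the driver owns the coordinator actor; an exiting executor must not stop it.
    if (isDriver) {
      coordinatorActor = None
      // The real class would also send StopCoordinator and clear authorizedCommittersByStage here.
    }
  }
}
{code}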
[jira] [Commented] (SPARK-7562) Improve error reporting for expression data type mismatch
[ https://issues.apache.org/jira/browse/SPARK-7562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14539503#comment-14539503 ] Reynold Xin commented on SPARK-7562: This is related to https://issues.apache.org/jira/browse/SPARK-6444 and also there is one past attempt at this problem: https://github.com/apache/spark/pull/4685 #4685 pull request only marks expressions as unresolved, but doesn't report any error to users (e.g. we should explain why 1 + date is invalid). cc [~kai-zeng] Improve error reporting for expression data type mismatch - Key: SPARK-7562 URL: https://issues.apache.org/jira/browse/SPARK-7562 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin There is currently no error reporting for expression data types in analysis (we rely on resolved for that, which doesn't provide great error messages for types). It would be great to have that in checkAnalysis. Ideally, it should be the responsibility of each Expression itself to specify the types it requires, and report errors that way. We would need to define a simple interface for that so each Expression can implement. The default implementation can just use the information provided by ExpectsInputTypes.expectedChildTypes. cc [~marmbrus] what we discussed offline today. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7500) DAG visualization: cluster name bleeds beyond the cluster
[ https://issues.apache.org/jira/browse/SPARK-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7500: --- Assignee: Andrew Or (was: Apache Spark) DAG visualization: cluster name bleeds beyond the cluster - Key: SPARK-7500 URL: https://issues.apache.org/jira/browse/SPARK-7500 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Minor Attachments: long names.png This happens only for long names. See screenshot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7485) Remove python artifacts from the assembly jar
[ https://issues.apache.org/jira/browse/SPARK-7485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-7485. Resolution: Fixed Remove python artifacts from the assembly jar - Key: SPARK-7485 URL: https://issues.apache.org/jira/browse/SPARK-7485 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.4.0 Reporter: Thomas Graves Assignee: Marcelo Vanzin Fix For: 1.4.0 We change it so that we distributed the python files via a zip file in SPARK-6869. With that we should remove the python files from the assembly jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-7485) Remove python artifacts from the assembly jar
[ https://issues.apache.org/jira/browse/SPARK-7485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reopened SPARK-7485: -- Remove python artifacts from the assembly jar - Key: SPARK-7485 URL: https://issues.apache.org/jira/browse/SPARK-7485 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.4.0 Reporter: Thomas Graves Assignee: Marcelo Vanzin Fix For: 1.4.0 We change it so that we distributed the python files via a zip file in SPARK-6869. With that we should remove the python files from the assembly jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7485) Remove python artifacts from the assembly jar
[ https://issues.apache.org/jira/browse/SPARK-7485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-7485. Resolution: Pending Closed Fix Version/s: 1.4.0 Assignee: Marcelo Vanzin Remove python artifacts from the assembly jar - Key: SPARK-7485 URL: https://issues.apache.org/jira/browse/SPARK-7485 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.4.0 Reporter: Thomas Graves Assignee: Marcelo Vanzin Fix For: 1.4.0 We change it so that we distributed the python files via a zip file in SPARK-6869. With that we should remove the python files from the assembly jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7559) Bucketizer should include the right most boundary in the last bucket.
[ https://issues.apache.org/jira/browse/SPARK-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7559: --- Assignee: Apache Spark (was: Xiangrui Meng) Bucketizer should include the right most boundary in the last bucket. - Key: SPARK-7559 URL: https://issues.apache.org/jira/browse/SPARK-7559 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Apache Spark Priority: Minor Now we use special treatment for +inf. This could be simplified by including the largest split value in the last bucket. E.g., (x1, x2, x3) defines buckets [x1, x2) and [x2, x3]. This shouldn't affect user code much, and there are applications that need to include the right-most value. For example, we can bucketize ratings from 0 to 10 to bad, neutral, and good with splits 0, 4, 6, 10. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7559) Bucketizer should include the right most boundary in the last bucket.
[ https://issues.apache.org/jira/browse/SPARK-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7559: --- Assignee: Xiangrui Meng (was: Apache Spark) Bucketizer should include the right most boundary in the last bucket. - Key: SPARK-7559 URL: https://issues.apache.org/jira/browse/SPARK-7559 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor Now we use special treatment for +inf. This could be simplified by including the largest split value in the last bucket. E.g., (x1, x2, x3) defines buckets [x1, x2) and [x2, x3]. This shouldn't affect user code much, and there are applications that need to include the right-most value. For example, we can bucketize ratings from 0 to 10 to bad, neutral, and good with splits 0, 4, 6, 10. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7559) Bucketizer should include the right most boundary in the last bucket.
[ https://issues.apache.org/jira/browse/SPARK-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14539465#comment-14539465 ] Apache Spark commented on SPARK-7559: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/6075 Bucketizer should include the right most boundary in the last bucket. - Key: SPARK-7559 URL: https://issues.apache.org/jira/browse/SPARK-7559 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor Now we use special treatment for +inf. This could be simplified by including the largest split value in the last bucket. E.g., (x1, x2, x3) defines buckets [x1, x2) and [x2, x3]. This shouldn't affect user code much, and there are applications that need to include the right-most value. For example, we can bucketize ratings from 0 to 10 to bad, neutral, and good with splits 0, 4, 6, 10. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7558) Log test name when starting and finishing each test
[ https://issues.apache.org/jira/browse/SPARK-7558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7558: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-7560 Log test name when starting and finishing each test --- Key: SPARK-7558 URL: https://issues.apache.org/jira/browse/SPARK-7558 Project: Spark Issue Type: Sub-task Components: Tests Reporter: Patrick Wendell Assignee: Andrew Or Right now it's really tough to interpret testing output because logs for different tests are interspersed in the same unit-tests.log file. This makes it particularly hard to diagnose flaky tests. This would be much easier if we logged the test name before and after every test (e.g. Starting test XX, Finished test XX). Then you could get right to the logs. I think one way to do this might be to create a custom test fixture that logs the test class name and then mix that into every test suite /cc [~joshrosen] for his superb knowledge of Scalatest. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4128) Create instructions on fully building Spark in Intellij
[ https://issues.apache.org/jira/browse/SPARK-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540217#comment-14540217 ] Christian Kadner edited comment on SPARK-4128 at 5/12/15 5:04 PM: -- Hi Sean, while there is still a section covering the IntelliJ setup, what is missing are these steps (or an updated version of it) which have to be taken in order to get a successfully Make of the project. I needed to do some version of it for 1.3.0, 1.3.1, 1.4.0. part of Patrick's deleted paragraph - start ... At the top of the leftmost pane, make sure the Project/Packages selector is set to Packages. Right click on any package and click “Open Module Settings” - you will be able to modify any of the modules here. A few of the modules need to be modified slightly from the default import. Add sources to the following modules: Under “Sources” tab add a source on the right. spark-hive: add v0.13.1/src/main/scala spark-hive-thriftserver v0.13.1/src/main/scala spark-repl: scala-2.10/src/main/scala and scala-2.10/src/test/scala For spark-yarn click “Add content root” and navigate in the filesystem to yarn/common directory of Spark ... part of Patrick's deleted paragraph - end I suggest to add an updated version of that to the wiki, since some of the Modules are setup in a way that similar non-obvious manual steps are required to make them compile. was (Author: ckadner): Hi Sean, while there is still a section covering the IntelliJ setup, what is missing are these steps, or an updated version of it, which I had to do for 1.3.0, 1.3.1, 1.4.0 in order to get a successfully Make of the project. part of Patrick's deleted paragraph - start ... At the top of the leftmost pane, make sure the Project/Packages selector is set to Packages. Right click on any package and click “Open Module Settings” - you will be able to modify any of the modules here. A few of the modules need to be modified slightly from the default import. Add sources to the following modules: Under “Sources” tab add a source on the right. spark-hive: add v0.13.1/src/main/scala spark-hive-thriftserver v0.13.1/src/main/scala spark-repl: scala-2.10/src/main/scala and scala-2.10/src/test/scala For spark-yarn click “Add content root” and navigate in the filesystem to yarn/common directory of Spark ... part of Patrick's deleted paragraph - end I suggest to add an updated version of that to the wiki, since some of the Modules are setup in a way that similar non-obvious manual steps are required to make them compile. Create instructions on fully building Spark in Intellij --- Key: SPARK-4128 URL: https://issues.apache.org/jira/browse/SPARK-4128 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.2.0 With some of our more complicated modules, I'm not sure whether Intellij correctly understands all source locations. Also, we might require specifying some profiles for the build to work directly. We should document clearly how to start with vanilla Spark master and get the entire thing building in Intellij. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6749) Make metastore client robust to underlying socket connection loss
[ https://issues.apache.org/jira/browse/SPARK-6749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6749: Assignee: (was: Yin Huai) Make metastore client robust to underlying socket connection loss - Key: SPARK-6749 URL: https://issues.apache.org/jira/browse/SPARK-6749 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Priority: Critical Right now, if metastore get restarted, we have to restart the driver to get a new connection to the metastore client because the underlying socket connection is gone. We should make metastore client robust to it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7566) HiveContext.analyzer cannot be overridden
Santiago M. Mola created SPARK-7566: --- Summary: HiveContext.analyzer cannot be overridden Key: SPARK-7566 URL: https://issues.apache.org/jira/browse/SPARK-7566 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Santiago M. Mola Trying to override HiveContext.analyzer will give the following compilation error: {code} Error:(51, 36) overriding lazy value analyzer in class HiveContext of type org.apache.spark.sql.catalyst.analysis.Analyzer{val extendedResolutionRules: List[org.apache.spark.sql.catalyst.rules.Rule[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]]}; lazy value analyzer has incompatible type override protected[sql] lazy val analyzer: Analyzer = { ^ {code} That is because the type was inadvertently changed when the explicit return type declaration was omitted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
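The fix the report implies, sketched with hypothetical names (not the actual HiveContext source): declare the lazy val with an explicit Analyzer return type so the inferred refinement type does not leak into the signature and subclasses can override it:

{code}
// Simplified sketch with assumed names.
class Analyzer

class BaseContext {
  // Without the explicit ": Analyzer" ascription, the inferred type would be the
  // anonymous refinement (Analyzer { val extendedResolutionRules: ... }), which
  // subclasses cannot match when overriding.
  protected lazy val analyzer: Analyzer = new Analyzer {
    val extendedResolutionRules: List[String] = Nil   // illustrative extra member
  }
}

class CustomContext extends BaseContext {
  override protected lazy val analyzer: Analyzer = new Analyzer
}
{code}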
[jira] [Resolved] (SPARK-6876) DataFrame.na.replace value support for Python
[ https://issues.apache.org/jira/browse/SPARK-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-6876. Resolution: Pending Closed Fix Version/s: 1.4.0 DataFrame.na.replace value support for Python - Key: SPARK-6876 URL: https://issues.apache.org/jira/browse/SPARK-6876 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang Fix For: 1.4.0 Scala/Java support is in. We should provide the Python version, similar to what Pandas supports. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5182) Partitioning support for tables created by the data source API
[ https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-5182. --- Resolution: Pending Closed Fix Version/s: 1.4.0 Issue resolved by pull request 5526 [https://github.com/apache/spark/pull/5526] Partitioning support for tables created by the data source API -- Key: SPARK-5182 URL: https://issues.apache.org/jira/browse/SPARK-5182 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Cheng Lian Priority: Blocker Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7380) Python: Transformer/Estimator should be copyable
[ https://issues.apache.org/jira/browse/SPARK-7380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7380: --- Assignee: Apache Spark (was: Joseph K. Bradley) Python: Transformer/Estimator should be copyable Key: SPARK-7380 URL: https://issues.apache.org/jira/browse/SPARK-7380 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Joseph K. Bradley Assignee: Apache Spark Same as [SPARK-5956] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7380) Python: Transformer/Estimator should be copyable
[ https://issues.apache.org/jira/browse/SPARK-7380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540398#comment-14540398 ] Apache Spark commented on SPARK-7380: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/6088 Python: Transformer/Estimator should be copyable Key: SPARK-7380 URL: https://issues.apache.org/jira/browse/SPARK-7380 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Same as [SPARK-5956] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7380) Python: Transformer/Estimator should be copyable
[ https://issues.apache.org/jira/browse/SPARK-7380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7380: --- Assignee: Joseph K. Bradley (was: Apache Spark) Python: Transformer/Estimator should be copyable Key: SPARK-7380 URL: https://issues.apache.org/jira/browse/SPARK-7380 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Same as [SPARK-5956] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6749) Make metastore client robust to underlying socket connection loss
[ https://issues.apache.org/jira/browse/SPARK-6749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6749: Assignee: Yin Huai Make metastore client robust to underlying socket connection loss - Key: SPARK-6749 URL: https://issues.apache.org/jira/browse/SPARK-6749 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Right now, if the metastore gets restarted, we have to restart the driver to get a new connection to the metastore client because the underlying socket connection is gone. We should make the metastore client robust to this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6749) Make metastore client robust to underlying socket connection loss
[ https://issues.apache.org/jira/browse/SPARK-6749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6749: Priority: Critical (was: Major) Make metastore client robust to underlying socket connection loss - Key: SPARK-6749 URL: https://issues.apache.org/jira/browse/SPARK-6749 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Priority: Critical Right now, if the metastore gets restarted, we have to restart the driver to get a new connection to the metastore client because the underlying socket connection is gone. We should make the metastore client robust to this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
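A hedged sketch of the reconnect behaviour the SPARK-6749 description asks for (an illustrative helper only, not Spark's actual Hive client code): rebuild the client and retry once when the underlying Thrift transport fails, instead of forcing a driver restart.
{code}
import org.apache.thrift.transport.TTransportException

// Retry a metastore call once with a freshly built client if the socket-level
// transport has gone away. `makeClient` and `call` are assumed, illustrative hooks.
def callWithReconnect[C, T](makeClient: () => C)(call: C => T): T = {
  val client = makeClient()
  try {
    call(client)
  } catch {
    case _: TTransportException =>
      call(makeClient()) // the old socket is gone; build a fresh client and retry
  }
}
{code}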
[jira] [Commented] (SPARK-6980) Akka timeout exceptions indicate which conf controls them
[ https://issues.apache.org/jira/browse/SPARK-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540335#comment-14540335 ] Harsh Gupta commented on SPARK-6980: [~bryanc] can you update us on the progress so that we can share the workload? Akka timeout exceptions indicate which conf controls them - Key: SPARK-6980 URL: https://issues.apache.org/jira/browse/SPARK-6980 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Imran Rashid Assignee: Harsh Gupta Priority: Minor Labels: starter Attachments: Spark-6980-Test.scala If you hit one of the akka timeouts, you just get an exception like {code} java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] {code} The exception doesn't indicate how to change the timeout, though there is usually (always?) a corresponding setting in {{SparkConf}}. It would be nice if the exception included the relevant setting. I think this should be pretty easy to do -- we just need to create something like a {{NamedTimeout}}. It would have its own {{await}} method that catches the akka timeout and throws its own exception. We should change {{RpcUtils.askTimeout}} and {{RpcUtils.lookupTimeout}} to always give a {{NamedTimeout}}, so we can be sure that anytime we have a timeout, we get a better exception. Given the latest refactoring to the rpc layer, this needs to be done in both {{AkkaUtils}} and {{AkkaRpcEndpoint}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
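A minimal sketch of the {{NamedTimeout}} idea described above, assuming illustrative names and conf keys rather than the actual Spark implementation: the timeout remembers which setting produced it, so the rethrown exception can point at that setting.
{code}
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Awaitable}
import scala.concurrent.duration._

// A timeout that knows which SparkConf key controls it.
case class NamedTimeout(duration: FiniteDuration, confKey: String) {
  def await[T](awaitable: Awaitable[T]): T =
    try {
      Await.result(awaitable, duration)
    } catch {
      case _: TimeoutException =>
        // Rethrow with a message that names the controlling setting.
        throw new TimeoutException(
          s"Futures timed out after [$duration]. This timeout is controlled by $confKey.")
    }
}

// Usage, e.g.: NamedTimeout(30.seconds, "spark.akka.askTimeout").await(someFuture)
{code}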
[jira] [Commented] (SPARK-4128) Create instructions on fully building Spark in Intellij
[ https://issues.apache.org/jira/browse/SPARK-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540217#comment-14540217 ] Christian Kadner commented on SPARK-4128: - Hi Sean, while there is still a section covering the IntelliJ setup, what is missing are these steps, or an updated version of it, which I had to do for 1.3.0, 1.3.1, 1.4.0 in order to get a successfully Make of the project. part of Patrick's deleted paragraph - start ... At the top of the leftmost pane, make sure the Project/Packages selector is set to Packages. Right click on any package and click “Open Module Settings” - you will be able to modify any of the modules here. A few of the modules need to be modified slightly from the default import. Add sources to the following modules: Under “Sources” tab add a source on the right. spark-hive: add v0.13.1/src/main/scala spark-hive-thriftserver v0.13.1/src/main/scala spark-repl: scala-2.10/src/main/scala and scala-2.10/src/test/scala For spark-yarn click “Add content root” and navigate in the filesystem to yarn/common directory of Spark part of Patrick's deleted paragraph - end I suggest to add an updated version of that to the wiki, since some of the Modules are setup in a way that similar non-obvious manual steps are required to make them compile. Create instructions on fully building Spark in Intellij --- Key: SPARK-4128 URL: https://issues.apache.org/jira/browse/SPARK-4128 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.2.0 With some of our more complicated modules, I'm not sure whether Intellij correctly understands all source locations. Also, we might require specifying some profiles for the build to work directly. We should document clearly how to start with vanilla Spark master and get the entire thing building in Intellij. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4128) Create instructions on fully building Spark in Intellij
[ https://issues.apache.org/jira/browse/SPARK-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540217#comment-14540217 ] Christian Kadner edited comment on SPARK-4128 at 5/12/15 5:02 PM: -- Hi Sean, while there is still a section covering the IntelliJ setup, what is missing are these steps, or an updated version of it, which I had to do for 1.3.0, 1.3.1, 1.4.0 in order to get a successfully Make of the project. part of Patrick's deleted paragraph - start ... At the top of the leftmost pane, make sure the Project/Packages selector is set to Packages. Right click on any package and click “Open Module Settings” - you will be able to modify any of the modules here. A few of the modules need to be modified slightly from the default import. Add sources to the following modules: Under “Sources” tab add a source on the right. spark-hive: add v0.13.1/src/main/scala spark-hive-thriftserver v0.13.1/src/main/scala spark-repl: scala-2.10/src/main/scala and scala-2.10/src/test/scala For spark-yarn click “Add content root” and navigate in the filesystem to yarn/common directory of Spark ... part of Patrick's deleted paragraph - end I suggest to add an updated version of that to the wiki, since some of the Modules are setup in a way that similar non-obvious manual steps are required to make them compile. was (Author: ckadner): Hi Sean, while there is still a section covering the IntelliJ setup, what is missing are these steps, or an updated version of it, which I had to do for 1.3.0, 1.3.1, 1.4.0 in order to get a successfully Make of the project. part of Patrick's deleted paragraph - start ... At the top of the leftmost pane, make sure the Project/Packages selector is set to Packages. Right click on any package and click “Open Module Settings” - you will be able to modify any of the modules here. A few of the modules need to be modified slightly from the default import. Add sources to the following modules: Under “Sources” tab add a source on the right. spark-hive: add v0.13.1/src/main/scala spark-hive-thriftserver v0.13.1/src/main/scala spark-repl: scala-2.10/src/main/scala and scala-2.10/src/test/scala For spark-yarn click “Add content root” and navigate in the filesystem to yarn/common directory of Spark part of Patrick's deleted paragraph - end I suggest to add an updated version of that to the wiki, since some of the Modules are setup in a way that similar non-obvious manual steps are required to make them compile. Create instructions on fully building Spark in Intellij --- Key: SPARK-4128 URL: https://issues.apache.org/jira/browse/SPARK-4128 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.2.0 With some of our more complicated modules, I'm not sure whether Intellij correctly understands all source locations. Also, we might require specifying some profiles for the build to work directly. We should document clearly how to start with vanilla Spark master and get the entire thing building in Intellij. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7525) Could not read data from write ahead log record when Receiver failed and WAL is stored in Tachyon
[ https://issues.apache.org/jira/browse/SPARK-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540325#comment-14540325 ] Dibyendu Bhattacharya commented on SPARK-7525: -- I guess this is something to do with the lack of Tachyon Append Support. java.lang.IllegalStateException: File exists and there is no append support! at org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:33) at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.org$apache$spark$streaming$util$FileBasedWriteAheadLogWriter$$stream$lzycompute(FileBasedWriteAheadLogWriter.scala:33) at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.org$apache$spark$streaming$util$FileBasedWriteAheadLogWriter$$stream(FileBasedWriteAheadLogWriter.scala:33) at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.init(FileBasedWriteAheadLogWriter.scala:41) at org.apache.spark.streaming.util.FileBasedWriteAheadLog.getLogWriter(FileBasedWriteAheadLog.scala:194) at org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:81) at org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:44) at org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler$$anonfun$5.apply(ReceivedBlockHandler.scala:178) at org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler$$anonfun$5.apply(ReceivedBlockHandler.scala:178) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Could not read data from write ahead log record when Receiver failed and WAL is stored in Tachyon - Key: SPARK-7525 URL: https://issues.apache.org/jira/browse/SPARK-7525 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Environment: AWS , Spark Streaming 1.4 with Tachyon 0.6.4 Reporter: Dibyendu Bhattacharya I was testing Fault Tolerant aspect of Spark Streaming when Checkpoint directory is stored in Tachyon. Spark Streaming is able to recover from Driver failure , but when Receiver Failed, Spark Streaming not able read the WAL files written by failed Receiver. Below is exception when Receiver is failed . INFO : org.apache.spark.scheduler.DAGScheduler - Executor lost: 2 (epoch 1) INFO : org.apache.spark.storage.BlockManagerMasterEndpoint - Trying to remove executor 2 from BlockManagerMaster. 
INFO : org.apache.spark.storage.BlockManagerMasterEndpoint - Removing block manager BlockManagerId(2, 10.252.5.54, 45789) INFO : org.apache.spark.storage.BlockManagerMaster - Removed 2 successfully in removeExecutor INFO : org.apache.spark.streaming.scheduler.ReceiverTracker - Registered receiver for stream 2 from 10.252.5.62:47255 WARN : org.apache.spark.scheduler.TaskSetManager - Lost task 2.1 in stage 103.0 (TID 421, 10.252.5.62): org.apache.spark.SparkException: Could not read data from write ahead log record FileBasedWriteAheadLogSegment(tachyon-ft://10.252.5.113:19998/tachyon/checkpoint/receivedData/2/log-1431341091711-1431341151711,645603894,10891919) at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:144) at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:168) at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:168) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:168) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at
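The stack trace in the comment above points at the stream-creation path in {{HdfsUtils}}. A hedged sketch of the kind of check involved (not the actual HdfsUtils code; the conf key is illustrative): when the WAL segment file already exists, the writer can only proceed if the underlying file system supports append, which the Tachyon-backed path here apparently does not.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataOutputStream, Path}

// Open a stream for a WAL segment: create it if absent, append if the FS allows it.
def getOutputStream(pathStr: String, conf: Configuration): FSDataOutputStream = {
  val path = new Path(pathStr)
  val fs = path.getFileSystem(conf)
  if (fs.exists(path)) {
    if (conf.getBoolean("hdfs.append.support", false)) fs.append(path)
    else throw new IllegalStateException("File exists and there is no append support!")
  } else {
    fs.create(path)
  }
}
{code}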
[jira] [Assigned] (SPARK-6258) Python MLlib API missing items: Clustering
[ https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6258: --- Assignee: Apache Spark Python MLlib API missing items: Clustering -- Key: SPARK-6258 URL: https://issues.apache.org/jira/browse/SPARK-6258 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Apache Spark This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task. KMeans * setEpsilon * setInitializationSteps KMeansModel * computeCost * k GaussianMixture * setInitialModel GaussianMixtureModel * k Completely missing items which should be fixed in separate JIRAs (which have been created and linked to the umbrella JIRA) * LDA * PowerIterationClustering * StreamingKMeans -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6258) Python MLlib API missing items: Clustering
[ https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540268#comment-14540268 ] Apache Spark commented on SPARK-6258: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/6087 Python MLlib API missing items: Clustering -- Key: SPARK-6258 URL: https://issues.apache.org/jira/browse/SPARK-6258 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task. KMeans * setEpsilon * setInitializationSteps KMeansModel * computeCost * k GaussianMixture * setInitialModel GaussianMixtureModel * k Completely missing items which should be fixed in separate JIRAs (which have been created and linked to the umbrella JIRA) * LDA * PowerIterationClustering * StreamingKMeans -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6258) Python MLlib API missing items: Clustering
[ https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6258: --- Assignee: (was: Apache Spark) Python MLlib API missing items: Clustering -- Key: SPARK-6258 URL: https://issues.apache.org/jira/browse/SPARK-6258 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task. KMeans * setEpsilon * setInitializationSteps KMeansModel * computeCost * k GaussianMixture * setInitialModel GaussianMixtureModel * k Completely missing items which should be fixed in separate JIRAs (which have been created and linked to the umbrella JIRA) * LDA * PowerIterationClustering * StreamingKMeans -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4128) Create instructions on fully building Spark in Intellij
[ https://issues.apache.org/jira/browse/SPARK-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540243#comment-14540243 ] Sean Owen commented on SPARK-4128: -- Some of this isn't correct, like the YARN bit. Some of this isn't applicable to all users, like those that don't need Hive. That's why they were removed as required setup. Create instructions on fully building Spark in Intellij --- Key: SPARK-4128 URL: https://issues.apache.org/jira/browse/SPARK-4128 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.2.0 With some of our more complicated modules, I'm not sure whether Intellij correctly understands all source locations. Also, we might require specifying some profiles for the build to work directly. We should document clearly how to start with vanilla Spark master and get the entire thing building in Intellij. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4128) Create instructions on fully building Spark in Intellij
[ https://issues.apache.org/jira/browse/SPARK-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540410#comment-14540410 ] Sean Owen commented on SPARK-4128: -- I don't think I had to do anything special to get Hive working (it's enabled for me). Are you certain that it doesn't recognize the source folder? the source should be in the place the build says it is and IJ understands that. That said there have been all kinds of wild glitches over time. If it is really required from a clean checkout / new project, well, yeah that can be doc'ed but I also want to fix it! Yeah the Scala 2.11/10 support is handled outside of any of the build scripts. It should work either way if you run the script to switch between them but certainly needs a reimport. Create instructions on fully building Spark in Intellij --- Key: SPARK-4128 URL: https://issues.apache.org/jira/browse/SPARK-4128 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.2.0 With some of our more complicated modules, I'm not sure whether Intellij correctly understands all source locations. Also, we might require specifying some profiles for the build to work directly. We should document clearly how to start with vanilla Spark master and get the entire thing building in Intellij. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7567) Migrating Parquet data source to FSBasedRelation
Cheng Lian created SPARK-7567: - Summary: Migrating Parquet data source to FSBasedRelation Key: SPARK-7567 URL: https://issues.apache.org/jira/browse/SPARK-7567 Project: Spark Issue Type: Bug Reporter: Cheng Lian Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7569) Improve error for binary expressions
Michael Armbrust created SPARK-7569: --- Summary: Improve error for binary expressions Key: SPARK-7569 URL: https://issues.apache.org/jira/browse/SPARK-7569 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical This is not a great error: {code} scala> Seq((1,1)).toDF("a", "b").select(lit(1) + new java.sql.Date(1)) org.apache.spark.sql.AnalysisException: invalid expression (1 + 0) between Literal 1, IntegerType and Literal 0, DateType; {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
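A hedged sketch of the direction such an improvement could take, using assumed trait and method names rather than Catalyst's actual API: each binary expression states the input types it expects, so the analyzer can explain the mismatch instead of printing the opaque "invalid expression" message above.
{code}
// Illustrative only: a mixin that lets an expression describe its expected input
// types and produce a targeted mismatch message.
trait ExpectsInputTypesLike {
  def symbol: String                 // e.g. "+"
  def expectedTypes: Seq[String]     // e.g. Seq("NumericType", "NumericType")
  def mismatchMessage(actualTypes: Seq[String]): String =
    s"cannot resolve '$symbol' due to data type mismatch: expected " +
      s"(${expectedTypes.mkString(", ")}) but got (${actualTypes.mkString(", ")})"
}
{code}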
[jira] [Assigned] (SPARK-7567) Migrating Parquet data source to FSBasedRelation
[ https://issues.apache.org/jira/browse/SPARK-7567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7567: --- Assignee: Cheng Lian (was: Apache Spark) Migrating Parquet data source to FSBasedRelation Key: SPARK-7567 URL: https://issues.apache.org/jira/browse/SPARK-7567 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7568) ml.LogisticRegression doesn't output the right prediction
[ https://issues.apache.org/jira/browse/SPARK-7568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7568: - Description: `bin/spark-submit examples/src/main/python/ml/simple_text_classification_pipeline.py` {code} Row(id=4, text=u'spark i j k', words=[u'spark', u'i', u'j', u'k'], features=SparseVector(262144, {105: 1.0, 106: 1.0, 107: 1.0, 62173: 1.0}), rawPrediction=DenseVector([0.1629, -0.1629]), probability=DenseVector([0.5406, 0.4594]), prediction=0.0) Row(id=5, text=u'l m n', words=[u'l', u'm', u'n'], features=SparseVector(262144, {108: 1.0, 109: 1.0, 110: 1.0}), rawPrediction=DenseVector([2.6407, -2.6407]), probability=DenseVector([0.9334, 0.0666]), prediction=0.0) Row(id=6, text=u'mapreduce spark', words=[u'mapreduce', u'spark'], features=SparseVector(262144, {62173: 1.0, 140738: 1.0}), rawPrediction=DenseVector([1.2651, -1.2651]), probability=DenseVector([0.7799, 0.2201]), prediction=0.0) Row(id=7, text=u'apache hadoop', words=[u'apache', u'hadoop'], features=SparseVector(262144, {128334: 1.0, 134181: 1.0}), rawPrediction=DenseVector([3.7429, -3.7429]), probability=DenseVector([0.9769, 0.0231]), prediction=0.0) {code} In Scala {code} $ bin/run-example ml.SimpleTextClassificationPipeline (4, spark i j k) -- prob=[0.5406433544851436,0.45935664551485655], prediction=0.0 (5, l m n) -- prob=[0.9334382627383263,0.06656173726167364], prediction=0.0 (6, mapreduce spark) -- prob=[0.7799076868203896,0.22009231317961045], prediction=0.0 (7, apache hadoop) -- prob=[0.9768636139518304,0.023136386048169616], prediction=0.0 {code} All predictions are 0, while some should be one based on the probability. It seems to be an issue with regularization. was: `bin/spark-submit examples/src/main/python/ml/simple_text_classification_pipeline.py` {code} Row(id=4, text=u'spark i j k', words=[u'spark', u'i', u'j', u'k'], features=SparseVector(262144, {105: 1.0, 106: 1.0, 107: 1.0, 62173: 1.0}), rawPrediction=DenseVector([0.1629, -0.1629]), probability=DenseVector([0.5406, 0.4594]), prediction=0.0) Row(id=5, text=u'l m n', words=[u'l', u'm', u'n'], features=SparseVector(262144, {108: 1.0, 109: 1.0, 110: 1.0}), rawPrediction=DenseVector([2.6407, -2.6407]), probability=DenseVector([0.9334, 0.0666]), prediction=0.0) Row(id=6, text=u'mapreduce spark', words=[u'mapreduce', u'spark'], features=SparseVector(262144, {62173: 1.0, 140738: 1.0}), rawPrediction=DenseVector([1.2651, -1.2651]), probability=DenseVector([0.7799, 0.2201]), prediction=0.0) Row(id=7, text=u'apache hadoop', words=[u'apache', u'hadoop'], features=SparseVector(262144, {128334: 1.0, 134181: 1.0}), rawPrediction=DenseVector([3.7429, -3.7429]), probability=DenseVector([0.9769, 0.0231]), prediction=0.0) {code} All predictions are 0, while some should be one based on the probability. It seems to be an issue with regularization. 
ml.LogisticRegression doesn't output the right prediction - Key: SPARK-7568 URL: https://issues.apache.org/jira/browse/SPARK-7568 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: DB Tsai Priority: Blocker `bin/spark-submit examples/src/main/python/ml/simple_text_classification_pipeline.py` {code} Row(id=4, text=u'spark i j k', words=[u'spark', u'i', u'j', u'k'], features=SparseVector(262144, {105: 1.0, 106: 1.0, 107: 1.0, 62173: 1.0}), rawPrediction=DenseVector([0.1629, -0.1629]), probability=DenseVector([0.5406, 0.4594]), prediction=0.0) Row(id=5, text=u'l m n', words=[u'l', u'm', u'n'], features=SparseVector(262144, {108: 1.0, 109: 1.0, 110: 1.0}), rawPrediction=DenseVector([2.6407, -2.6407]), probability=DenseVector([0.9334, 0.0666]), prediction=0.0) Row(id=6, text=u'mapreduce spark', words=[u'mapreduce', u'spark'], features=SparseVector(262144, {62173: 1.0, 140738: 1.0}), rawPrediction=DenseVector([1.2651, -1.2651]), probability=DenseVector([0.7799, 0.2201]), prediction=0.0) Row(id=7, text=u'apache hadoop', words=[u'apache', u'hadoop'], features=SparseVector(262144, {128334: 1.0, 134181: 1.0}), rawPrediction=DenseVector([3.7429, -3.7429]), probability=DenseVector([0.9769, 0.0231]), prediction=0.0) {code} In Scala {code} $ bin/run-example ml.SimpleTextClassificationPipeline (4, spark i j k) -- prob=[0.5406433544851436,0.45935664551485655], prediction=0.0 (5, l m n) -- prob=[0.9334382627383263,0.06656173726167364], prediction=0.0 (6, mapreduce spark) -- prob=[0.7799076868203896,0.22009231317961045], prediction=0.0 (7, apache hadoop) -- prob=[0.9768636139518304,0.023136386048169616], prediction=0.0 {code} All predictions are 0, while some should be one based on the probability. It seems to be an issue with regularization. -- This message
[jira] [Commented] (SPARK-7561) Install Junit Attachment Plugin on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-7561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540498#comment-14540498 ] shane knapp commented on SPARK-7561: it's installed, but i will need to restart jenkins one morning to activate the plugin. Install Junit Attachment Plugin on Jenkins -- Key: SPARK-7561 URL: https://issues.apache.org/jira/browse/SPARK-7561 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Patrick Wendell Assignee: shane knapp As part of SPARK-7560 I'd like to just attach the test output file to the Jenkins build. This is nicer than requiring someone have an SSH login to the master node. Currently we gzip the logs, copy it to the master, and then delete them on the worker. https://github.com/apache/spark/blob/master/dev/run-tests-jenkins#L132 Instead I think we can just gzip them and then have the attachment plugin add them to the build. But it would require installing this plug-in to see if we can get it working. [~shaneknapp] not sure how willing you are to install plug-ins on Jenkins, but this one would be awesome if it's doable and we can get it working. https://wiki.jenkins-ci.org/display/JENKINS/JUnit+Attachments+Plugin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4128) Create instructions on fully building Spark in Intellij
[ https://issues.apache.org/jira/browse/SPARK-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540502#comment-14540502 ] Christian Kadner edited comment on SPARK-4128 at 5/12/15 7:11 PM: -- Yes, I encountered these compile problems after a fresh import of the Spark 1.3.0 and 1.3.1 project from download (.tgz) and 1.4 when loaded from a Git repository. For Scala 2.10/2.11 support, I suppose either one should be chosen by default without having to run a script. Btw, that should be doc'd as well ;-) was (Author: ckadner): Yes, I encountered these compile problems after a fresh import of the Spark 1.4 project both when downloaded (tar/zip) and when loaded from a Git repository. For Scala 2.10/2.11 support, I suppose either one should be chosen by default without having to run a script. Btw, that should be doc'd as well ;-) Create instructions on fully building Spark in Intellij --- Key: SPARK-4128 URL: https://issues.apache.org/jira/browse/SPARK-4128 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.2.0 With some of our more complicated modules, I'm not sure whether Intellij correctly understands all source locations. Also, we might require specifying some profiles for the build to work directly. We should document clearly how to start with vanilla Spark master and get the entire thing building in Intellij. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7571) Rename `Math` to `math` in MLlib's Scala code
Xiangrui Meng created SPARK-7571: Summary: Rename `Math` to `math` in MLlib's Scala code Key: SPARK-7571 URL: https://issues.apache.org/jira/browse/SPARK-7571 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Trivial scala.Math has been deprecated since Scala 2.8. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
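For context, a trivial example of the rename (illustrative, not taken from the MLlib code itself):
{code}
// scala.Math has been deprecated since Scala 2.8; the scala.math package object
// is the preferred spelling.
val y = math.log1p(2.0)   // instead of Math.log1p(2.0)
{code}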
[jira] [Commented] (SPARK-7422) Add argmax to Vector, SparseVector
[ https://issues.apache.org/jira/browse/SPARK-7422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540574#comment-14540574 ] George Dittmar commented on SPARK-7422: --- Yep will do. Do you want me to hold off on the PR for the other jira until this one gets merged in or can I just put them in at the same time but separate? Add argmax to Vector, SparseVector -- Key: SPARK-7422 URL: https://issues.apache.org/jira/browse/SPARK-7422 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Priority: Minor Labels: starter DenseVector has an argmax method which is currently private to Spark. It would be nice to add that method to Vector and SparseVector. Adding it to SparseVector would require being careful about handling the inactive elements correctly and efficiently. We should make argmax public and add unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
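A minimal sketch of how the inactive (implicit zero) entries might be handled, assuming SparseVector's layout of parallel {{indices}}/{{values}} arrays; this is illustrative only, not the pending PR, and tie-breaking between equal values is left aside.
{code}
// Argmax over a sparse vector of the given size, where `indices`/`values` hold
// only the active entries (sorted by index) and everything else is implicitly 0.
def sparseArgmax(size: Int, indices: Array[Int], values: Array[Double]): Int = {
  if (size == 0) {
    -1                 // empty vector: no valid index
  } else if (values.isEmpty) {
    0                  // all entries are implicit zeros
  } else {
    // Largest active value first.
    var maxIdx = indices(0)
    var maxVal = values(0)
    var i = 1
    while (i < values.length) {
      if (values(i) > maxVal) { maxVal = values(i); maxIdx = indices(i) }
      i += 1
    }
    // If every active value is negative and some entries are inactive, an
    // implicit zero wins: return the smallest inactive index.
    if (maxVal < 0 && values.length < size) {
      val active = indices.toSet
      var j = 0
      while (active.contains(j)) j += 1
      maxIdx = j
    }
    maxIdx
  }
}
{code}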
[jira] [Commented] (SPARK-4128) Create instructions on fully building Spark in Intellij
[ https://issues.apache.org/jira/browse/SPARK-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540550#comment-14540550 ] Sean Owen commented on SPARK-4128: -- OK, propose the text you want to add back and I'll put that in the wiki. You don't have to run a script to do anything; 2.10 is the default. Create instructions on fully building Spark in Intellij --- Key: SPARK-4128 URL: https://issues.apache.org/jira/browse/SPARK-4128 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.2.0 With some of our more complicated modules, I'm not sure whether Intellij correctly understands all source locations. Also, we might require specifying some profiles for the build to work directly. We should document clearly how to start with vanilla Spark master and get the entire thing building in Intellij. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7572) Move Param and Params to ml.param in PySpark
Xiangrui Meng created SPARK-7572: Summary: Move Param and Params to ml.param in PySpark Key: SPARK-7572 URL: https://issues.apache.org/jira/browse/SPARK-7572 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng to match Scala namespaces. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7572) Move Param and Params to ml.param in PySpark
[ https://issues.apache.org/jira/browse/SPARK-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540617#comment-14540617 ] Apache Spark commented on SPARK-7572: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/6094 Move Param and Params to ml.param in PySpark Key: SPARK-7572 URL: https://issues.apache.org/jira/browse/SPARK-7572 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng to match Scala namespaces. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7572) Move Param and Params to ml.param in PySpark
[ https://issues.apache.org/jira/browse/SPARK-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7572: --- Assignee: Xiangrui Meng (was: Apache Spark) Move Param and Params to ml.param in PySpark Key: SPARK-7572 URL: https://issues.apache.org/jira/browse/SPARK-7572 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng to match Scala namespaces. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7572) Move Param and Params to ml.param in PySpark
[ https://issues.apache.org/jira/browse/SPARK-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7572: --- Assignee: Apache Spark (was: Xiangrui Meng) Move Param and Params to ml.param in PySpark Key: SPARK-7572 URL: https://issues.apache.org/jira/browse/SPARK-7572 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Apache Spark to match Scala namespaces. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7552) Close files correctly when iteration is finished in WAL recovery
[ https://issues.apache.org/jira/browse/SPARK-7552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7552: - Labels: (was: backport-needed) Close files correctly when iteration is finished in WAL recovery Key: SPARK-7552 URL: https://issues.apache.org/jira/browse/SPARK-7552 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1, 1.4.0 Reporter: Saisai Shao Assignee: Saisai Shao Fix For: 1.3.2, 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7552) Close files correctly when iteration is finished in WAL recovery
[ https://issues.apache.org/jira/browse/SPARK-7552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7552. -- Resolution: Pending Closed Fix Version/s: 1.3.2 Assignee: Saisai Shao Close files correctly when iteration is finished in WAL recovery Key: SPARK-7552 URL: https://issues.apache.org/jira/browse/SPARK-7552 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1, 1.4.0 Reporter: Saisai Shao Assignee: Saisai Shao Fix For: 1.3.2, 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4128) Create instructions on fully building Spark in Intellij
[ https://issues.apache.org/jira/browse/SPARK-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540397#comment-14540397 ] Christian Kadner commented on SPARK-4128: - Not every user may care about each of the modules, and yes, these instructions may need to be revised. Yet I strongly think there should be some general text, maybe under Other Tips, that explains the need to manually update the Module settings to mark additional folders as Source folders (after selecting the right combination of Profiles and doing a Generate Sources For spark-hive this seems to still be true. Patrick had written this comment in one of his emails, which are helpful to understand why that needs to be done. In some cases in the maven build we now have pluggable source directories based on profiles using the maven build helper plug-in. This is necessary to support cross building against different Hive versions, and there will be additional instances of this due to supporting scala 2.11 and 2.10. In these cases, you may need to add source locations explicitly to intellij if you want the entire project to compile there. Unfortunately as long as we support cross-building like this, it will be an issue. Intellij's maven support does not correctly detect our use of the maven-build-plugin to add source directories. Besides fixing the module settings for spark-hive, I had to change the flume-sink module settings to mark target\scala-2.10\src_managed\main\compiled_avro folder as additional Source Folder. Create instructions on fully building Spark in Intellij --- Key: SPARK-4128 URL: https://issues.apache.org/jira/browse/SPARK-4128 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.2.0 With some of our more complicated modules, I'm not sure whether Intellij correctly understands all source locations. Also, we might require specifying some profiles for the build to work directly. We should document clearly how to start with vanilla Spark master and get the entire thing building in Intellij. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5707) Enabling spark.sql.codegen throws ClassNotFound exception
[ https://issues.apache.org/jira/browse/SPARK-5707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5707: --- Assignee: Ram Sriharsha Enabling spark.sql.codegen throws ClassNotFound exception - Key: SPARK-5707 URL: https://issues.apache.org/jira/browse/SPARK-5707 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.1 Environment: yarn-client mode, spark.sql.codegen=true Reporter: Yi Yao Assignee: Ram Sriharsha Priority: Blocker Exception thrown: {noformat} org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in stage 133.0 failed 4 times, most recent failure: Lost task 13.3 in stage 133.0 (TID 3066, cdh52-node2): java.io.IOException: com.esotericsoftware.kryo.KryoException: Unable to find class: __wrapper$1$81257352e1c844aebf09cb84fe9e7459.__wrapper$1$81257352e1c844aebf09cb84fe9e7459$SpecificRow$1 Serialization trace: hashTable (org.apache.spark.sql.execution.joins.UniqueKeyHashedRelation) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87) at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:62) at org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:61) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at
[jira] [Created] (SPARK-7570) Ignore _temporary folders during partition discovery
Cheng Lian created SPARK-7570: - Summary: Ignore _temporary folders during partition discovery Key: SPARK-7570 URL: https://issues.apache.org/jira/browse/SPARK-7570 Project: Spark Issue Type: Improvement Reporter: Cheng Lian Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7425) spark.ml Predictor should support other numeric types for label
[ https://issues.apache.org/jira/browse/SPARK-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540585#comment-14540585 ] Joseph K. Bradley commented on SPARK-7425: -- Should we not just support all NumericType sub-types? spark.ml Predictor should support other numeric types for label --- Key: SPARK-7425 URL: https://issues.apache.org/jira/browse/SPARK-7425 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Minor Labels: starter Currently, the Predictor abstraction expects the input labelCol type to be DoubleType, but we should support other numeric types. This will involve updating the PredictorParams.validateAndTransformSchema method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5446) Parquet column pruning should work for Map and Struct
[ https://issues.apache.org/jira/browse/SPARK-5446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540609#comment-14540609 ] Michael Armbrust commented on SPARK-5446: - Can you post the query execution for all four versions of the query? Parquet column pruning should work for Map and Struct - Key: SPARK-5446 URL: https://issues.apache.org/jira/browse/SPARK-5446 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Jianshi Huang Consider the following query: {code:sql} select stddev_pop(variables.var1) stddev from model group by model_name {code} Here variables is a Struct containing many fields; similarly, it could be a Map with many key-value pairs. During execution, SparkSQL will shuffle the whole map or struct column instead of extracting the value first. The performance is very poor. The optimized version could use a subquery: {code:sql} select stddev_pop(var) stddev from (select variables.var1 as var, model_name from model) m group by model_name {code} Here we extract the field/key-value only on the mapper side, so the data being shuffled is small. A benchmark for a table with 600 variables shows a drastic improvement in runtime: || || Parquet (using Map) || Parquet (using Struct) || | Stddev (unoptimized) | 12890s | 583s | | Stddev (optimized) | 84s | 61s | Parquet already supports reading a single field/key-value at the storage level, but SparkSQL currently doesn’t have an optimization for it. This would be a very useful optimization for tables having a Map or Struct with many columns. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7568) ml.LogisticRegression doesn't output the right prediction
Xiangrui Meng created SPARK-7568: Summary: ml.LogisticRegression doesn't output the right prediction Key: SPARK-7568 URL: https://issues.apache.org/jira/browse/SPARK-7568 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: DB Tsai Priority: Blocker `bin/spark-submit examples/src/main/python/ml/simple_text_classification_pipeline.py` {code} Row(id=4, text=u'spark i j k', words=[u'spark', u'i', u'j', u'k'], features=SparseVector(262144, {105: 1.0, 106: 1.0, 107: 1.0, 62173: 1.0}), rawPrediction=DenseVector([0.1629, -0.1629]), probability=DenseVector([0.5406, 0.4594]), prediction=0.0) Row(id=5, text=u'l m n', words=[u'l', u'm', u'n'], features=SparseVector(262144, {108: 1.0, 109: 1.0, 110: 1.0}), rawPrediction=DenseVector([2.6407, -2.6407]), probability=DenseVector([0.9334, 0.0666]), prediction=0.0) Row(id=6, text=u'mapreduce spark', words=[u'mapreduce', u'spark'], features=SparseVector(262144, {62173: 1.0, 140738: 1.0}), rawPrediction=DenseVector([1.2651, -1.2651]), probability=DenseVector([0.7799, 0.2201]), prediction=0.0) Row(id=7, text=u'apache hadoop', words=[u'apache', u'hadoop'], features=SparseVector(262144, {128334: 1.0, 134181: 1.0}), rawPrediction=DenseVector([3.7429, -3.7429]), probability=DenseVector([0.9769, 0.0231]), prediction=0.0) {code} All predictions are 0, while some should be one based on the probability. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7567) Migrating Parquet data source to FSBasedRelation
[ https://issues.apache.org/jira/browse/SPARK-7567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-7567: -- Component/s: SQL Target Version/s: 1.4.0 Affects Version/s: 1.4.0 Assignee: Cheng Lian Migrating Parquet data source to FSBasedRelation Key: SPARK-7567 URL: https://issues.apache.org/jira/browse/SPARK-7567 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7569) Improve error for binary expressions
[ https://issues.apache.org/jira/browse/SPARK-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540440#comment-14540440 ] Apache Spark commented on SPARK-7569: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/6089 Improve error for binary expressions Key: SPARK-7569 URL: https://issues.apache.org/jira/browse/SPARK-7569 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical This is not a great error: {code} scala> Seq((1,1)).toDF("a", "b").select(lit(1) + new java.sql.Date(1)) org.apache.spark.sql.AnalysisException: invalid expression (1 + 0) between Literal 1, IntegerType and Literal 0, DateType; {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7570) Ignore _temporary folders during partition discovery
[ https://issues.apache.org/jira/browse/SPARK-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540462#comment-14540462 ] Apache Spark commented on SPARK-7570: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/6091 Ignore _temporary folders during partition discovery Key: SPARK-7570 URL: https://issues.apache.org/jira/browse/SPARK-7570 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1, 1.4.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Critical When speculation is turned on, directories named {{_temporary}} may be left in data directories after saving a DataFrame. These directories should be ignored. Currently they simply fail partition discovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
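A hedged sketch of the filtering idea in the description above (an assumed helper, not the actual partition-discovery code): drop {{_temporary}} output directories before inferring partition columns from directory names.
{code}
import org.apache.hadoop.fs.{FileStatus, Path}

// Keep only paths that should participate in partition discovery.
def partitionCandidates(statuses: Seq[FileStatus]): Seq[Path] =
  statuses.map(_.getPath).filterNot { p =>
    val name = p.getName
    // The description only calls out _temporary; hidden files are skipped here
    // as well purely for illustration.
    name == "_temporary" || name.startsWith(".")
  }
{code}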
[jira] [Assigned] (SPARK-7570) Ignore _temporary folders during partition discovery
[ https://issues.apache.org/jira/browse/SPARK-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7570: --- Assignee: Apache Spark (was: Cheng Lian) Ignore _temporary folders during partition discovery Key: SPARK-7570 URL: https://issues.apache.org/jira/browse/SPARK-7570 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1, 1.4.0 Reporter: Cheng Lian Assignee: Apache Spark Priority: Critical When speculation is turned on, directories named {{_temporary}} may be left in data directories after saving a DataFrame. These directories should be ignored. Currently they simply fail partition discovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7570) Ignore _temporary folders during partition discovery
[ https://issues.apache.org/jira/browse/SPARK-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7570: --- Assignee: Cheng Lian (was: Apache Spark) Ignore _temporary folders during partition discovery Key: SPARK-7570 URL: https://issues.apache.org/jira/browse/SPARK-7570 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1, 1.4.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Critical When speculation is turned on, directories named {{_temporary}} may be left in data directories after saving a DataFrame. These directories should be ignored. Currently they simply fail partition discovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7425) spark.ml Predictor should support other numeric types for label
[ https://issues.apache.org/jira/browse/SPARK-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540558#comment-14540558 ] Glenn Weidner commented on SPARK-7425: -- Working on adding support at the second TODO in ml.Predictor.validateAndTransformSchema for the following spark.sql.types: DecimalType, FloatType, IntegerType, LongType, ShortType. spark.ml Predictor should support other numeric types for label --- Key: SPARK-7425 URL: https://issues.apache.org/jira/browse/SPARK-7425 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Minor Labels: starter Currently, the Predictor abstraction expects the input labelCol type to be DoubleType, but we should support other numeric types. This will involve updating the PredictorParams.validateAndTransformSchema method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
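A rough sketch of the schema check described above (not the actual PredictorParams code; the method name is an assumption): accept any {{NumericType}} for the label column rather than requiring {{DoubleType}}, then cast to Double downstream.
{code}
import org.apache.spark.sql.types.{NumericType, StructType}

// Illustrative only: verify that the label column is numeric instead of strictly DoubleType.
def validateLabelColumn(schema: StructType, labelCol: String): Unit = {
  val dt = schema(labelCol).dataType
  require(dt.isInstanceOf[NumericType],
    s"Label column '$labelCol' must be of a numeric type but was ${dt.simpleString}.")
}
{code}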
[jira] [Updated] (SPARK-7557) User guide update for feature transformer: HashingTF, Tokenizer
[ https://issues.apache.org/jira/browse/SPARK-7557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7557: - Summary: User guide update for feature transformer: HashingTF, Tokenizer (was: User guide update for feature transformer: HashingTF) User guide update for feature transformer: HashingTF, Tokenizer --- Key: SPARK-7557 URL: https://issues.apache.org/jira/browse/SPARK-7557 Project: Spark Issue Type: Documentation Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Copied from [SPARK-7443]: {quote} Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. * The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. * We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2018) Big-Endian (IBM Power7) Spark Serialization issue
[ https://issues.apache.org/jira/browse/SPARK-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2018. -- Resolution: Pending Closed Fix Version/s: 1.4.0 1.3.2 Issue resolved by pull request 6077 [https://github.com/apache/spark/pull/6077] Big-Endian (IBM Power7) Spark Serialization issue -- Key: SPARK-2018 URL: https://issues.apache.org/jira/browse/SPARK-2018 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Environment: hardware : IBM Power7 OS:Linux version 2.6.32-358.el6.ppc64 (mockbu...@ppc-017.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Tue Jan 29 11:43:27 EST 2013 JDK: Java(TM) SE Runtime Environment (build pxp6470sr5-20130619_01(SR5)) IBM J9 VM (build 2.6, JRE 1.7.0 Linux ppc64-64 Compressed References 20130617_152572 (JIT enabled, AOT enabled) Hadoop:Hadoop-0.2.3-CDH5.0 Spark:Spark-1.0.0 or Spark-0.9.1 spark-env.sh: export JAVA_HOME=/opt/ibm/java-ppc64-70/ export SPARK_MASTER_IP=9.114.34.69 export SPARK_WORKER_MEMORY=1m export SPARK_CLASSPATH=/home/test1/spark-1.0.0-bin-hadoop2/lib export STANDALONE_SPARK_MASTER_HOST=9.114.34.69 #export SPARK_JAVA_OPTS=' -Xdebug -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n ' Reporter: Yanjie Gao Fix For: 1.3.2, 1.4.0 We have an application run on Spark on Power7 System . But we meet an important issue about serialization. The example HdfsWordCount can meet the problem. ./bin/run-example org.apache.spark.examples.streaming.HdfsWordCount localdir We used Power7 (Big-Endian arch) and Redhat 6.4. Big-Endian is the main cause since the example ran successfully in another Power-based Little Endian setup. here is the exception stack and log: Spark Executor Command: /opt/ibm/java-ppc64-70//bin/java -cp /home/test1/spark-1.0.0-bin-hadoop2/lib::/home/test1/src/spark-1.0.0-bin-hadoop2/conf:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/ -XX:MaxPermSize=128m -Xdebug -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler 2 p7hvs7br16 4 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker app-20140604023054- 14/06/04 02:31:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 14/06/04 02:31:21 INFO spark.SecurityManager: Changing view acls to: test1,yifeng 14/06/04 02:31:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test1, yifeng) 14/06/04 02:31:22 INFO slf4j.Slf4jLogger: Slf4jLogger started 14/06/04 02:31:22 INFO Remoting: Starting remoting 14/06/04 02:31:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@p7hvs7br16:39658] 14/06/04 02:31:22 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutor@p7hvs7br16:39658] 14/06/04 02:31:22 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler 14/06/04 02:31:22 INFO worker.WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker 14/06/04 02:31:23 INFO worker.WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker 14/06/04 02:31:24 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver 14/06/04 02:31:24 INFO spark.SecurityManager: Changing view acls to: test1,yifeng 14/06/04 02:31:24 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test1, yifeng) 14/06/04 02:31:24 INFO slf4j.Slf4jLogger: Slf4jLogger started 14/06/04 02:31:24 INFO Remoting: Starting remoting 14/06/04 02:31:24 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@p7hvs7br16:58990] 14/06/04 02:31:24 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@p7hvs7br16:58990] 14/06/04 02:31:24 INFO spark.SparkEnv: Connecting to MapOutputTracker: akka.tcp://spark@9.186.105.141:60253/user/MapOutputTracker 14/06/04 02:31:25 INFO spark.SparkEnv: Connecting to BlockManagerMaster:
[jira] [Assigned] (SPARK-7557) User guide update for feature transformer: HashingTF, Tokenizer
[ https://issues.apache.org/jira/browse/SPARK-7557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7557: --- Assignee: Apache Spark (was: Joseph K. Bradley) User guide update for feature transformer: HashingTF, Tokenizer --- Key: SPARK-7557 URL: https://issues.apache.org/jira/browse/SPARK-7557 Project: Spark Issue Type: Documentation Components: ML Reporter: Joseph K. Bradley Assignee: Apache Spark Copied from [SPARK-7443]: {quote} Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. * The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. * We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7555) User guide update for ElasticNet
[ https://issues.apache.org/jira/browse/SPARK-7555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540602#comment-14540602 ] Joseph K. Bradley commented on SPARK-7555: -- Note: I created a new subsection for links to spark.ml-specific guides in this JIRA's PR: [SPARK-7557]. For new algorithms like ElasticNet, we can add similar new subsections/links as needed. User guide update for ElasticNet Key: SPARK-7555 URL: https://issues.apache.org/jira/browse/SPARK-7555 Project: Spark Issue Type: Documentation Components: ML Reporter: Joseph K. Bradley Assignee: DB Tsai Copied from [SPARK-7443]: {quote} Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. * The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. * We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7557) User guide update for feature transformer: HashingTF, Tokenizer
[ https://issues.apache.org/jira/browse/SPARK-7557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540601#comment-14540601 ] Apache Spark commented on SPARK-7557: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/6093 User guide update for feature transformer: HashingTF, Tokenizer --- Key: SPARK-7557 URL: https://issues.apache.org/jira/browse/SPARK-7557 Project: Spark Issue Type: Documentation Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Copied from [SPARK-7443]: {quote} Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. * The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. * We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7570) Ignore _temporary folders during partition discovery
[ https://issues.apache.org/jira/browse/SPARK-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-7570: -- Component/s: SQL Description: When speculation is turned on, directories named {{_temporary}} may be left in data directories after saving a DataFrame. These directories should be ignored. Currently they simply fail partition discovery. Target Version/s: 1.4.0 Affects Version/s: 1.4.0 1.3.1 Assignee: Cheng Lian Ignore _temporary folders during partition discovery Key: SPARK-7570 URL: https://issues.apache.org/jira/browse/SPARK-7570 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1, 1.4.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Critical When speculation is turned on, directories named {{_temporary}} may be left in data directories after saving a DataFrame. These directories should be ignored. Currently they simply fail partition discovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6980) Akka timeout exceptions indicate which conf controls them
[ https://issues.apache.org/jira/browse/SPARK-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540335#comment-14540335 ] Harsh Gupta edited comment on SPARK-6980 at 5/12/15 6:48 PM: - [~bryanc] [~irashid] can you update us on the progress so that we can share the workload? I created my own PR later but realised most of the work has already been done by Bryan in his PR commits. Is there any way I can merge his PR and work in parallel with Bryan? was (Author: harshg): [~bryanc] can you update us on the progress so that we can share the workload? Akka timeout exceptions indicate which conf controls them - Key: SPARK-6980 URL: https://issues.apache.org/jira/browse/SPARK-6980 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Imran Rashid Assignee: Harsh Gupta Priority: Minor Labels: starter Attachments: Spark-6980-Test.scala If you hit one of the akka timeouts, you just get an exception like {code} java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] {code} The exception doesn't indicate how to change the timeout, though there is usually (always?) a corresponding setting in {{SparkConf}}. It would be nice if the exception included the relevant setting. I think this should be pretty easy to do -- we just need to create something like a {{NamedTimeout}}. It would have its own {{await}} method that catches the akka timeout and throws its own exception. We should change {{RpcUtils.askTimeout}} and {{RpcUtils.lookupTimeout}} to always give a {{NamedTimeout}}, so we can be sure that anytime we have a timeout, we get a better exception. Given the latest refactoring to the rpc layer, this needs to be done in both {{AkkaUtils}} and {{AkkaRpcEndpoint}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
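An illustrative sketch of the {{NamedTimeout}} idea (the class and field names below are assumptions, not the committed API): wrap the await so that the thrown {{TimeoutException}} names the {{SparkConf}} key that controls it.
{code}
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Awaitable}
import scala.concurrent.duration.FiniteDuration

// Sketch only: a timeout that remembers which configuration property set it.
case class NamedTimeout(duration: FiniteDuration, confKey: String) {
  def awaitResult[T](awaitable: Awaitable[T]): T =
    try {
      Await.result(awaitable, duration)
    } catch {
      case _: TimeoutException =>
        throw new TimeoutException(
          s"Futures timed out after [$duration]. This timeout is controlled by $confKey")
    }
}
{code}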
[jira] [Resolved] (SPARK-7276) withColumn is very slow on dataframe with large number of columns
[ https://issues.apache.org/jira/browse/SPARK-7276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-7276. - Resolution: Pending Closed Fix Version/s: 1.4.0 Issue resolved by pull request 5831 [https://github.com/apache/spark/pull/5831] withColumn is very slow on dataframe with large number of columns - Key: SPARK-7276 URL: https://issues.apache.org/jira/browse/SPARK-7276 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.1 Reporter: Alexandre CLEMENT Assignee: Wenchen Fan Fix For: 1.4.0 The code snippet demonstrates the problem. {code} import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql._ import org.apache.spark.sql.types._ val sparkConf = new SparkConf().setAppName("Spark Test").setMaster(System.getProperty("spark.master", "local[4]")) val sc = new SparkContext(sparkConf) val sqlContext = new SQLContext(sc) import sqlContext.implicits._ val custs = Seq( Row(1, "Bob", 21, 80.5), Row(2, "Bobby", 21, 80.5), Row(3, "Jean", 21, 80.5), Row(4, "Fatime", 21, 80.5) ) var fields = List( StructField("id", IntegerType, true), StructField("a", IntegerType, true), StructField("b", StringType, true), StructField("target", DoubleType, false)) val schema = StructType(fields) var rdd = sc.parallelize(custs) var df = sqlContext.createDataFrame(rdd, schema) for (i <- 1 to 200) { val now = System.currentTimeMillis df = df.withColumn("a_new_col_" + i, df("a") + i) println(s"$i - " + (System.currentTimeMillis - now)) } df.show() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
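A possible workaround sketch (reusing the {{df}} from the snippet above; this is not taken from the resolving PR): because each {{withColumn}} call re-analyzes an ever-growing plan, building all 200 derived columns in a single {{select}} tends to avoid the quadratic slowdown.
{code}
// Sketch: derive every new column once and project them together in one pass.
val newCols = (1 to 200).map(i => (df("a") + i).as("a_new_col_" + i))
val widened = df.select(df.columns.map(df.apply) ++ newCols: _*)
widened.show()
{code}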
[jira] [Resolved] (SPARK-7531) Install GPG on Jenkins machines
[ https://issues.apache.org/jira/browse/SPARK-7531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp resolved SPARK-7531. Resolution: Pending Closed it was already installed on all hosts, we're g2g Install GPG on Jenkins machines --- Key: SPARK-7531 URL: https://issues.apache.org/jira/browse/SPARK-7531 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Patrick Wendell Assignee: shane knapp This one is also required for us to cut regular snapshot releases from Jenkins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7422) Add argmax to Vector, SparseVector
[ https://issues.apache.org/jira/browse/SPARK-7422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540509#comment-14540509 ] Joseph K. Bradley commented on SPARK-7422: -- Great! Just to confirm: Can you please do separate PRs for this JIRA and the related one you're working on? Add argmax to Vector, SparseVector -- Key: SPARK-7422 URL: https://issues.apache.org/jira/browse/SPARK-7422 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Priority: Minor Labels: starter DenseVector has an argmax method which is currently private to Spark. It would be nice to add that method to Vector and SparseVector. Adding it to SparseVector would require being careful about handling the inactive elements correctly and efficiently. We should make argmax public and add unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
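A sketch of the inactive-element handling (a standalone helper, not the merged Spark implementation): because a {{SparseVector}} stores only its active entries, {{argmax}} has to treat the implicit zeros as candidates, e.g. when every stored value is negative the answer is the first inactive index.
{code}
// Illustrative only; operates on the raw (size, indices, values) triple of a sparse vector
// and assumes indices are sorted ascending, as they are in SparseVector.
def sparseArgmax(size: Int, indices: Array[Int], values: Array[Double]): Int = {
  if (size == 0) return -1
  val hasInactive = indices.length < size
  // Baseline: the first inactive (implicitly zero) index if one exists, else the first stored entry.
  var maxIdx =
    if (hasInactive) {
      var candidate = 0
      var i = 0
      while (i < indices.length && indices(i) == candidate) { candidate += 1; i += 1 }
      candidate
    } else indices(0)
  var maxVal = if (hasInactive) 0.0 else values(0)
  var j = 0
  while (j < values.length) {
    if (values(j) > maxVal) { maxVal = values(j); maxIdx = indices(j) }
    j += 1
  }
  maxIdx
}
{code}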
[jira] [Updated] (SPARK-2018) Big-Endian (IBM Power7) Spark Serialization issue
[ https://issues.apache.org/jira/browse/SPARK-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2018: - Assignee: Tim Ellison Big-Endian (IBM Power7) Spark Serialization issue -- Key: SPARK-2018 URL: https://issues.apache.org/jira/browse/SPARK-2018 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Environment: hardware : IBM Power7 OS:Linux version 2.6.32-358.el6.ppc64 (mockbu...@ppc-017.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Tue Jan 29 11:43:27 EST 2013 JDK: Java(TM) SE Runtime Environment (build pxp6470sr5-20130619_01(SR5)) IBM J9 VM (build 2.6, JRE 1.7.0 Linux ppc64-64 Compressed References 20130617_152572 (JIT enabled, AOT enabled) Hadoop:Hadoop-0.2.3-CDH5.0 Spark:Spark-1.0.0 or Spark-0.9.1 spark-env.sh: export JAVA_HOME=/opt/ibm/java-ppc64-70/ export SPARK_MASTER_IP=9.114.34.69 export SPARK_WORKER_MEMORY=1m export SPARK_CLASSPATH=/home/test1/spark-1.0.0-bin-hadoop2/lib export STANDALONE_SPARK_MASTER_HOST=9.114.34.69 #export SPARK_JAVA_OPTS=' -Xdebug -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n ' Reporter: Yanjie Gao Assignee: Tim Ellison Fix For: 1.3.2, 1.4.0 We have an application run on Spark on Power7 System . But we meet an important issue about serialization. The example HdfsWordCount can meet the problem. ./bin/run-example org.apache.spark.examples.streaming.HdfsWordCount localdir We used Power7 (Big-Endian arch) and Redhat 6.4. Big-Endian is the main cause since the example ran successfully in another Power-based Little Endian setup. here is the exception stack and log: Spark Executor Command: /opt/ibm/java-ppc64-70//bin/java -cp /home/test1/spark-1.0.0-bin-hadoop2/lib::/home/test1/src/spark-1.0.0-bin-hadoop2/conf:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/test1/src/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/:/home/test1/src/hadoop-2.3.0-cdh5.0.0/etc/hadoop/ -XX:MaxPermSize=128m -Xdebug -Xrunjdwp:transport=dt_socket,address=9,server=y,suspend=n -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler 2 p7hvs7br16 4 akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker app-20140604023054- 14/06/04 02:31:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 14/06/04 02:31:21 INFO spark.SecurityManager: Changing view acls to: test1,yifeng 14/06/04 02:31:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test1, yifeng) 14/06/04 02:31:22 INFO slf4j.Slf4jLogger: Slf4jLogger started 14/06/04 02:31:22 INFO Remoting: Starting remoting 14/06/04 02:31:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@p7hvs7br16:39658] 14/06/04 02:31:22 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutor@p7hvs7br16:39658] 14/06/04 02:31:22 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@9.186.105.141:60253/user/CoarseGrainedScheduler 14/06/04 02:31:22 INFO worker.WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker 14/06/04 02:31:23 INFO worker.WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@p7hvs7br16:59240/user/Worker 14/06/04 02:31:24 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver 14/06/04 02:31:24 INFO spark.SecurityManager: Changing view acls to: test1,yifeng 14/06/04 02:31:24 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(test1, yifeng) 14/06/04 02:31:24 INFO slf4j.Slf4jLogger: Slf4jLogger started 14/06/04 02:31:24 INFO Remoting: Starting remoting 14/06/04 02:31:24 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@p7hvs7br16:58990] 14/06/04 02:31:24 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@p7hvs7br16:58990] 14/06/04 02:31:24 INFO spark.SparkEnv: Connecting to MapOutputTracker: akka.tcp://spark@9.186.105.141:60253/user/MapOutputTracker 14/06/04 02:31:25 INFO spark.SparkEnv: Connecting to BlockManagerMaster: akka.tcp://spark@9.186.105.141:60253/user/BlockManagerMaster 14/06/04 02:31:25 INFO storage.DiskBlockManager: Created local directory at
[jira] [Commented] (SPARK-7556) User guide update for feature transformer: Binarizer
[ https://issues.apache.org/jira/browse/SPARK-7556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540607#comment-14540607 ] Joseph K. Bradley commented on SPARK-7556: -- Note: I created a new subsection for links to spark.ml-specific guides in this JIRA's PR: [SPARK-7557]. Binarizer can go within the new subsection. I'll try to get that PR merged ASAP. Thanks! User guide update for feature transformer: Binarizer Key: SPARK-7556 URL: https://issues.apache.org/jira/browse/SPARK-7556 Project: Spark Issue Type: Documentation Components: ML Reporter: Joseph K. Bradley Assignee: Liang-Chi Hsieh Copied from [SPARK-7443]: {quote} Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. * The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. * We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4128) Create instructions on fully building Spark in Intellij
[ https://issues.apache.org/jira/browse/SPARK-4128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540397#comment-14540397 ] Christian Kadner edited comment on SPARK-4128 at 5/12/15 6:23 PM: -- Not every user may care about each of the modules, and yes, these instructions may need to be revised. Yet I strongly think there should be some general text, maybe under Other Tips, that explains the need to manually update the Module settings to mark additional folders as Source folders (after selecting the right combination of Profiles and doing a "Generate Sources"). For spark-hive this seems to still be true. Patrick had written this comment in one of his emails, which is helpful to understand why that needs to be done. In some cases in the maven build we now have pluggable source directories based on profiles using the maven build helper plug-in. This is necessary to support cross building against different Hive versions, and there will be additional instances of this due to supporting scala 2.11 and 2.10. In these cases, you may need to add source locations explicitly to intellij if you want the entire project to compile there. Unfortunately as long as we support cross-building like this, it will be an issue. Intellij's maven support does not correctly detect our use of the maven-build-plugin to add source directories. Besides fixing the module settings for spark-hive, I had to change the flume-sink module settings to mark the target\scala-2.10\src_managed\main\compiled_avro folder as an additional Source Folder. was (Author: ckadner): Not every user may care about each of the modules, and yes, these instructions may need to be revised. Yet I strongly think there should be some general text, maybe under Other Tips, that explains the need to manually update the Module settings to mark additional folders as Source folders (after selecting the right combination of Profiles and doing a "Generate Sources"). For spark-hive this seems to still be true. Patrick had written this comment in one of his emails, which are helpful to understand why that needs to be done. In some cases in the maven build we now have pluggable source directories based on profiles using the maven build helper plug-in. This is necessary to support cross building against different Hive versions, and there will be additional instances of this due to supporting scala 2.11 and 2.10. In these cases, you may need to add source locations explicitly to intellij if you want the entire project to compile there. Unfortunately as long as we support cross-building like this, it will be an issue. Intellij's maven support does not correctly detect our use of the maven-build-plugin to add source directories. Besides fixing the module settings for spark-hive, I had to change the flume-sink module settings to mark the target\scala-2.10\src_managed\main\compiled_avro folder as an additional Source Folder. Create instructions on fully building Spark in Intellij --- Key: SPARK-4128 URL: https://issues.apache.org/jira/browse/SPARK-4128 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.2.0 With some of our more complicated modules, I'm not sure whether Intellij correctly understands all source locations. Also, we might require specifying some profiles for the build to work directly. We should document clearly how to start with vanilla Spark master and get the entire thing building in Intellij. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7569) Improve error for binary expressions
[ https://issues.apache.org/jira/browse/SPARK-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7569: --- Assignee: Apache Spark (was: Michael Armbrust) Improve error for binary expressions Key: SPARK-7569 URL: https://issues.apache.org/jira/browse/SPARK-7569 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Apache Spark Priority: Critical This is not a great error: {code} scala> Seq((1,1)).toDF("a", "b").select(lit(1) + new java.sql.Date(1)) org.apache.spark.sql.AnalysisException: invalid expression (1 + 0) between Literal 1, IntegerType and Literal 0, DateType; {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7569) Improve error for binary expressions
[ https://issues.apache.org/jira/browse/SPARK-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7569: --- Assignee: Michael Armbrust (was: Apache Spark) Improve error for binary expressions Key: SPARK-7569 URL: https://issues.apache.org/jira/browse/SPARK-7569 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical This is not a great error: {code} scala> Seq((1,1)).toDF("a", "b").select(lit(1) + new java.sql.Date(1)) org.apache.spark.sql.AnalysisException: invalid expression (1 + 0) between Literal 1, IntegerType and Literal 0, DateType; {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7567) Migrating Parquet data source to FSBasedRelation
[ https://issues.apache.org/jira/browse/SPARK-7567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14540450#comment-14540450 ] Apache Spark commented on SPARK-7567: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/6090 Migrating Parquet data source to FSBasedRelation Key: SPARK-7567 URL: https://issues.apache.org/jira/browse/SPARK-7567 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7567) Migrating Parquet data source to FSBasedRelation
[ https://issues.apache.org/jira/browse/SPARK-7567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7567: --- Assignee: Apache Spark (was: Cheng Lian) Migrating Parquet data source to FSBasedRelation Key: SPARK-7567 URL: https://issues.apache.org/jira/browse/SPARK-7567 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian Assignee: Apache Spark Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7571) Rename `Math` to `math` in MLlib's Scala code
[ https://issues.apache.org/jira/browse/SPARK-7571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7571: --- Assignee: Apache Spark (was: Xiangrui Meng) Rename `Math` to `math` in MLlib's Scala code - Key: SPARK-7571 URL: https://issues.apache.org/jira/browse/SPARK-7571 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Apache Spark Priority: Trivial scala.Math was deprecated since 2.8. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
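A trivial example of the preferred form after the rename (shown only for illustration): use the {{scala.math}} package object instead of the long-deprecated {{scala.Math}} object.
{code}
// Preferred: functions from the scala.math package object.
import scala.math.{exp, log1p}

val sigmoid = 1.0 / (1.0 + exp(-0.5))
val smoothed = log1p(1e-3)
{code}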
[jira] [Updated] (SPARK-7487) Python API for ml.regression
[ https://issues.apache.org/jira/browse/SPARK-7487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7487: - Assignee: Burak Yavuz Python API for ml.regression Key: SPARK-7487 URL: https://issues.apache.org/jira/browse/SPARK-7487 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Burak Yavuz Assignee: Burak Yavuz Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7557) User guide update for feature transformer: HashingTF, Tokenizer
[ https://issues.apache.org/jira/browse/SPARK-7557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7557: --- Assignee: Joseph K. Bradley (was: Apache Spark) User guide update for feature transformer: HashingTF, Tokenizer --- Key: SPARK-7557 URL: https://issues.apache.org/jira/browse/SPARK-7557 Project: Spark Issue Type: Documentation Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Copied from [SPARK-7443]: {quote} Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. * The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. * We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7568) ml.LogisticRegression doesn't output the right prediction
[ https://issues.apache.org/jira/browse/SPARK-7568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7568: - Description: `bin/spark-submit examples/src/main/python/ml/simple_text_classification_pipeline.py` {code} Row(id=4, text=u'spark i j k', words=[u'spark', u'i', u'j', u'k'], features=SparseVector(262144, {105: 1.0, 106: 1.0, 107: 1.0, 62173: 1.0}), rawPrediction=DenseVector([0.1629, -0.1629]), probability=DenseVector([0.5406, 0.4594]), prediction=0.0) Row(id=5, text=u'l m n', words=[u'l', u'm', u'n'], features=SparseVector(262144, {108: 1.0, 109: 1.0, 110: 1.0}), rawPrediction=DenseVector([2.6407, -2.6407]), probability=DenseVector([0.9334, 0.0666]), prediction=0.0) Row(id=6, text=u'mapreduce spark', words=[u'mapreduce', u'spark'], features=SparseVector(262144, {62173: 1.0, 140738: 1.0}), rawPrediction=DenseVector([1.2651, -1.2651]), probability=DenseVector([0.7799, 0.2201]), prediction=0.0) Row(id=7, text=u'apache hadoop', words=[u'apache', u'hadoop'], features=SparseVector(262144, {128334: 1.0, 134181: 1.0}), rawPrediction=DenseVector([3.7429, -3.7429]), probability=DenseVector([0.9769, 0.0231]), prediction=0.0) {code} All predictions are 0, while some should be one based on the probability. It seems to be an issue with regularization. was: `bin/spark-submit examples/src/main/python/ml/simple_text_classification_pipeline.py` {code} Row(id=4, text=u'spark i j k', words=[u'spark', u'i', u'j', u'k'], features=SparseVector(262144, {105: 1.0, 106: 1.0, 107: 1.0, 62173: 1.0}), rawPrediction=DenseVector([0.1629, -0.1629]), probability=DenseVector([0.5406, 0.4594]), prediction=0.0) Row(id=5, text=u'l m n', words=[u'l', u'm', u'n'], features=SparseVector(262144, {108: 1.0, 109: 1.0, 110: 1.0}), rawPrediction=DenseVector([2.6407, -2.6407]), probability=DenseVector([0.9334, 0.0666]), prediction=0.0) Row(id=6, text=u'mapreduce spark', words=[u'mapreduce', u'spark'], features=SparseVector(262144, {62173: 1.0, 140738: 1.0}), rawPrediction=DenseVector([1.2651, -1.2651]), probability=DenseVector([0.7799, 0.2201]), prediction=0.0) Row(id=7, text=u'apache hadoop', words=[u'apache', u'hadoop'], features=SparseVector(262144, {128334: 1.0, 134181: 1.0}), rawPrediction=DenseVector([3.7429, -3.7429]), probability=DenseVector([0.9769, 0.0231]), prediction=0.0) {code} All predictions are 0, while some should be one based on the probability. 
ml.LogisticRegression doesn't output the right prediction - Key: SPARK-7568 URL: https://issues.apache.org/jira/browse/SPARK-7568 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: DB Tsai Priority: Blocker `bin/spark-submit examples/src/main/python/ml/simple_text_classification_pipeline.py` {code} Row(id=4, text=u'spark i j k', words=[u'spark', u'i', u'j', u'k'], features=SparseVector(262144, {105: 1.0, 106: 1.0, 107: 1.0, 62173: 1.0}), rawPrediction=DenseVector([0.1629, -0.1629]), probability=DenseVector([0.5406, 0.4594]), prediction=0.0) Row(id=5, text=u'l m n', words=[u'l', u'm', u'n'], features=SparseVector(262144, {108: 1.0, 109: 1.0, 110: 1.0}), rawPrediction=DenseVector([2.6407, -2.6407]), probability=DenseVector([0.9334, 0.0666]), prediction=0.0) Row(id=6, text=u'mapreduce spark', words=[u'mapreduce', u'spark'], features=SparseVector(262144, {62173: 1.0, 140738: 1.0}), rawPrediction=DenseVector([1.2651, -1.2651]), probability=DenseVector([0.7799, 0.2201]), prediction=0.0) Row(id=7, text=u'apache hadoop', words=[u'apache', u'hadoop'], features=SparseVector(262144, {128334: 1.0, 134181: 1.0}), rawPrediction=DenseVector([3.7429, -3.7429]), probability=DenseVector([0.9769, 0.0231]), prediction=0.0) {code} All predictions are 0, while some should be one based on the probability. It seems to be an issue with regularization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
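A small sanity-check sketch (standalone code, not the pipeline above): for binary classification the prediction should simply be the argmax of the probability vector, so a row whose class-1 probability exceeds 0.5 must not come out as 0.0.
{code}
// Illustrative helper: map a probability vector to the predicted class index.
def predictFromProbability(probability: Array[Double]): Double =
  probability.indices.maxBy(i => probability(i)).toDouble

assert(predictFromProbability(Array(0.5406, 0.4594)) == 0.0)
assert(predictFromProbability(Array(0.3, 0.7)) == 1.0)
{code}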