[jira] [Updated] (SPARK-6721) IllegalStateException
[ https://issues.apache.org/jira/browse/SPARK-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luis Rodríguez Trejo updated SPARK-6721: Description: I get the following exception when using saveAsNewAPIHadoopFile: {code} 15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 10.0.2.15): java.lang.IllegalStateException: open at org.bson.util.Assertions.isTrue(Assertions.java:36) at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406) at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184) at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167) at com.mongodb.DBCollection.insert(DBCollection.java:161) at com.mongodb.DBCollection.insert(DBCollection.java:107) at com.mongodb.DBCollection.save(DBCollection.java:1049) at com.mongodb.DBCollection.save(DBCollection.java:1014) at com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} Before Spark 1.3.0 this would result in the application crashing, but now the data just remains unprocessed. There is no close instruction at any part of the code. was: I get the following exception when using saveAsNewAPIHadoopFile: bq. 15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 10.0.2.15): java.lang.IllegalStateException: open at org.bson.util.Assertions.isTrue(Assertions.java:36) at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406) at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184) at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167) at com.mongodb.DBCollection.insert(DBCollection.java:161) at com.mongodb.DBCollection.insert(DBCollection.java:107) at com.mongodb.DBCollection.save(DBCollection.java:1049) at com.mongodb.DBCollection.save(DBCollection.java:1014) at com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Before Spark 1.3.0 this would result in the application crashing, but now the data just remains unprocessed. There is no close instruction at any part of the code. 
IllegalStateException - Key: SPARK-6721 URL: https://issues.apache.org/jira/browse/SPARK-6721 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0, 1.2.1, 1.3.0 Environment: Ubuntu 14.04, Java 8, MongoDB 3.0, Spark 1.3 Reporter: Luis Rodríguez Trejo Labels: MongoDB, java.lang.IllegalStateexception, saveAsNewAPIHadoopFile I get the following exception when using saveAsNewAPIHadoopFile: {code} 15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 10.0.2.15): java.lang.IllegalStateException: open at org.bson.util.Assertions.isTrue(Assertions.java:36) at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406) at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184) at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167) at com.mongodb.DBCollection.insert(DBCollection.java:161) at com.mongodb.DBCollection.insert(DBCollection.java:107) at com.mongodb.DBCollection.save(DBCollection.java:1049) at com.mongodb.DBCollection.save(DBCollection.java:1014) at com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[jira] [Commented] (SPARK-6700) flaky test: run Python application in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481534#comment-14481534 ] Davies Liu commented on SPARK-6700: --- There is one failure here: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2036/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.deploy.yarn/YarnClusterSuite/run_Python_application_in_yarn_cluster_mode/ and here: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2025/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.deploy.yarn/YarnClusterSuite/run_Python_application_in_yarn_cluster_mode/ Is it related to hadoop2.3 ? flaky test: run Python application in yarn-cluster mode Key: SPARK-6700 URL: https://issues.apache.org/jira/browse/SPARK-6700 Project: Spark Issue Type: Bug Components: Tests Reporter: Davies Liu Assignee: Lianhui Wang Priority: Critical Labels: test, yarn org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-cluster mode Failing for the past 1 build (Since Failed#2025 ) Took 12 sec. Error Message {code} Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1 Stacktrace sbt.ForkMain$ForkError: Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122) at org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at 
org.scalatest.FunSuite.runTest(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at
[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481455#comment-14481455 ] Joseph K. Bradley commented on SPARK-6682: -- As you're suggesting, a wrapper mechanism like that won't be an acceptable solution since it would be a confusing, difficult-to-document API. Deprecate static train and use builder instead for Scala/Java - Key: SPARK-6682 URL: https://issues.apache.org/jira/browse/SPARK-6682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. Old static train() API: {code} val myModel = NaiveBayes.train(myData, ...) {code} New builder pattern API: {code} val nb = new NaiveBayes().setLambda(0.1) val myModel = nb.train(myData) {code} Pros of the builder pattern: * Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). * Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons of the builder pattern: * In Python APIs, static train methods are more Pythonic. Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. * Python: Keep static train methods. CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
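To make the first pro concrete, here is a minimal Scala sketch of the two styles, assuming a hypothetical algorithm Foo with made-up parameters lambda and maxIter (neither Foo nor FooModel is a real MLlib class): without default arguments on the Java side, each prefix of parameters needs its own static train() overload, whereas the builder keeps the defaults in one place.
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

class FooModel                                   // placeholder model type

// Static-train style: one overload per prefix of parameters.
object Foo {
  def train(data: RDD[LabeledPoint]): FooModel =
    train(data, 1.0)                             // default lambda
  def train(data: RDD[LabeledPoint], lambda: Double): FooModel =
    train(data, lambda, 100)                     // default maxIter
  def train(data: RDD[LabeledPoint], lambda: Double, maxIter: Int): FooModel =
    new FooModel                                 // training logic elided
}

// Builder style: defaults live in the class, callers set only what they need.
class Foo {
  private var lambda = 1.0
  private var maxIter = 100
  def setLambda(value: Double): this.type = { lambda = value; this }
  def setMaxIter(value: Int): this.type = { maxIter = value; this }
  def train(data: RDD[LabeledPoint]): FooModel = new FooModel
}
{code}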
[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models
[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481464#comment-14481464 ] Joseph K. Bradley commented on SPARK-3702: -- Using Vector types is better since they store values as Array[Double], which avoids creating an object for every value. If you're thinking about feature names/metadata, the Metadata capability in DataFrame will be able to handle metadata for each feature in Vector columns. Standardize MLlib classes for learners, models -- Key: SPARK-3702 URL: https://issues.apache.org/jira/browse/SPARK-3702 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Blocker Summary: Create a class hierarchy for learning algorithms and the models those algorithms produce. This is a super-task of several sub-tasks (but JIRA does not allow subtasks of subtasks). See the requires links below for subtasks. Goals: * give intuitive structure to API, both for developers and for generated documentation * support meta-algorithms (e.g., boosting) * support generic functionality (e.g., evaluation) * reduce code duplication across classes [Design doc for class hierarchy | https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
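As a small illustration of that point (a sketch against the spark.mllib linalg API, not tied to any particular change in this JIRA): a dense Vector is one object wrapping a primitive Array[Double], while a plain Scala collection of Double boxes every element.
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// One object backed by a primitive Array[Double]; no per-value boxing.
val features: Vector = Vectors.dense(1.0, 0.0, 3.5)
val raw: Array[Double] = features.toArray

// By contrast, a Seq[Double] stores a boxed java.lang.Double per element.
val boxed: Seq[Double] = Seq(1.0, 0.0, 3.5)
{code}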
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481874#comment-14481874 ] Burak Yavuz commented on SPARK-6407: I actually worked on this over the weekend for fun and have a streaming, gradient descent based, matrix factorization model implemented here: https://github.com/brkyvz/streaming-matrix-factorization It is a very naive implementation, but it might be something to work on top of. I will publish a Spark Package for it as soon as I get the tests in. The model it uses for predicting ratings for user `u` and product `p` is: {code} r = U(u) * P^T(p) + bu(u) + bp(p) + mu {code} where U(u) is the u'th row of the User matrix, P(p) is the p'th row for the product matrix, bu(u) is the u'th element of the user bias vector, bp(p) is the p'th element of the product bias vector and mu is the global average. Streaming ALS for Collaborative Filtering - Key: SPARK-6407 URL: https://issues.apache.org/jira/browse/SPARK-6407 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Felix Cheung Priority: Minor Like MLLib's ALS implementation for recommendation, and applying to streaming. Similar to streaming linear regression, logistic regression, could we apply gradient updates to batches of data and reuse existing MLLib implementation? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
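Written out as code, the stated prediction rule looks like this (a sketch of the formula only, with illustrative names; it is not taken from the linked repository):
{code}
import org.apache.spark.mllib.linalg.Vector

// r = U(u) * P(p)^T + bu(u) + bp(p) + mu, per the description above.
def predictRating(
    userFactors: Vector,     // row u of the user matrix U
    productFactors: Vector,  // row p of the product matrix P
    userBias: Double,        // bu(u)
    productBias: Double,     // bp(p)
    globalAvg: Double        // mu, the global average rating
  ): Double = {
  val dot = (0 until userFactors.size).map(i => userFactors(i) * productFactors(i)).sum
  dot + userBias + productBias + globalAvg
}
{code}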
[jira] [Created] (SPARK-6725) Model export/import for Pipeline API
Joseph K. Bradley created SPARK-6725: Summary: Model export/import for Pipeline API Key: SPARK-6725 URL: https://issues.apache.org/jira/browse/SPARK-6725 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Critical This is an umbrella JIRA for adding model export/import to the spark.ml API. This JIRA is for adding the internal Saveable/Loadable API and Parquet-based format, not for other formats like PMML. This will require the following steps: * Add export/import for all PipelineStages supported by spark.ml ** This will include some Transformers which are not Models. ** These can use almost the same format as the spark.mllib model save/load functions, but the model metadata must store a different class name (marking the class as a spark.ml class). * After all PipelineStages support save/load, add an interface which forces future additions to support save/load. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6722) Model import/export for StreamingKMeansModel
Joseph K. Bradley created SPARK-6722: Summary: Model import/export for StreamingKMeansModel Key: SPARK-6722 URL: https://issues.apache.org/jira/browse/SPARK-6722 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley CC: [~freeman-lab] Is this API stable enough to merit adding import/export (which will require supporting the model format version from now on)? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5988) Model import/export for PowerIterationClusteringModel
[ https://issues.apache.org/jira/browse/SPARK-5988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481891#comment-14481891 ] Joseph K. Bradley commented on SPARK-5988: -- Feel free to go ahead! I just assigned it to you. Thanks! Model import/export for PowerIterationClusteringModel - Key: SPARK-5988 URL: https://issues.apache.org/jira/browse/SPARK-5988 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Xusen Yin Add save/load for PowerIterationClusteringModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5988) Model import/export for PowerIterationClusteringModel
[ https://issues.apache.org/jira/browse/SPARK-5988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5988: - Assignee: Xusen Yin Model import/export for PowerIterationClusteringModel - Key: SPARK-5988 URL: https://issues.apache.org/jira/browse/SPARK-5988 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Xusen Yin Add save/load for PowerIterationClusteringModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6692) Add an option for client to kill AM when it is killed
[ https://issues.apache.org/jira/browse/SPARK-6692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated SPARK-6692: - Summary: Add an option for client to kill AM when it is killed (was: Make it possible to kill AM in YARN cluster mode when the client is terminated) Add an option for client to kill AM when it is killed - Key: SPARK-6692 URL: https://issues.apache.org/jira/browse/SPARK-6692 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Reporter: Cheolsoo Park Assignee: Cheolsoo Park Priority: Minor Labels: yarn I understand that the yarn-cluster mode is designed for fire-and-forget model; therefore, terminating the yarn client doesn't kill AM. However, it is very common that users submit Spark jobs via job scheduler (e.g. Apache Oozie) or remote job server (e.g. Netflix Genie) where it is expected that killing the yarn client will terminate AM. It is true that the yarn-client mode can be used in such cases. But then, the yarn client sometimes needs lots of heap memory for big jobs if it runs in the yarn-client mode. In fact, the yarn-cluster mode is ideal for big jobs because AM can be given arbitrary heap memory unlike the yarn client. So it would be very useful to make it possible to kill AM even in the yarn-cluster mode. In addition, Spark jobs often become zombie jobs if users ctrl-c them as soon as they're accepted (but not yet running). Although they're eventually shutdown after AM timeout, it would be nice if AM could immediately get killed in such cases too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed
[ https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6222: --- Fix Version/s: 1.4.0 1.3.1 [STREAMING] All data may not be recovered from WAL when driver is killed Key: SPARK-6222 URL: https://issues.apache.org/jira/browse/SPARK-6222 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Hari Shreedharan Priority: Blocker Fix For: 1.3.1, 1.4.0 Attachments: AfterPatch.txt, CleanWithoutPatch.txt, SPARK-6122.patch When testing for our next release, our internal tests written by [~wypoon] caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs FlumePolling stream to read data from Flume, then kills the Application Master. Once YARN restarts it, the test waits until no more data is to be written and verifies the original against the data on HDFS. This was passing in 1.2.0, but is failing now. Since the test ties into Cloudera's internal infrastructure and build process, it cannot be directly run on an Apache build. But I have been working on isolating the commit that may have caused the regression. I have confirmed that it was caused by SPARK-5147 (PR # [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several times using the test and the failure is consistently reproducible. To re-confirm, I reverted just this one commit (and Clock consolidation one to avoid conflicts), and the issue was no longer reproducible. Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0 /cc [~tdas], [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6606) Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd object.
[ https://issues.apache.org/jira/browse/SPARK-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481639#comment-14481639 ] Apache Spark commented on SPARK-6606: - User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/4145 Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd object. - Key: SPARK-6606 URL: https://issues.apache.org/jira/browse/SPARK-6606 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0, 1.3.0 Reporter: SuYan 1. With code like the example below, the accumulator is deserialized twice. First:
{code}
task = ser.deserialize[Task[Any]](taskBytes, Thread.currentThread.getContextClassLoader)
{code}
Second:
{code}
val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
  ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
{code}
The first deserialization is not what is expected, because a ResultTask or ShuffleMapTask carries a partition object. In class
{code}
CoGroupedRDD[K](@transient var rdds: Seq[RDD[_ <: Product2[K, _]]], part: Partitioner)
{code}
the CoGroupPartition may contain a CoGroupSplitDep:
{code}
NarrowCoGroupSplitDep(
  rdd: RDD[_],
  splitIndex: Int,
  var split: Partition
) extends CoGroupSplitDep {
{code}
That *NarrowCoGroupSplitDep* pulls in the rdd object, which is what triggers the first deserialization. Example:
{code}
val acc1 = sc.accumulator(0, "test1")
val acc2 = sc.accumulator(0, "test2")
val rdd1 = sc.parallelize((1 to 10).toSeq, 3)
val rdd2 = sc.parallelize((1 to 10).toSeq, 3)
val combine1 = rdd1.map { case a => (a, 1) }.combineByKey(
  a => { acc1 += 1; a },
  (a: Int, b: Int) => a + b,
  (a: Int, b: Int) => a + b,
  new HashPartitioner(3),
  mapSideCombine = false)
val combine2 = rdd2.map { case a => (a, 1) }.combineByKey(
  a => { acc2 += 1; a },
  (a: Int, b: Int) => a + b,
  (a: Int, b: Int) => a + b,
  new HashPartitioner(3),
  mapSideCombine = false)
combine1.cogroup(combine2, new HashPartitioner(3)).count()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481711#comment-14481711 ] Xiangrui Meng commented on SPARK-6407: -- Attached the comment from Chunnan Yao in SPARK-6711: On-line Collaborative Filtering(CF) has been widely used and studied. To re-train a CF model from scratch every time when new data comes in is very inefficient (http://stackoverflow.com/questions/27734329/apache-spark-incremental-training-of-als-model). However, in Spark community we see few discussion about collaborative filtering on streaming data. Given streaming k-means, streaming logistic regression, and the on-going incremental model training of Naive Bayes Classifier (SPARK-4144), we think it is meaningful to consider streaming Collaborative Filtering support on MLlib. We have already been considering about this issue during the past week. We plan to refer to this paper (https://www.cs.utexas.edu/~cjohnson/ParallelCollabFilt.pdf). It is based on SGD instead of ALS, which is easier to be tackled under streaming data. Fortunately, the authors of this paper have implemented their algorithm as a Github Project, based on Storm: https://github.com/MrChrisJohnson/CollabStream Streaming ALS for Collaborative Filtering - Key: SPARK-6407 URL: https://issues.apache.org/jira/browse/SPARK-6407 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Felix Cheung Priority: Minor Like MLLib's ALS implementation for recommendation, and applying to streaming. Similar to streaming linear regression, logistic regression, could we apply gradient updates to batches of data and reuse existing MLLib implementation? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
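For reference, the core of an SGD-based factorization update is small; a minimal sketch under the usual formulation (learning rate eta, L2 regularization lambda), not code from CollabStream or the paper:
{code}
// One SGD step for a single observed rating: move the user and product
// factor arrays against the prediction error.
def sgdStep(
    userF: Array[Double],
    prodF: Array[Double],
    rating: Double,
    eta: Double,
    lambda: Double): Unit = {
  val pred = userF.zip(prodF).map { case (u, p) => u * p }.sum
  val err = rating - pred
  var i = 0
  while (i < userF.length) {
    val u = userF(i)
    val p = prodF(i)
    userF(i) = u + eta * (err * p - lambda * u)
    prodF(i) = p + eta * (err * u - lambda * p)
    i += 1
  }
}
{code}
Because each update touches only one user row and one product row, this fits batches of streaming ratings more naturally than re-running ALS from scratch.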
[jira] [Closed] (SPARK-6711) Support parallelized online matrix factorization for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-6711. Resolution: Duplicate Support parallelized online matrix factorization for Collaborative Filtering - Key: SPARK-6711 URL: https://issues.apache.org/jira/browse/SPARK-6711 Project: Spark Issue Type: Improvement Components: MLlib, Streaming Reporter: Chunnan Yao Original Estimate: 840h Remaining Estimate: 840h On-line Collaborative Filtering(CF) has been widely used and studied. To re-train a CF model from scratch every time when new data comes in is very inefficient (http://stackoverflow.com/questions/27734329/apache-spark-incremental-training-of-als-model). However, in Spark community we see few discussion about collaborative filtering on streaming data. Given streaming k-means, streaming logistic regression, and the on-going incremental model training of Naive Bayes Classifier (SPARK-4144), we think it is meaningful to consider streaming Collaborative Filtering support on MLlib. We have already been considering about this issue during the past week. We plan to refer to this paper (https://www.cs.utexas.edu/~cjohnson/ParallelCollabFilt.pdf). It is based on SGD instead of ALS, which is easier to be tackled under streaming data. Fortunately, the authors of this paper have implemented their algorithm as a Github Project, based on Storm: https://github.com/MrChrisJohnson/CollabStream -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
[ https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6720: - Assignee: Kai Sasaki PySpark MultivariateStatisticalSummary unit test for normL1 and normL2 -- Key: SPARK-6720 URL: https://issues.apache.org/jira/browse/SPARK-6720 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Kai Sasaki Assignee: Kai Sasaki Priority: Minor Implement correct normL1 and normL2 test. continuation: https://github.com/apache/spark/pull/5359 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
[ https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6720: - Target Version/s: 1.4.0 Fix Version/s: (was: 1.4.0) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2 -- Key: SPARK-6720 URL: https://issues.apache.org/jira/browse/SPARK-6720 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Kai Sasaki Priority: Minor Implement correct normL1 and normL2 test. continuation: https://github.com/apache/spark/pull/5359 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6718) Improve the test on normL1/normL2 of summary statistics
[ https://issues.apache.org/jira/browse/SPARK-6718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-6718. Resolution: Duplicate Improve the test on normL1/normL2 of summary statistics --- Key: SPARK-6718 URL: https://issues.apache.org/jira/browse/SPARK-6718 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Kai Sasaki Priority: Minor As discussed on https://github.com/apache/spark/pull/5359, we should improve the unit test there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
[ https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6720: - Component/s: PySpark PySpark MultivariateStatisticalSummary unit test for normL1 and normL2 -- Key: SPARK-6720 URL: https://issues.apache.org/jira/browse/SPARK-6720 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Kai Sasaki Priority: Minor Implement correct normL1 and normL2 test. continuation: https://github.com/apache/spark/pull/5359 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
[ https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6720: - Affects Version/s: (was: 1.3.0) 1.4.0 PySpark MultivariateStatisticalSummary unit test for normL1 and normL2 -- Key: SPARK-6720 URL: https://issues.apache.org/jira/browse/SPARK-6720 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Kai Sasaki Priority: Minor Implement correct normL1 and normL2 test. continuation: https://github.com/apache/spark/pull/5359 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
[ https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6720: - Issue Type: Improvement (was: Bug) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2 -- Key: SPARK-6720 URL: https://issues.apache.org/jira/browse/SPARK-6720 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Kai Sasaki Priority: Minor Implement correct normL1 and normL2 test. continuation: https://github.com/apache/spark/pull/5359 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6713) Iterators in columnSimilarities to allow flatMap spill
[ https://issues.apache.org/jira/browse/SPARK-6713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6713: - Assignee: Reza Zadeh Iterators in columnSimilarities to allow flatMap spill -- Key: SPARK-6713 URL: https://issues.apache.org/jira/browse/SPARK-6713 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Assignee: Reza Zadeh Fix For: 1.4.0 We should use Iterators in columnSimilarities to allow mapPartitionsWithIndex to spill to disk. This could happen in a dense and large column - this way Spark can spill the pairs onto disk instead of building all the pairs before handing them to Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
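The pattern being asked for, sketched in isolation (a simplified stand-in, not the actual RowMatrix.columnSimilarities code): return an Iterator from the partition function so pairs are produced lazily and can be spilled, instead of materializing every pair in a local buffer first.
{code}
import org.apache.spark.rdd.RDD

def pairContributions(rows: RDD[Array[Double]]): RDD[((Int, Int), Double)] = {
  rows.mapPartitionsWithIndex { (partitionIndex, rowIter) =>
    rowIter.flatMap { row =>
      // Lazily emit one contribution per column pair of this row; nothing
      // beyond the current element is buffered, so Spark can spill the pairs.
      for {
        i <- Iterator.range(0, row.length)
        j <- Iterator.range(i + 1, row.length)
      } yield ((i, j), row(i) * row(j))
    }
  }
}
{code}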
[jira] [Created] (SPARK-6724) Model import/export for FPGrowth
Joseph K. Bradley created SPARK-6724: Summary: Model import/export for FPGrowth Key: SPARK-6724 URL: https://issues.apache.org/jira/browse/SPARK-6724 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Note: experimental model API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6723) Model import/export for ChiSqSelector
Joseph K. Bradley created SPARK-6723: Summary: Model import/export for ChiSqSelector Key: SPARK-6723 URL: https://issues.apache.org/jira/browse/SPARK-6723 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6710) Wrong initial bias in GraphX SVDPlusPlus
[ https://issues.apache.org/jira/browse/SPARK-6710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482063#comment-14482063 ] Reynold Xin commented on SPARK-6710: [~michaelmalak] would you like to submit a pull request for this? Wrong initial bias in GraphX SVDPlusPlus Key: SPARK-6710 URL: https://issues.apache.org/jira/browse/SPARK-6710 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.3.0 Reporter: Michael Malak Labels: easyfix Original Estimate: 2h Remaining Estimate: 2h In the initialization portion of GraphX SVDPlusPluS, the initialization of biases appears to be incorrect. Specifically, in line https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/SVDPlusPlus.scala#L96 instead of (vd._1, vd._2, msg.get._2 / msg.get._1, 1.0 / scala.math.sqrt(msg.get._1)) it should probably be (vd._1, vd._2, msg.get._2 / msg.get._1 - u, 1.0 / scala.math.sqrt(msg.get._1)) That is, the biases bu and bi (both represented as the third component of the Tuple4[] above, depending on whether the vertex is a user or an item), described in equation (1) of the Koren paper, are supposed to be small offsets to the mean (represented by the variable u, signifying the Greek letter mu) to account for peculiarities of individual users and items. Initializing these biases to wrong values should theoretically not matter given enough iterations of the algorithm, but some quick empirical testing shows it has trouble converging at all, even after many orders of magnitude additional iterations. This perhaps could be the source of previously reported trouble with SVDPlusPlus. http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-SVDPlusPlus-problem-td12885.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
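To see the size of the discrepancy, a small worked example with made-up numbers (the names mirror the snippet above: msg.get._1 is the number of ratings aggregated at the vertex, msg.get._2 is their sum, and u is the global mean):
{code}
// Made-up values for illustration only.
val ratingCount = 4.0    // msg.get._1
val ratingSum   = 18.0   // msg.get._2
val globalMean  = 3.6    // u

val biasCurrent  = ratingSum / ratingCount               // 4.5 -> an absolute mean, not an offset
val biasProposed = ratingSum / ratingCount - globalMean  // 0.9 -> a small offset from the mean, as in eq. (1)
{code}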
[jira] [Created] (SPARK-6728) Improve performance of py4j for large bytearray
Davies Liu created SPARK-6728: - Summary: Improve performance of py4j for large bytearray Key: SPARK-6728 URL: https://issues.apache.org/jira/browse/SPARK-6728 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu PySpark relies on py4j to transfer function arguments and return values between Python and the JVM, and passing a large bytearray (larger than 10 MB) is very slow. In MLlib it is possible to have a Vector of more than 100 MB, which may need a few GB of memory and may crash. The reason is that py4j uses a text protocol: it encodes the bytearray as base64 and performs multiple string concatenations. A binary protocol would help a lot; an issue has been created for py4j: https://github.com/bartdag/py4j/issues/159 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
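The base64 overhead alone is easy to see; a quick sketch (using java.util.Base64 only to illustrate the roughly 33% inflation, not the py4j code path itself):
{code}
import java.util.Base64

val payload = new Array[Byte](10 * 1024 * 1024)          // a 10 MB bytearray
val encoded = Base64.getEncoder.encodeToString(payload)  // roughly 13.3 MB of text
println(s"raw=${payload.length} bytes, base64=${encoded.length} chars")
// On top of the inflation, a text protocol builds the command via repeated
// string concatenation, which copies the encoded payload again.
{code}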
[jira] [Assigned] (SPARK-6229) Support SASL encryption in network/common module
[ https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6229: --- Assignee: (was: Apache Spark) Support SASL encryption in network/common module Key: SPARK-6229 URL: https://issues.apache.org/jira/browse/SPARK-6229 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin After SASL support has been added to network/common, supporting encryption should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. Since the latter requires a valid kerberos login to work (and so doesn't really work with executors), encryption would require the use of DIGEST-MD5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6229) Support SASL encryption in network/common module
[ https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6229: --- Assignee: Apache Spark Support SASL encryption in network/common module Key: SPARK-6229 URL: https://issues.apache.org/jira/browse/SPARK-6229 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin Assignee: Apache Spark After SASL support has been added to network/common, supporting encryption should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. Since the latter requires a valid kerberos login to work (and so doesn't really work with executors), encryption would require the use of DIGEST-MD5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6229) Support SASL encryption in network/common module
[ https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482157#comment-14482157 ] Apache Spark commented on SPARK-6229: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/5377 Support SASL encryption in network/common module Key: SPARK-6229 URL: https://issues.apache.org/jira/browse/SPARK-6229 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin After SASL support has been added to network/common, supporting encryption should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. Since the latter requires a valid kerberos login to work (and so doesn't really work with executors), encryption would require the use of DIGEST-MD5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482244#comment-14482244 ] Patrick Walsh commented on SPARK-5281: -- I also have this issue with spark 1.3.0. Even example snippets where case classes are used in the rrd's trigger the problem. For me, this happens from eclipse and from sbt. Registering table on RDD is giving MissingRequirementError -- Key: SPARK-5281 URL: https://issues.apache.org/jira/browse/SPARK-5281 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: sarsol Priority: Critical Application crashes on this line {{rdd.registerTempTable(temp)}} in 1.2 version when using sbt or Eclipse SCALA IDE Stacktrace: {code} Exception in thread main scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program Files\Java\jre7\lib\resources.jar;C:\Program Files\Java\jre7\lib\rt.jar;C:\Program Files\Java\jre7\lib\sunrsasign.jar;C:\Program Files\Java\jre7\lib\jsse.jar;C:\Program Files\Java\jre7\lib\jce.jar;C:\Program Files\Java\jre7\lib\charsets.jar;C:\Program Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found. 
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335) at scala.reflect.api.Universe.typeOf(Universe.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94) at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33) at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111) at com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43) at scala.Function0$class.apply$mcV$sp(Function0.scala:40) at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12) at scala.App$$anonfun$main$1.apply(App.scala:71) at scala.App$$anonfun$main$1.apply(App.scala:71) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32) at scala.App$class.main(App.scala:71) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6704) integrate SparkR docs build tool into Spark doc build
[ https://issues.apache.org/jira/browse/SPARK-6704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481972#comment-14481972 ] Davies Liu commented on SPARK-6704: --- Great, thanks! integrate SparkR docs build tool into Spark doc build - Key: SPARK-6704 URL: https://issues.apache.org/jira/browse/SPARK-6704 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Davies Liu Priority: Blocker We should integrate the SparkR docs build tool into Spark one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6729) DriverQuirks get can get OutOfBounds exception is some cases
Volodymyr Lyubinets created SPARK-6729: -- Summary: DriverQuirks get can get OutOfBounds exception is some cases Key: SPARK-6729 URL: https://issues.apache.org/jira/browse/SPARK-6729 Project: Spark Issue Type: Bug Components: SQL Reporter: Volodymyr Lyubinets Priority: Minor The function uses .substring(0, X), which will trigger OutOfBoundsException if string length is less than X. A better way to do this is to use startsWith, which won't error out in this case. I'll propose a patch shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
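A reduced sketch of the failure mode and the proposed fix (the actual DriverQuirks code and the patch may differ):
{code}
val url = "jdbc:h2"                            // shorter than the prefix being tested

// substring-based check: throws StringIndexOutOfBoundsException for short URLs
// val isMySql = url.substring(0, 11) == "jdbc:mysql:"

// startsWith-based check: simply returns false, no exception
val isMySql = url.startsWith("jdbc:mysql:")
{code}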
[jira] [Assigned] (SPARK-6729) DriverQuirks get can get OutOfBounds exception is some cases
[ https://issues.apache.org/jira/browse/SPARK-6729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6729: --- Assignee: (was: Apache Spark) DriverQuirks get can get OutOfBounds exception is some cases Key: SPARK-6729 URL: https://issues.apache.org/jira/browse/SPARK-6729 Project: Spark Issue Type: Bug Components: SQL Reporter: Volodymyr Lyubinets Priority: Minor The function uses .substring(0, X), which will trigger OutOfBoundsException if string length is less than X. A better way to do this is to use startsWith, which won't error out in this case. I'll propose a patch shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6729) DriverQuirks get can get OutOfBounds exception is some cases
[ https://issues.apache.org/jira/browse/SPARK-6729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482193#comment-14482193 ] Apache Spark commented on SPARK-6729: - User 'vlyubin' has created a pull request for this issue: https://github.com/apache/spark/pull/5378 DriverQuirks get can get OutOfBounds exception is some cases Key: SPARK-6729 URL: https://issues.apache.org/jira/browse/SPARK-6729 Project: Spark Issue Type: Bug Components: SQL Reporter: Volodymyr Lyubinets Priority: Minor The function uses .substring(0, X), which will trigger OutOfBoundsException if string length is less than X. A better way to do this is to use startsWith, which won't error out in this case. I'll propose a patch shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6726) Model export/import for spark.ml: LogisticRegression
Joseph K. Bradley created SPARK-6726: Summary: Model export/import for spark.ml: LogisticRegression Key: SPARK-6726 URL: https://issues.apache.org/jira/browse/SPARK-6726 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6728) Improve performance of py4j for large bytearray
[ https://issues.apache.org/jira/browse/SPARK-6728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6728: Affects Version/s: 1.3.0 Improve performance of py4j for large bytearray --- Key: SPARK-6728 URL: https://issues.apache.org/jira/browse/SPARK-6728 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.0 Reporter: Davies Liu PySpark relies on py4j to transfer function arguments and return between Python and JVM, it's very slow to pass a large bytearray (larger than 10M). In MLlib, it's possible to have a Vector with more than 100M bytes, which will need few GB memory, may crash. The reason is that py4j use text protocol, it will encode the bytearray as base64, and do multiple string concat. Binary will help a lot, create a issue for py4j: https://github.com/bartdag/py4j/issues/159 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6728) Improve performance of py4j for large bytearray
[ https://issues.apache.org/jira/browse/SPARK-6728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6728: Priority: Critical (was: Major) Target Version/s: 1.4.0 Improve performance of py4j for large bytearray --- Key: SPARK-6728 URL: https://issues.apache.org/jira/browse/SPARK-6728 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.0 Reporter: Davies Liu Priority: Critical PySpark relies on py4j to transfer function arguments and return between Python and JVM, it's very slow to pass a large bytearray (larger than 10M). In MLlib, it's possible to have a Vector with more than 100M bytes, which will need few GB memory, may crash. The reason is that py4j use text protocol, it will encode the bytearray as base64, and do multiple string concat. Binary will help a lot, create a issue for py4j: https://github.com/bartdag/py4j/issues/159 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6727) Model export/import for spark.ml: HashingTF
Joseph K. Bradley created SPARK-6727: Summary: Model export/import for spark.ml: HashingTF Key: SPARK-6727 URL: https://issues.apache.org/jira/browse/SPARK-6727 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6729) DriverQuirks get can get OutOfBounds exception is some cases
[ https://issues.apache.org/jira/browse/SPARK-6729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6729: --- Assignee: Apache Spark DriverQuirks get can get OutOfBounds exception is some cases Key: SPARK-6729 URL: https://issues.apache.org/jira/browse/SPARK-6729 Project: Spark Issue Type: Bug Components: SQL Reporter: Volodymyr Lyubinets Assignee: Apache Spark Priority: Minor The function uses .substring(0, X), which will trigger OutOfBoundsException if string length is less than X. A better way to do this is to use startsWith, which won't error out in this case. I'll propose a patch shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482297#comment-14482297 ] Sai Nishanth Parepally commented on SPARK-3219: --- [~mengxr], is https://github.com/derrickburns/generalized-kmeans-clustering going to be merged into mllib as I would like to use jaccard distance as a distance metric for kmeans clustering? K-Means clusterer should support Bregman distance functions --- Key: SPARK-3219 URL: https://issues.apache.org/jira/browse/SPARK-3219 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Derrick Burns Assignee: Derrick Burns Labels: clustering The K-Means clusterer supports the Euclidean distance metric. However, it is rather straightforward to support Bregman (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) distance functions which would increase the utility of the clusterer tremendously. I have modified the clusterer to support pluggable distance functions. However, I notice that there are hundreds of outstanding pull requests. If someone is willing to work with me to sponsor the work through the process, I will create a pull request. Otherwise, I will just keep my own fork. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
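For context on what "pluggable distance functions" could look like, a hedged sketch (illustrative only, not the API of the linked generalized-kmeans-clustering repository):
{code}
import org.apache.spark.mllib.linalg.Vector

// A point-to-center divergence that a clusterer could take as a parameter.
trait PointDivergence extends Serializable {
  def divergence(point: Vector, center: Vector): Double
}

// Squared Euclidean distance: the Bregman divergence of x => ||x||^2.
object SquaredEuclidean extends PointDivergence {
  def divergence(p: Vector, c: Vector): Double =
    (0 until p.size).map { i => val d = p(i) - c(i); d * d }.sum
}

// Generalized KL (I-divergence): the Bregman divergence of x => x log x,
// suitable for non-negative data such as counts.
object GeneralizedKL extends PointDivergence {
  def divergence(p: Vector, c: Vector): Double =
    (0 until p.size).map { i =>
      val x = p(i); val y = c(i)
      if (x == 0.0) y else x * math.log(x / y) - x + y
    }.sum
}
{code}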
[jira] [Commented] (SPARK-6721) IllegalStateException
[ https://issues.apache.org/jira/browse/SPARK-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482367#comment-14482367 ] Sean Owen commented on SPARK-6721: -- (Also IllegalStateException isn't a useful JIRA name -- please edit it to something more meaningful, like including mongo) IllegalStateException - Key: SPARK-6721 URL: https://issues.apache.org/jira/browse/SPARK-6721 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0, 1.2.1, 1.3.0 Environment: Ubuntu 14.04, Java 8, MongoDB 3.0, Spark 1.3 Reporter: Luis Rodríguez Trejo Labels: MongoDB, java.lang.IllegalStateexception, saveAsNewAPIHadoopFile I get the following exception when using saveAsNewAPIHadoopFile: {code} 15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 10.0.2.15): java.lang.IllegalStateException: open at org.bson.util.Assertions.isTrue(Assertions.java:36) at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406) at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184) at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167) at com.mongodb.DBCollection.insert(DBCollection.java:161) at com.mongodb.DBCollection.insert(DBCollection.java:107) at com.mongodb.DBCollection.save(DBCollection.java:1049) at com.mongodb.DBCollection.save(DBCollection.java:1014) at com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} Before Spark 1.3.0 this would result in the application crashing, but now the data just remains unprocessed. There is no close instruction at any part of the code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6730) Can't have table as identifier in OPTIONS
[ https://issues.apache.org/jira/browse/SPARK-6730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Liu updated SPARK-6730: Description: The following query fails because there is an identifier table in OPTIONS {code} CREATE TEMPORARY TABLE ddlTable USING org.apache.spark.sql.cassandra OPTIONS ( table test1, keyspace test ) {code} The following error {code} ] java.lang.RuntimeException: [1.2] failure: ``insert'' expected but identifier CREATE found [info] [info] CREATE TEMPORARY TABLE ddlTable USING org.apache.spark.sql.cassandra OPTIONS ( table test1, keyspace dstest ) [info] ^ [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) [info] at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130) [info] at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130) [info] at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) [info] at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) [info] at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) [info] at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) [info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:134) [info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:134) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:134) {code} was: The following query fails because there is an identifier table in OPTIONS {code} CREATE TEMPORARY TABLE ddlTable USING org.apache.spark.sql.cassandra OPTIONS ( table test1, keyspace test {code} The following error {code} ] java.lang.RuntimeException: [1.2] failure: ``insert'' expected but identifier CREATE found [info] [info] CREATE TEMPORARY TABLE ddlTable USING org.apache.spark.sql.cassandra OPTIONS ( table test1, keyspace dstest ) [info] ^ [info] at 
scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) [info] at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130) [info] at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130) [info] at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) [info] at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at
[jira] [Created] (SPARK-6730) Can't have table as identifier in OPTIONS
Alex Liu created SPARK-6730: --- Summary: Can't have table as identifier in OPTIONS Key: SPARK-6730 URL: https://issues.apache.org/jira/browse/SPARK-6730 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Alex Liu The following query fails because there is an identifier table in OPTIONS {code} CREATE TEMPORARY TABLE ddlTable USING org.apache.spark.sql.cassandra OPTIONS ( table test1, keyspace test {code} The following error {code} ] java.lang.RuntimeException: [1.2] failure: ``insert'' expected but identifier CREATE found [info] [info] CREATE TEMPORARY TABLE ddlTable USING org.apache.spark.sql.cassandra OPTIONS ( table test1, keyspace dstest ) [info] ^ [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40) [info] at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130) [info] at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130) [info] at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) [info] at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) [info] at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) [info] at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38) [info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:134) [info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:134) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:134) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
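Until the DDL parser accepts reserved words such as {{table}} as option keys, one possible workaround (a sketch only, assuming the Cassandra source accepts the same "table"/"keyspace" option keys through the programmatic data source API and that {{sc}} is an existing SparkContext) is to bypass the SQL parser and register the resulting DataFrame as a temporary table:
{code:scala}
import org.apache.spark.sql.SQLContext

// Sketch of a workaround: pass "table" and "keyspace" as plain option-map keys
// via the programmatic data source API instead of the DDL OPTIONS clause.
val sqlContext = new SQLContext(sc)

val df = sqlContext.load(
  "org.apache.spark.sql.cassandra",
  Map("table" -> "test1", "keyspace" -> "test"))

df.registerTempTable("ddlTable")
{code}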
[jira] [Commented] (SPARK-6721) IllegalStateException
[ https://issues.apache.org/jira/browse/SPARK-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482366#comment-14482366 ] Sean Owen commented on SPARK-6721: -- Isn't this an error / config problem in Mongo rather than Spark? IllegalStateException - Key: SPARK-6721 URL: https://issues.apache.org/jira/browse/SPARK-6721 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0, 1.2.1, 1.3.0 Environment: Ubuntu 14.04, Java 8, MongoDB 3.0, Spark 1.3 Reporter: Luis Rodríguez Trejo Labels: MongoDB, java.lang.IllegalStateexception, saveAsNewAPIHadoopFile I get the following exception when using saveAsNewAPIHadoopFile: {code} 15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 10.0.2.15): java.lang.IllegalStateException: open at org.bson.util.Assertions.isTrue(Assertions.java:36) at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406) at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184) at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167) at com.mongodb.DBCollection.insert(DBCollection.java:161) at com.mongodb.DBCollection.insert(DBCollection.java:107) at com.mongodb.DBCollection.save(DBCollection.java:1049) at com.mongodb.DBCollection.save(DBCollection.java:1014) at com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} Before Spark 1.3.0 this would result in the application crashing, but now the data just remains unprocessed. There is no close instruction at any part of the code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6599) Improve reliability and usability of Kinesis-based Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-6599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-6599: Summary: Improve reliability and usability of Kinesis-based Spark Streaming (was: Add Kinesis Direct API) Improve reliability and usability of Kinesis-based Spark Streaming -- Key: SPARK-6599 URL: https://issues.apache.org/jira/browse/SPARK-6599 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2960) Spark executables fail to start via symlinks
[ https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2960: - Component/s: Deploy Spark executables fail to start via symlinks Key: SPARK-2960 URL: https://issues.apache.org/jira/browse/SPARK-2960 Project: Spark Issue Type: Bug Components: Deploy Reporter: Shay Rojansky Priority: Minor The current scripts (e.g. pyspark) fail to run when they are executed via symlinks. A common Linux scenario would be to have Spark installed somewhere (e.g. /opt) and have a symlink to it in /usr/bin. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6732) Scala existentials warning during compilation
Raymond Tay created SPARK-6732: -- Summary: Scala existentials warning during compilation Key: SPARK-6732 URL: https://issues.apache.org/jira/browse/SPARK-6732 Project: Spark Issue Type: Improvement Components: Scheduler Environment: operating system: OSX Yosemite scala version: 2.10.4 hardware: 2.7 GHz Intel Core i7, 16 GB 1600 MHz DDR3 Reporter: Raymond Tay Priority: Minor Certain parts of the Scala code were detected to use existentials, but the relevant scala.language import can be included in the source files to prevent such warnings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6343) Make doc more explicit regarding network connectivity requirements
[ https://issues.apache.org/jira/browse/SPARK-6343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482496#comment-14482496 ] Apache Spark commented on SPARK-6343: - User 'parente' has created a pull request for this issue: https://github.com/apache/spark/pull/5382 Make doc more explicit regarding network connectivity requirements -- Key: SPARK-6343 URL: https://issues.apache.org/jira/browse/SPARK-6343 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Peter Parente Priority: Minor As a new user of Spark, I read through the official documentation before attempting to stand-up my own cluster and write my own driver application. But only after attempting to run my app remotely against my cluster did I realize that full network connectivity (layer 3) is necessary between my driver program and worker nodes (i.e., my driver was *listening* for connections from my workers). I returned to the documentation to see how I had missed this requirement. On a second read-through, I saw that the doc hints at it in a few places (e.g., [driver config|http://spark.apache.org/docs/1.2.0/configuration.html#networking], [submitting applications suggestion|http://spark.apache.org/docs/1.2.0/submitting-applications.html], [cluster overview|http://spark.apache.org/docs/1.2.0/cluster-overview.html]) but never outright says it. I think it would help would-be users better understand how Spark works to state the network connectivity requirements right up-front in the overview section of the doc. I suggest revising the diagram and accompanying text found on the [overview page|http://spark.apache.org/docs/1.2.0/cluster-overview.html]: !http://spark.apache.org/docs/1.2.0/img/cluster-overview.png! so that it depicts at least the directionality of the network connections initiated (perhaps like so): !http://i.imgur.com/2dqGbCr.png! and states that the driver must listen for and accept connections from other Spark components on a variety of ports. Please treat my diagram and text as strawmen: I expect more experienced Spark users and developers will have better ideas on how to convey these requirements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6733) Suppression of usage of Scala existential code should be done
Raymond Tay created SPARK-6733: -- Summary: Suppression of usage of Scala existential code should be done Key: SPARK-6733 URL: https://issues.apache.org/jira/browse/SPARK-6733 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 1.3.0 Environment: OS: OSX Yosemite Hardware: Intel Core i7 with 16 GB RAM Reporter: Raymond Tay The inclusion of this statement in the file {code:scala} import scala.language.existentials {code} should have suppressed all warnings regarding the use of scala existential code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6729) DriverQuirks.get can get an OutOfBounds exception in some cases
[ https://issues.apache.org/jira/browse/SPARK-6729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-6729. --- Resolution: Fixed Fix Version/s: 1.4.0 DriverQuirks get can get OutOfBounds exception is some cases Key: SPARK-6729 URL: https://issues.apache.org/jira/browse/SPARK-6729 Project: Spark Issue Type: Bug Components: SQL Reporter: Volodymyr Lyubinets Assignee: Volodymyr Lyubinets Priority: Minor Fix For: 1.4.0 The function uses .substring(0, X), which will trigger OutOfBoundsException if string length is less than X. A better way to do this is to use startsWith, which won't error out in this case. I'll propose a patch shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
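For context, a minimal sketch (not the actual DriverQuirks code) of why startsWith is safer than substring for matching a JDBC URL prefix:
{code:scala}
// Illustrative only: substring(0, n) throws StringIndexOutOfBoundsException
// when the URL is shorter than the prefix, while startsWith simply returns false.
def looksLikeMySql(url: String): Boolean = {
  // url.substring(0, 10) == "jdbc:mysql"   // fails for short URLs such as "jdbc:h2"
  url.startsWith("jdbc:mysql")
}

looksLikeMySql("jdbc:mysql://host/db")  // true
looksLikeMySql("jdbc:h2")               // false, no exception
{code}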
[jira] [Updated] (SPARK-6729) DriverQuirks.get can get an OutOfBounds exception in some cases
[ https://issues.apache.org/jira/browse/SPARK-6729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-6729: -- Assignee: Volodymyr Lyubinets DriverQuirks get can get OutOfBounds exception is some cases Key: SPARK-6729 URL: https://issues.apache.org/jira/browse/SPARK-6729 Project: Spark Issue Type: Bug Components: SQL Reporter: Volodymyr Lyubinets Assignee: Volodymyr Lyubinets Priority: Minor Fix For: 1.4.0 The function uses .substring(0, X), which will trigger OutOfBoundsException if string length is less than X. A better way to do this is to use startsWith, which won't error out in this case. I'll propose a patch shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6506) python support yarn cluster mode requires SPARK_HOME to be set
[ https://issues.apache.org/jira/browse/SPARK-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482414#comment-14482414 ] Kostas Sakellis commented on SPARK-6506: I ran into this issue too by running: bq. spark-submit --master yarn-cluster examples/pi.py 4 it looks like I only had to set: spark.yarn.appMasterEnv.SPARK_HOME=/bogus to get it going: bq. spark-submit --conf spark.yarn.appMasterEnv.SPARK_HOME=/bogus --master yarn-cluster pi.py 4 python support yarn cluster mode requires SPARK_HOME to be set -- Key: SPARK-6506 URL: https://issues.apache.org/jira/browse/SPARK-6506 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0 Reporter: Thomas Graves We added support for python running in yarn cluster mode in https://issues.apache.org/jira/browse/SPARK-5173, but it requires that SPARK_HOME be set in the environment variables for application master and executor. It doesn't have to be set to anything real but it fails if its not set. See the command at the end of: https://github.com/apache/spark/pull/3976 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
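The same workaround can also be expressed programmatically (a sketch; the /bogus value is just a placeholder, since the application master only checks that SPARK_HOME is present in its environment):
{code:scala}
import org.apache.spark.SparkConf

// Any non-empty value works until the underlying SPARK_HOME requirement is removed.
val conf = new SparkConf().set("spark.yarn.appMasterEnv.SPARK_HOME", "/bogus")
{code}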
[jira] [Assigned] (SPARK-6731) Upgrade Apache commons-math3 to 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-6731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6731: --- Assignee: (was: Apache Spark) Upgrade Apache commons-math3 to 3.4.1 - Key: SPARK-6731 URL: https://issues.apache.org/jira/browse/SPARK-6731 Project: Spark Issue Type: Dependency upgrade Components: Spark Core Affects Versions: 1.3.0 Reporter: Punya Biswal Spark depends on Apache commons-math3 version 3.1.1, which is 2 years old. The current version (3.4.1) includes approximate percentile statistics (among other things). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6731) Upgrade Apache commons-math3 to 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-6731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482460#comment-14482460 ] Apache Spark commented on SPARK-6731: - User 'punya' has created a pull request for this issue: https://github.com/apache/spark/pull/5380 Upgrade Apache commons-math3 to 3.4.1 - Key: SPARK-6731 URL: https://issues.apache.org/jira/browse/SPARK-6731 Project: Spark Issue Type: Dependency upgrade Components: Spark Core Affects Versions: 1.3.0 Reporter: Punya Biswal Spark depends on Apache commons-math3 version 3.1.1, which is 2 years old. The current version (3.4.1) includes approximate percentile statistics (among other things). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
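As an illustration of the approximate percentile support mentioned above (a sketch, assuming commons-math3 3.2 or later is on the classpath), PSquarePercentile estimates a quantile in a single pass without storing the data:
{code:scala}
import org.apache.commons.math3.stat.descriptive.rank.PSquarePercentile

// Streaming estimate of the median over 100,000 values.
val median = new PSquarePercentile(50.0)
(1 to 100000).foreach(i => median.increment(i.toDouble))
println(median.getResult())  // approximately 50000
{code}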
[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482291#comment-14482291 ] William Benton commented on SPARK-5281: --- As [~marmbrus] recently pointed out on the user list, this happens when you don't have all of the dependencies for Scala reflection loaded by the primordial classloader. For running apps from sbt, setting {{fork := true}} should do the trick. For running a REPL from sbt, try [this workaround|http://chapeau.freevariable.com/2015/04/spark-sql-repl.html]. (Sorry to not have a solution for Eclipse.) Registering table on RDD is giving MissingRequirementError -- Key: SPARK-5281 URL: https://issues.apache.org/jira/browse/SPARK-5281 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: sarsol Priority: Critical Application crashes on this line {{rdd.registerTempTable(temp)}} in 1.2 version when using sbt or Eclipse SCALA IDE Stacktrace: {code} Exception in thread main scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program Files\Java\jre7\lib\resources.jar;C:\Program Files\Java\jre7\lib\rt.jar;C:\Program Files\Java\jre7\lib\sunrsasign.jar;C:\Program Files\Java\jre7\lib\jsse.jar;C:\Program Files\Java\jre7\lib\jce.jar;C:\Program Files\Java\jre7\lib\charsets.jar;C:\Program Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found. 
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335) at scala.reflect.api.Universe.typeOf(Universe.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94) at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33) at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111) at com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43) at scala.Function0$class.apply$mcV$sp(Function0.scala:40) at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12) at scala.App$$anonfun$main$1.apply(App.scala:71) at scala.App$$anonfun$main$1.apply(App.scala:71) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32) at scala.App$class.main(App.scala:71) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
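The sbt workaround mentioned in the comment amounts to a one-line build setting (a sketch of a build.sbt fragment; the memory option is only an example):
{code:scala}
// Forking a separate JVM lets the primordial classloader see scala-reflect
// and the Spark jars, avoiding the MissingRequirementError.
fork := true

// Optional: give the forked JVM more memory if the application needs it.
javaOptions ++= Seq("-Xmx2G")
{code}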
[jira] [Updated] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself
[ https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-6514: Target Version/s: 1.4.0 (was: 1.3.1) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself Key: SPARK-6514 URL: https://issues.apache.org/jira/browse/SPARK-6514 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Chris Fregly this was not supported when i originally wrote this receiver. this is now supported. also, upgrade to the latest Kinesis Client Library (KCL) which is 1.2, i believe. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6734) Support GenericUDTF.close for Generate
[ https://issues.apache.org/jira/browse/SPARK-6734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482567#comment-14482567 ] Apache Spark commented on SPARK-6734: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/5383 Support GenericUDTF.close for Generate -- Key: SPARK-6734 URL: https://issues.apache.org/jira/browse/SPARK-6734 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Some third-party UDTF extension, will generate more rows in the GenericUDTF.close() method, which is supported by Hive. https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF However, Spark SQL ignores the GenericUDTF.close(), and it causes bug while porting job from Hive to Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6734) Support GenericUDTF.close for Generate
[ https://issues.apache.org/jira/browse/SPARK-6734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6734: --- Assignee: Apache Spark Support GenericUDTF.close for Generate -- Key: SPARK-6734 URL: https://issues.apache.org/jira/browse/SPARK-6734 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Apache Spark Some third-party UDTF extension, will generate more rows in the GenericUDTF.close() method, which is supported by Hive. https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF However, Spark SQL ignores the GenericUDTF.close(), and it causes bug while porting job from Hive to Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6734) Support GenericUDTF.close for Generate
Cheng Hao created SPARK-6734: Summary: Support GenericUDTF.close for Generate Key: SPARK-6734 URL: https://issues.apache.org/jira/browse/SPARK-6734 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Some third-party UDTF extensions generate additional rows in the GenericUDTF.close() method, which is supported by Hive. https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF However, Spark SQL ignores GenericUDTF.close(), which causes bugs when porting jobs from Hive to Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
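To make the behavior concrete, here is a hypothetical UDTF written against the Hive API (a sketch only, not code from Spark or Hive): it forwards one output row per input row and emits a final summary row from close(). Hive invokes close() after the last process() call, so that final row is silently dropped when the engine ignores close().
{code:scala}
import java.util.Arrays
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, ObjectInspectorFactory, StructObjectInspector}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

// Hypothetical one-column UDTF used only to illustrate rows produced in close().
class ExplodeWithTotal extends GenericUDTF {
  private var count = 0L

  override def initialize(argOIs: Array[ObjectInspector]): StructObjectInspector =
    ObjectInspectorFactory.getStandardStructObjectInspector(
      Arrays.asList("value"),
      Arrays.asList[ObjectInspector](PrimitiveObjectInspectorFactory.javaStringObjectInspector))

  override def process(args: Array[AnyRef]): Unit = {
    count += 1
    forward(Array[AnyRef](String.valueOf(args(0))))   // one output row per input row
  }

  override def close(): Unit = {
    // This row exists only because of close(); an engine that skips close() never emits it.
    forward(Array[AnyRef]("total=" + count))
  }
}
{code}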
[jira] [Assigned] (SPARK-6734) Support GenericUDTF.close for Generate
[ https://issues.apache.org/jira/browse/SPARK-6734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6734: --- Assignee: (was: Apache Spark) Support GenericUDTF.close for Generate -- Key: SPARK-6734 URL: https://issues.apache.org/jira/browse/SPARK-6734 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Some third-party UDTF extension, will generate more rows in the GenericUDTF.close() method, which is supported by Hive. https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF However, Spark SQL ignores the GenericUDTF.close(), and it causes bug while porting job from Hive to Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6733) Suppression of usage of Scala existential code should be done
[ https://issues.apache.org/jira/browse/SPARK-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6733: --- Assignee: (was: Apache Spark) Suppression of usage of Scala existential code should be done - Key: SPARK-6733 URL: https://issues.apache.org/jira/browse/SPARK-6733 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 1.3.0 Environment: OS: OSX Yosemite Hardware: Intel Core i7 with 16 GB RAM Reporter: Raymond Tay The inclusion of this statement in the file {code:scala} import scala.language.existentials {code} should have suppressed all warnings regarding the use of scala existential code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6733) Suppression of usage of Scala existential code should be done
[ https://issues.apache.org/jira/browse/SPARK-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482630#comment-14482630 ] Apache Spark commented on SPARK-6733: - User 'vinodkc' has created a pull request for this issue: https://github.com/apache/spark/pull/5384 Suppression of usage of Scala existential code should be done - Key: SPARK-6733 URL: https://issues.apache.org/jira/browse/SPARK-6733 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 1.3.0 Environment: OS: OSX Yosemite Hardware: Intel Core i7 with 16 GB RAM Reporter: Raymond Tay The inclusion of this statement in the file {code:scala} import scala.language.existentials {code} should have suppressed all warnings regarding the use of scala existential code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6733) Suppression of usage of Scala existential code should be done
[ https://issues.apache.org/jira/browse/SPARK-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6733: --- Assignee: Apache Spark Suppression of usage of Scala existential code should be done - Key: SPARK-6733 URL: https://issues.apache.org/jira/browse/SPARK-6733 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 1.3.0 Environment: OS: OSX Yosemite Hardware: Intel Core i7 with 16 GB RAM Reporter: Raymond Tay Assignee: Apache Spark The inclusion of this statement in the file {code:scala} import scala.language.existentials {code} should have suppressed all warnings regarding the use of scala existential code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6630) SparkConf.setIfMissing should only evaluate the assigned value if indeed missing
[ https://issues.apache.org/jira/browse/SPARK-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481055#comment-14481055 ] Svend Vanderveken commented on SPARK-6630: -- Oh, ok. For the record (and my education...), could you clarify how this breaks binary compatibility? Do you mean that client code written against an older version of Spark would no longer work on this version? SparkConf.setIfMissing should only evaluate the assigned value if indeed missing Key: SPARK-6630 URL: https://issues.apache.org/jira/browse/SPARK-6630 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Svend Vanderveken Priority: Minor The method setIfMissing() in SparkConf currently always evaluates the right-hand side of the assignment, even when it is not used. This leads to unnecessary computation, as in {code} conf.setIfMissing("spark.driver.host", Utils.localHostName()) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6673) spark-shell.cmd can't start even when spark was built in Windows
[ https://issues.apache.org/jira/browse/SPARK-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6673: - Target Version/s: 1.4.0 (was: 1.3.1, 1.4.0) spark-shell.cmd can't start even when spark was built in Windows Key: SPARK-6673 URL: https://issues.apache.org/jira/browse/SPARK-6673 Project: Spark Issue Type: Bug Components: Windows Affects Versions: 1.3.0 Reporter: Masayoshi TSUZUKI Assignee: Masayoshi TSUZUKI Priority: Blocker Fix For: 1.4.0 spark-shell.cmd can't start. {code} bin\spark-shell.cmd --master local {code} will get {code} Failed to find Spark assembly JAR. You need to build Spark before running this program. {code} even when we have built spark. This is because of the lack of the environment {{SPARK_SCALA_VERSION}} which is used in {{spark-class2.cmd}}. In linux scripts, this value is set as {{2.10}} or {{2.11}} by default in {{load-spark-env.sh}}, but there are no equivalent script in Windows. As workaround, by executing {code} set SPARK_SCALA_VERSION=2.10 {code} before execute spark-shell.cmd, we can successfully start it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14480983#comment-14480983 ] Yu Ishikawa commented on SPARK-6682: I got it. I think the only way to realize an automatic mechanism is to execute builder methods in Scala/Java from Python. That is, we should make a wrapper mechanism for the machine learning algorithms like the python's `JavaModelWrapper`. However, I don't think that is not good idea very much because of the readability of the code and the documentation. - Pros -- We don't need to implement builder methods in Python, once we Implement them in Scala/Java. - Cons -- Python's documentation about builder methods is not generated because of not implementing them in Python. Deprecate static train and use builder instead for Scala/Java - Key: SPARK-6682 URL: https://issues.apache.org/jira/browse/SPARK-6682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. Old static train() API: {code} val myModel = NaiveBayes.train(myData, ...) {code} New builder pattern API: {code} val nb = new NaiveBayes().setLambda(0.1) val myModel = nb.train(myData) {code} Pros of the builder pattern: * Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). * Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons of the builder pattern: * In Python APIs, static train methods are more Pythonic. Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. * Python: Keep static train methods. CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6700) flaky test: run Python application in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14480988#comment-14480988 ] Lianhui Wang commented on SPARK-6700: - i do not think this is related to SPARK-6506 because YarnClusterSuite setted SPARK_HOME. Just now i run YarnClusterSuite test,but i got python application test in YarnClusterSuite is successfully.[~davies] can you report your unit-test.log or appMaster.log? flaky test: run Python application in yarn-cluster mode Key: SPARK-6700 URL: https://issues.apache.org/jira/browse/SPARK-6700 Project: Spark Issue Type: Bug Components: Tests Reporter: Davies Liu Assignee: Lianhui Wang Priority: Critical Labels: test, yarn org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-cluster mode Failing for the past 1 build (Since Failed#2025 ) Took 12 sec. Error Message {code} Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1 Stacktrace sbt.ForkMain$ForkError: Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122) at org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at
[jira] [Resolved] (SPARK-6673) spark-shell.cmd can't start even when spark was built in Windows
[ https://issues.apache.org/jira/browse/SPARK-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6673. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5328 [https://github.com/apache/spark/pull/5328] spark-shell.cmd can't start even when spark was built in Windows Key: SPARK-6673 URL: https://issues.apache.org/jira/browse/SPARK-6673 Project: Spark Issue Type: Bug Components: Windows Affects Versions: 1.3.0 Reporter: Masayoshi TSUZUKI Assignee: Masayoshi TSUZUKI Priority: Blocker Fix For: 1.4.0 spark-shell.cmd can't start. {code} bin\spark-shell.cmd --master local {code} will get {code} Failed to find Spark assembly JAR. You need to build Spark before running this program. {code} even when we have built spark. This is because of the lack of the environment {{SPARK_SCALA_VERSION}} which is used in {{spark-class2.cmd}}. In linux scripts, this value is set as {{2.10}} or {{2.11}} by default in {{load-spark-env.sh}}, but there are no equivalent script in Windows. As workaround, by executing {code} set SPARK_SCALA_VERSION=2.10 {code} before execute spark-shell.cmd, we can successfully start it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6700) flaky test: run Python application in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14480988#comment-14480988 ] Lianhui Wang edited comment on SPARK-6700 at 4/6/15 6:49 AM: - i do not think this is related to SPARK-6506 because YarnClusterSuite setted SPARK_HOME. Just now i run YarnClusterSuite test,but i got python application test in YarnClusterSuite is successfully.[~davies] can you report your unit-test.log or appMaster.log? in addition, i think you can try again because there maybe has other errors to cause it failed. was (Author: lianhuiwang): i do not think this is related to SPARK-6506 because YarnClusterSuite setted SPARK_HOME. Just now i run YarnClusterSuite test,but i got python application test in YarnClusterSuite is successfully.[~davies] can you report your unit-test.log or appMaster.log? flaky test: run Python application in yarn-cluster mode Key: SPARK-6700 URL: https://issues.apache.org/jira/browse/SPARK-6700 Project: Spark Issue Type: Bug Components: Tests Reporter: Davies Liu Assignee: Lianhui Wang Priority: Critical Labels: test, yarn org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-cluster mode Failing for the past 1 build (Since Failed#2025 ) Took 12 sec. Error Message {code} Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1 Stacktrace sbt.ForkMain$ForkError: Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122) at org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
[jira] [Resolved] (SPARK-6687) In the hadoop 0.23 profile, hadoop pulls in an older version of netty which conflicts with akka's netty
[ https://issues.apache.org/jira/browse/SPARK-6687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6687. -- Resolution: Not A Problem I'm not sure what the problem is here, so closing until there's any follow up. In the hadoop 0.23 profile, hadoop pulls in an older version of netty which conflicts with akka's netty Key: SPARK-6687 URL: https://issues.apache.org/jira/browse/SPARK-6687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Sai Nishanth Parepally excerpt from mvn -Dverbose dependency:tree of spark-core, note the org.jboss.netty:netty dependency: [INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:0.23.10:compile [INFO] | | | +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:0.23.10:compile [INFO] | | | | +- (org.apache.hadoop:hadoop-yarn-common:jar:0.23.10:compile - omitted for duplicate) [INFO] | | | | +- (org.apache.hadoop:hadoop-mapreduce-client-core:jar:0.23.10:compile - omitted for duplicate) [INFO] | | | | +- org.apache.hadoop:hadoop-yarn-server-common:jar:0.23.10:compile [INFO] | | | | | +- (org.apache.hadoop:hadoop-yarn-common:jar:0.23.10:compile - omitted for duplicate) [INFO] | | | | | +- (org.apache.zookeeper:zookeeper:jar:3.4.5:compile - version managed from 3.4.2; omitted for duplicate) [INFO] | | | | | +- (org.slf4j:slf4j-api:jar:1.7.10:compile - version managed from 1.6.1; omitted for duplicate) [INFO] | | | | | +- (org.slf4j:slf4j-log4j12:jar:1.7.10:compile - version managed from 1.6.1; omitted for duplicate) [INFO] | | | | | +- (org.jboss.netty:netty:jar:3.2.4.Final:compile - omitted for duplicate) [INFO] | | | | | +- (com.google.protobuf:protobuf-java:jar:2.4.0a:compile - omitted for duplicate) [INFO] | | | | | +- (commons-io:commons-io:jar:2.1:compile - omitted for duplicate) [INFO] | | | | | +- (com.google.inject:guice:jar:3.0:compile - omitted for duplicate) [INFO] | | | | | +- (com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.8:compile - omitted for duplicate) [INFO] | | | | | +- (com.sun.jersey:jersey-server:jar:1.8:compile - omitted for duplicate) [INFO] | | | | | \- (com.sun.jersey.contribs:jersey-guice:jar:1.8:compile - omitted for duplicate) [INFO] | | | | +- (com.google.protobuf:protobuf-java:jar:2.4.0a:compile - omitted for duplicate) [INFO] | | | | +- (org.slf4j:slf4j-api:jar:1.7.10:compile - version managed from 1.6.1; omitted for duplicate) [INFO] | | | | +- (org.slf4j:slf4j-log4j12:jar:1.7.10:compile - version managed from 1.6.1; omitted for duplicate) [INFO] | | | | +- (org.apache.hadoop:hadoop-hdfs:jar:1.23.10:compile - omitted for duplicate) [INFO] | | | | \- (org.jboss.netty:netty:jar:3.2.4.Final:compile - omitted for duplicate) [INFO] | | | +- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:0.23.10:compile [INFO] | | | | +- (org.apache.hadoop:hadoop-mapreduce-client-core:jar:0.23.10:compile - omitted for duplicate) [INFO] | | | | +- (com.google.protobuf:protobuf-java:jar:2.4.0a:compile - omitted for duplicate) [INFO] | | | | +- (org.slf4j:slf4j-api:jar:1.7.10:compile - version managed from 1.6.1; omitted for duplicate) [INFO] | | | | +- (org.slf4j:slf4j-log4j12:jar:1.7.10:compile - version managed from 1.6.1; omitted for duplicate) [INFO] | | | | +- (org.apache.hadoop:hadoop-hdfs:jar:0.23.10:compile - omitted for duplicate) [INFO] | | | | \- (org.jboss.netty:netty:jar:3.2.4.Final:compile - omitted for duplicate) [INFO] | | | +- (com.google.protobuf:protobuf-java:jar:2.4.0a:compile - omitted for duplicate) [INFO] | 
| | +- (org.slf4j:slf4j-api:jar:1.7.10:compile - version managed from 1.6.1; omitted for duplicate) [INFO] | | | +- (org.slf4j:slf4j-log4j12:jar:1.7.10:compile - version managed from 1.6.1; omitted for duplicate) [INFO] | | | +- (org.apache.hadoop:hadoop-hdfs:jar:0.23.10:compile - omitted for duplicate) [INFO] | | | \- org.jboss.netty:netty:jar:3.2.4.Final:compile -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6630) SparkConf.setIfMissing should only evaluate the assigned value if indeed missing
[ https://issues.apache.org/jira/browse/SPARK-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481064#comment-14481064 ] Sean Owen commented on SPARK-6630: -- Yeah because the second argument becomes a Function producing a String, not a String. Code compiled against older versions of Spark are expected to run as much as possible on newer ones and the old code would not find the String method. We could add an overload, but then I am not sure what happens to the current code. I think code continues to bind to the String overload, defeating the purpose. SparkConf.setIfMissing should only evaluate the assigned value if indeed missing Key: SPARK-6630 URL: https://issues.apache.org/jira/browse/SPARK-6630 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Svend Vanderveken Priority: Minor The method setIfMissing() in SparkConf is currently systematically evaluating the right hand side of the assignment even if not used. This leads to unnecessary computation, like in the case of {code} conf.setIfMissing(spark.driver.host, Utils.localHostName()) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6719) Update spark.apache.org/mllib page to 1.3
Xiangrui Meng created SPARK-6719: Summary: Update spark.apache.org/mllib page to 1.3 Key: SPARK-6719 URL: https://issues.apache.org/jira/browse/SPARK-6719 Project: Spark Issue Type: Task Components: Documentation, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng The current web page is outdated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings
[ https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6569: - Priority: Trivial (was: Minor) Assignee: Platon Potapov Issue Type: Improvement (was: Bug) Kafka directInputStream logs what appear to be incorrect warnings - Key: SPARK-6569 URL: https://issues.apache.org/jira/browse/SPARK-6569 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Environment: Spark 1.3.0 Reporter: Platon Potapov Assignee: Platon Potapov Priority: Trivial Fix For: 1.4.0 During what appears to be normal operation of streaming from a Kafka topic, the following log records are observed, logged periodically: {code} [Stage 391:== (3 + 0) / 4] 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 {code} * the part.fromOffset placeholder is not correctly substituted to a value * is the condition really mandates a warning being logged? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
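The literal ${part.fromOffset} text in the log output suggests the message was written as a plain string literal rather than an s-interpolated one. A minimal illustration (not the actual KafkaRDD source; the Part case class is a hypothetical stand-in for a partition's offset range):
{code:scala}
case class Part(fromOffset: Long, untilOffset: Long)
val part = Part(42L, 42L)

// A plain literal leaves "${...}" as text; the s-interpolator substitutes the value.
val wrong = "Beginning offset ${part.fromOffset} is the same as ending offset"
val right = s"Beginning offset ${part.fromOffset} is the same as ending offset"
// wrong contains the literal placeholder, right contains "Beginning offset 42 ..."
{code}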
[jira] [Resolved] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings
[ https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6569. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5366 [https://github.com/apache/spark/pull/5366] Kafka directInputStream logs what appear to be incorrect warnings - Key: SPARK-6569 URL: https://issues.apache.org/jira/browse/SPARK-6569 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Environment: Spark 1.3.0 Reporter: Platon Potapov Priority: Minor Fix For: 1.4.0 During what appears to be normal operation of streaming from a Kafka topic, the following log records are observed, logged periodically: {code} [Stage 391:== (3 + 0) / 4] 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 {code} * the part.fromOffset placeholder is not correctly substituted to a value * is the condition really mandates a warning being logged? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6630) SparkConf.setIfMissing should only evaluate the assigned value if indeed missing
[ https://issues.apache.org/jira/browse/SPARK-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6630. -- Resolution: Won't Fix Idea was good, just probably can't be reconciled with binary compatibility at this point without significantly more change, so closing. If there's a particularly expensive computation we want to avoid, we can fix those directly by checking the property's existence first before computing and setting a new value. SparkConf.setIfMissing should only evaluate the assigned value if indeed missing Key: SPARK-6630 URL: https://issues.apache.org/jira/browse/SPARK-6630 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Svend Vanderveken Priority: Minor The method setIfMissing() in SparkConf is currently systematically evaluating the right hand side of the assignment even if not used. This leads to unnecessary computation, like in the case of {code} conf.setIfMissing(spark.driver.host, Utils.localHostName()) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
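The approach suggested in the resolution can be sketched as an explicit existence check before computing the expensive default (reusing the call from the issue description; Utils.localHostName() is Spark's internal helper):
{code:scala}
// Only compute the default when the key is really missing.
if (!conf.contains("spark.driver.host")) {
  conf.set("spark.driver.host", Utils.localHostName())
}

// A by-name overload such as setIfMissing(key: String, value: => String) would
// avoid the eager evaluation, but it changes the method signature, which is the
// binary-compatibility problem discussed above.
{code}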
[jira] [Commented] (SPARK-5261) In some cases, the value of a word's vector representation is too big
[ https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481132#comment-14481132 ] Sean Owen commented on SPARK-5261: -- In the new code you pasted, I don't see a difference between the two runs. Is the point that the result isn't deterministic even with a fixed seed? that it might be sensitive to the order in which it encounters the words? In some cases ,The value of word's vector representation is too big --- Key: SPARK-5261 URL: https://issues.apache.org/jira/browse/SPARK-5261 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Guoqiang Li Get data: {code:none} normalize_text() { awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e s/'/ ' /g -e s/“/\/g -e s/”/\/g \ -e 's// /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \ -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \ -e 's/«/ /g' | tr 0-9 } wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz gzip -d news.2013.en.shuffled.gz normalize_text news.2013.en.shuffled data.txt {code} {code:none} import org.apache.spark.mllib.feature.Word2Vec val text = sc.textFile(dataPath).map { t = t.split( ).toIterable } val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(5) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = res1: Float = 375059.84 val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(5) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = res3: Float = 1661285.2 {code} The average absolute value of the word's vector representation is 60731.8 {code} val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(1) {code} The average absolute value of the word's vector representation is 0.13889 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
[ https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6720: --- Assignee: (was: Apache Spark) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2 -- Key: SPARK-6720 URL: https://issues.apache.org/jira/browse/SPARK-6720 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Kai Sasaki Priority: Minor Fix For: 1.4.0 Implement correct normL1 and normL2 test. continuation: https://github.com/apache/spark/pull/5359 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
[ https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481176#comment-14481176 ] Apache Spark commented on SPARK-6720: - User 'Lewuathe' has created a pull request for this issue: https://github.com/apache/spark/pull/5374 PySpark MultivariateStatisticalSummary unit test for normL1 and normL2 -- Key: SPARK-6720 URL: https://issues.apache.org/jira/browse/SPARK-6720 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Kai Sasaki Priority: Minor Fix For: 1.4.0 Implement correct normL1 and normL2 test. continuation: https://github.com/apache/spark/pull/5359 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
[ https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6720: --- Assignee: Apache Spark PySpark MultivariateStatisticalSummary unit test for normL1 and normL2 -- Key: SPARK-6720 URL: https://issues.apache.org/jira/browse/SPARK-6720 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Kai Sasaki Assignee: Apache Spark Priority: Minor Fix For: 1.4.0 Implement correct normL1 and normL2 test. continuation: https://github.com/apache/spark/pull/5359 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2960) Spark executables fail to start via symlinks
[ https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481228#comment-14481228 ] Danil Mironov commented on SPARK-2960: -- This has now formed a loop of three tickets (SPARK-2960, SPARK-3482 and SPARK-4162), all three resolved as duplicates; two PRs (#1875 and #2386) are closed but not merged. Apparently this issue isn't progressing at all. Is there anything that can be done to break through? I could draft a new PR; can this ticket be re-opened? Spark executables fail to start via symlinks Key: SPARK-2960 URL: https://issues.apache.org/jira/browse/SPARK-2960 Project: Spark Issue Type: Bug Reporter: Shay Rojansky Priority: Minor The current scripts (e.g. pyspark) fail to run when they are executed via symlinks. A common Linux scenario would be to have Spark installed somewhere (e.g. /opt) and a symlink to it in /usr/bin. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
Kai Sasaki created SPARK-6720: - Summary: PySpark MultivariateStatisticalSummary unit test for normL1 and normL2 Key: SPARK-6720 URL: https://issues.apache.org/jira/browse/SPARK-6720 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Kai Sasaki Priority: Minor Fix For: 1.4.0 Implement correct normL1 and normL2 test. continuation: https://github.com/apache/spark/pull/5359 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
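For reference, the quantities the PySpark test needs to verify are already exposed by the Scala API; a small sketch (not from the ticket or PR) of what normL1 and normL2 mean per column, assuming a spark-shell session where sc is available:
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val data = sc.parallelize(Seq(
  Vectors.dense(1.0, -2.0),
  Vectors.dense(3.0, 4.0)))

val summary = Statistics.colStats(data)

// Per-column L1 norm: sum of absolute values, here [4.0, 6.0].
println(summary.normL1)
// Per-column L2 norm: sqrt of the sum of squares, here [sqrt(10), sqrt(20)].
println(summary.normL2)
{code}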
[jira] [Assigned] (SPARK-2991) RDD transforms for scan and scanLeft
[ https://issues.apache.org/jira/browse/SPARK-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2991: --- Assignee: Apache Spark (was: Erik Erlandson) RDD transforms for scan and scanLeft - Key: SPARK-2991 URL: https://issues.apache.org/jira/browse/SPARK-2991 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Erik Erlandson Assignee: Apache Spark Priority: Minor Labels: features Provide RDD transforms analogous to Scala scan(z)(f) (parallel prefix scan) and scanLeft(z)(f) (sequential prefix scan) Discussion of a scanLeft implementation: http://erikerlandson.github.io/blog/2014/08/09/implementing-an-rdd-scanleft-transform-with-cascade-rdds/ Discussion of scan: http://erikerlandson.github.io/blog/2014/08/12/implementing-parallel-prefix-scan-as-a-spark-rdd-transform/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-2991) RDD transforms for scan and scanLeft
[ https://issues.apache.org/jira/browse/SPARK-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2991: --- Assignee: Erik Erlandson (was: Apache Spark) RDD transforms for scan and scanLeft - Key: SPARK-2991 URL: https://issues.apache.org/jira/browse/SPARK-2991 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Erik Erlandson Assignee: Erik Erlandson Priority: Minor Labels: features Provide RDD transforms analogous to Scala scan(z)(f) (parallel prefix scan) and scanLeft(z)(f) (sequential prefix scan) Discussion of a scanLeft implementation: http://erikerlandson.github.io/blog/2014/08/09/implementing-an-rdd-scanleft-transform-with-cascade-rdds/ Discussion of scan: http://erikerlandson.github.io/blog/2014/08/12/implementing-parallel-prefix-scan-as-a-spark-rdd-transform/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
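To make the requested semantics concrete, this is what Scala's collection scanLeft does on a local sequence; the ticket asks for an analogous RDD transform (a sketch for illustration, not from the ticket):
{code}
// Sequential prefix scan on a local collection: each element is the running
// fold of everything before it, starting from the zero element.
val xs = List(1, 2, 3, 4)
val prefixSums = xs.scanLeft(0)(_ + _)
// prefixSums == List(0, 1, 3, 6, 10)

// The proposal is rdd.scanLeft(z)(f) / rdd.scan(z)(f) with the same contract,
// implemented over partitions (see the linked blog posts for the approach).
{code}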
[jira] [Updated] (SPARK-6205) UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError
[ https://issues.apache.org/jira/browse/SPARK-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6205: - Fix Version/s: 1.3.2 UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError --- Key: SPARK-6205 URL: https://issues.apache.org/jira/browse/SPARK-6205 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.3.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.3.2, 1.4.0 {code} mvn -DskipTests -Pyarn -Phive -Phadoop-2.4 -Dhadoop.version=2.6.0 clean install mvn -Pyarn -Phive -Phadoop-2.4 -Dhadoop.version=2.6.0 test -DwildcardSuites=org.apache.spark.ui.UISeleniumSuite -Dtest=none -pl core/ {code} will produce: {code} UISeleniumSuite: *** RUN ABORTED *** java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal ... {code} It doesn't seem to happen without the various profiles set above. The fix is simple, although sounds weird; Selenium's dependency on {{xml-apis:xml-apis}} must be manually included in core's test dependencies. This probably has something to do with Hadoop 2 vs 1 dependency changes and the fact that Maven test deps aren't transitive, AFAIK. PR coming... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
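The fix described above adds Selenium's {{xml-apis:xml-apis}} artifact to core's test scope; the actual change is in core's Maven POM, but expressed in sbt syntax it amounts to something like the following (the version shown is an assumption, not taken from the ticket or PR):
{code}
// build.sbt sketch only; the real fix lives in core/pom.xml.
// "1.4.01" is an assumed version, not taken from the ticket or PR.
libraryDependencies += "xml-apis" % "xml-apis" % "1.4.01" % "test"
{code}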
[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models
[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481342#comment-14481342 ] Peter Rudenko commented on SPARK-3702: -- For trees based algorithms curious whether there would be performance benefit by passing directly Dataframe columns rather than single column with vector type. E.g.: {code} class GBT extends Estimator with HasInputCols val model = new GBT.setInputCols(col1,col2, col3, ...) {code} Standardize MLlib classes for learners, models -- Key: SPARK-3702 URL: https://issues.apache.org/jira/browse/SPARK-3702 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Blocker Summary: Create a class hierarchy for learning algorithms and the models those algorithms produce. This is a super-task of several sub-tasks (but JIRA does not allow subtasks of subtasks). See the requires links below for subtasks. Goals: * give intuitive structure to API, both for developers and for generated documentation * support meta-algorithms (e.g., boosting) * support generic functionality (e.g., evaluation) * reduce code duplication across classes [Design doc for class hierarchy | https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
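Peter's snippet lost its punctuation in the archive; for contrast, here is a sketch (illustration only, with df, label and col1..col3 assumed) of the single-vector-column pattern he is asking whether tree learners could avoid by reading DataFrame columns directly via something like setInputCols.
{code}
// Status quo being contrasted with: tree learners take one vector-typed
// column, so individual DataFrame columns are packed into a Vector first.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val training = df.select("label", "col1", "col2", "col3").map { row =>
  LabeledPoint(row.getDouble(0),
    Vectors.dense(row.getDouble(1), row.getDouble(2), row.getDouble(3)))
}
// Peter's question: could an estimator declare setInputCols("col1", ...) and
// read the columns directly, avoiding this packing step when finding splits?
{code}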
[jira] [Commented] (SPARK-6431) Couldn't find leader offsets exception when creating KafkaDirectStream
[ https://issues.apache.org/jira/browse/SPARK-6431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481266#comment-14481266 ] Cody Koeninger commented on SPARK-6431: --- I think this got mis-diagnosed on the mailing list, sorry for the confusion. The only way I've been able to reproduce that exception is by trying to start a stream for a topic that doesn't exist at all. Alberto, did you actually run kafka-topics.sh --create before starting the job, or in some other way create the topic? Pretty sure what happened here is that your topic didn't exist the first time you ran the job. Your brokers were set to auto-create topics, so it did exist the next time you ran the job. Putting a message into the topic didn't have anything to do with it. Here's why I think that's what happened. Following console session is an example, where empty topic existed prior to starting the console, but had no messages. Topic hasonemesssage existed and had one message in it. Topic doesntexistyet didn't exist at the beginning of the console. The metadata apis return the same info for existing-but-empty topics as they do for topics with messages in them: scala kc.getPartitions(Set(empty)).right res0: scala.util.Either.RightProjection[org.apache.spark.streaming.kafka.KafkaCluster.Err,Set[kafka.common.TopicAndPartition]] = RightProjection(Right( Set([empty,0], [empty,1]))) scala kc.getPartitions(Set(hasonemessage)).right res1: scala.util.Either.RightProjection[org.apache.spark.streaming.kafka.KafkaCluster.Err,Set[kafka.common.TopicAndPartition]] = RightProjection(Right(Set([hasonemessage,0], [hasonemessage,1]))) Leader offsets are both 0 for the empty topic, as you'd expect: scala kc.getLatestLeaderOffsets(kc.getPartitions(Set(empty)).right.get) res5: Either[org.apache.spark.streaming.kafka.KafkaCluster.Err,Map[kafka.common.TopicAndPartition,org.apache.spark.streaming.kafka.KafkaCluster.LeaderOffset]] = Right(Map([empty,1] - LeaderOffset(localhost,9094,0), [empty,0] - LeaderOffset(localhost,9093,0))) And one of the leader offsets is 1 for the topic with one message: scala kc.getLatestLeaderOffsets(kc.getPartitions(Set(hasonemessage)).right.get) res6: Either[org.apache.spark.streaming.kafka.KafkaCluster.Err,Map[kafka.common.TopicAndPartition,org.apache.spark.streaming.kafka.KafkaCluster.LeaderOffset]] = Right(Map([hasonemessage,0] - LeaderOffset(localhost,9092,1), [hasonemessage,1] - LeaderOffset(localhost,9093,0))) The first time a metadata request is made against the non-existing topic, it returns empty: kc.getPartitions(Set(doesntexistyet)).right res2: scala.util.Either.RightProjection[org.apache.spark.streaming.kafka.KafkaCluster.Err,Set[kafka.common.TopicAndPartition]] = RightProjection(Right(Set())) But if your brokers are configured with auto.create.topics.enable set to true, that metadata request alone is enough to trigger creation of the topic. Requesting it again shows that the topic has been created: scala kc.getPartitions(Set(doesntexistyet)).right res3: scala.util.Either.RightProjection[org.apache.spark.streaming.kafka.KafkaCluster.Err,Set[kafka.common.TopicAndPartition]] = RightProjection(Right(Set([doesntexistyet,0], [doesntexistyet,1]))) If you don't think that explains what happened, please let me know if you have a way of reproducing that exception against an existing-but-empty topic, because I cant. As far as what to do about this, my instinct is to just improve the error handling for the getPartitions call. 
If the topic doesn't exist yet, it shouldn't be returning an empty set; it should be returning an error. Couldn't find leader offsets exception when creating KafkaDirectStream -- Key: SPARK-6431 URL: https://issues.apache.org/jira/browse/SPARK-6431 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Alberto When I try to create an InputDStream using the createDirectStream method of the KafkaUtils class and the Kafka topic does not have any messages yet, I get the following error: org.apache.spark.SparkException: Couldn't find leader offsets for Set() org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't find leader offsets for Set() at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:413) If I put a message in the topic before creating the DirectStream, everything works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
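A sketch (not from a PR) of the error-handling improvement Cody suggests, using only the KafkaCluster calls shown in the console session above; kc is a KafkaCluster and topics a Set[String]:
{code}
// getPartitions returns Either[Err, Set[TopicAndPartition]], as shown above.
val partitions = kc.getPartitions(topics).fold(
  errs => throw new org.apache.spark.SparkException(errs.mkString("\n")),
  parts => parts)

// The suggested improvement: an empty set here means the topics don't exist
// (yet), so fail loudly instead of continuing with Set().
require(partitions.nonEmpty,
  s"Couldn't find any partitions for $topics -- do the topics exist?")
{code}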
[jira] [Comment Edited] (SPARK-3702) Standardize MLlib classes for learners, models
[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481342#comment-14481342 ] Peter Rudenko edited comment on SPARK-3702 at 4/6/15 4:06 PM: -- For trees based algorithms curious whether there would be performance benefit (assuming reimplementation of Decision tree) by passing directly Dataframe columns rather than single column with vector type. E.g.: {code} class GBT extends Estimator with HasInputCols val model = new GBT.setInputCols(col1,col2, col3, ...) {code} and split dataset using dataframe api. was (Author: prudenko): For trees based algorithms curious whether there would be performance benefit by passing directly Dataframe columns rather than single column with vector type. E.g.: {code} class GBT extends Estimator with HasInputCols val model = new GBT.setInputCols(col1,col2, col3, ...) {code} Standardize MLlib classes for learners, models -- Key: SPARK-3702 URL: https://issues.apache.org/jira/browse/SPARK-3702 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Blocker Summary: Create a class hierarchy for learning algorithms and the models those algorithms produce. This is a super-task of several sub-tasks (but JIRA does not allow subtasks of subtasks). See the requires links below for subtasks. Goals: * give intuitive structure to API, both for developers and for generated documentation * support meta-algorithms (e.g., boosting) * support generic functionality (e.g., evaluation) * reduce code duplication across classes [Design doc for class hierarchy | https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5261) In some cases, the value of a word's vector representation is too big
[ https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-5261: --- Description: Get data: {code:none} normalize_text() { awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e s/'/ ' /g -e s/“/\/g -e s/”/\/g \ -e 's// /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \ -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \ -e 's/«/ /g' | tr 0-9 } wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz gzip -d news.2013.en.shuffled.gz normalize_text news.2013.en.shuffled data.txt {code} {code:none} import org.apache.spark.mllib.feature.Word2Vec val text = sc.textFile(dataPath).map { t = t.split( ).toIterable } val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(100) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = res1: Float = 375059.84 val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(5) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = res3: Float = 1661285.2 {code} The average absolute value of the word's vector representation is 60731.8 {code} val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(1) {code} The average absolute value of the word's vector representation is 0.13889 was: Get data: {code:none} normalize_text() { awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e s/'/ ' /g -e s/“/\/g -e s/”/\/g \ -e 's// /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \ -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \ -e 's/«/ /g' | tr 0-9 } wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz gzip -d news.2013.en.shuffled.gz normalize_text news.2013.en.shuffled data.txt {code} {code:none} import org.apache.spark.mllib.feature.Word2Vec val text = sc.textFile(dataPath).map { t = t.split( ).toIterable } val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(5) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = res1: Float = 375059.84 val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(5) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = res3: Float = 1661285.2 {code} The average absolute value of the word's vector representation is 60731.8 {code} val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). 
setNumPartitions(1) {code} The average absolute value of the word's vector representation is 0.13889 In some cases ,The value of word's vector representation is too big --- Key: SPARK-5261 URL: https://issues.apache.org/jira/browse/SPARK-5261 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Guoqiang Li Get data: {code:none} normalize_text() { awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e s/'/ ' /g -e s/“/\/g -e s/”/\/g \ -e 's// /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \ -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \ -e 's/«/ /g' | tr 0-9 } wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz gzip -d news.2013.en.shuffled.gz normalize_text news.2013.en.shuffled data.txt {code} {code:none} import org.apache.spark.mllib.feature.Word2Vec val text = sc.textFile(dataPath).map { t = t.split( ).toIterable } val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(100) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = res1: Float = 375059.84 val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(5)
[jira] [Updated] (SPARK-5261) In some cases, the value of a word's vector representation is too big
[ https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-5261: --- Description: Get data: {code:none} normalize_text() { awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e s/'/ ' /g -e s/“/\/g -e s/”/\/g \ -e 's// /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \ -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \ -e 's/«/ /g' | tr 0-9 } wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz gzip -d news.2013.en.shuffled.gz normalize_text news.2013.en.shuffled data.txt {code} {code:none} import org.apache.spark.mllib.feature.Word2Vec val text = sc.textFile(dataPath).map { t = t.split( ).toIterable } val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(5) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = res1: Float = 375059.84 val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(100) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = res3: Float = 1661285.2 val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(1) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = 0.13889 {code} was: Get data: {code:none} normalize_text() { awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e s/'/ ' /g -e s/“/\/g -e s/”/\/g \ -e 's// /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \ -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \ -e 's/«/ /g' | tr 0-9 } wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz gzip -d news.2013.en.shuffled.gz normalize_text news.2013.en.shuffled data.txt {code} {code:none} import org.apache.spark.mllib.feature.Word2Vec val text = sc.textFile(dataPath).map { t = t.split( ).toIterable } val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(100) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = res1: Float = 375059.84 val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(5) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = res3: Float = 1661285.2 val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). 
setNumPartitions(1) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = 0.13889 {code} In some cases ,The value of word's vector representation is too big --- Key: SPARK-5261 URL: https://issues.apache.org/jira/browse/SPARK-5261 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Guoqiang Li Get data: {code:none} normalize_text() { awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e s/'/ ' /g -e s/“/\/g -e s/”/\/g \ -e 's// /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \ -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \ -e 's/«/ /g' | tr 0-9 } wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz gzip -d news.2013.en.shuffled.gz normalize_text news.2013.en.shuffled data.txt {code} {code:none} import org.apache.spark.mllib.feature.Word2Vec val text = sc.textFile(dataPath).map { t = t.split( ).toIterable } val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(5) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = res1: Float = 375059.84 val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(100) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 /
[jira] [Commented] (SPARK-6577) SparseMatrix should be supported in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481395#comment-14481395 ] Manoj Kumar commented on SPARK-6577: Let us please take the discussion to the Pull Request. Thanks! SparseMatrix should be supported in PySpark --- Key: SPARK-6577 URL: https://issues.apache.org/jira/browse/SPARK-6577 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Manoj Kumar Assignee: Manoj Kumar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
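For context, the Scala-side constructor being mirrored in PySpark; a small sketch (not from the PR) of building a SparseMatrix in compressed sparse column form:
{code}
import org.apache.spark.mllib.linalg.Matrices

// 3 x 2 matrix in CSC form: colPtrs delimits each column's entries,
// rowIndices/values hold the nonzeros. Entries: (1,0) -> 3.0 and (2,1) -> 4.0.
val sm = Matrices.sparse(3, 2, Array(0, 1, 2), Array(1, 2), Array(3.0, 4.0))
println(sm)
{code}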
[jira] [Commented] (SPARK-5261) In some cases, the value of a word's vector representation is too big
[ https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481378#comment-14481378 ] Guoqiang Li commented on SPARK-5261: I'm sorry, the after one 's mincount is 100 In some cases ,The value of word's vector representation is too big --- Key: SPARK-5261 URL: https://issues.apache.org/jira/browse/SPARK-5261 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Guoqiang Li Get data: {code:none} normalize_text() { awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e s/'/ ' /g -e s/“/\/g -e s/”/\/g \ -e 's// /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \ -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \ -e 's/«/ /g' | tr 0-9 } wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz gzip -d news.2013.en.shuffled.gz normalize_text news.2013.en.shuffled data.txt {code} {code:none} import org.apache.spark.mllib.feature.Word2Vec val text = sc.textFile(dataPath).map { t = t.split( ).toIterable } val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(100) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = res1: Float = 375059.84 val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(5) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = res3: Float = 1661285.2 {code} The average absolute value of the word's vector representation is 60731.8 {code} val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(1) {code} The average absolute value of the word's vector representation is 0.13889 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5261) In some cases, the value of a word's vector representation is too big
[ https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-5261: --- Description: Get data: {code:none} normalize_text() { awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e s/'/ ' /g -e s/“/\/g -e s/”/\/g \ -e 's// /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \ -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \ -e 's/«/ /g' | tr 0-9 } wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz gzip -d news.2013.en.shuffled.gz normalize_text news.2013.en.shuffled data.txt {code} {code:none} import org.apache.spark.mllib.feature.Word2Vec val text = sc.textFile(dataPath).map { t = t.split( ).toIterable } val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(100) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = res1: Float = 375059.84 val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(5) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = res3: Float = 1661285.2 val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(1) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = 0.13889 {code} was: Get data: {code:none} normalize_text() { awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e s/'/ ' /g -e s/“/\/g -e s/”/\/g \ -e 's// /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \ -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \ -e 's/«/ /g' | tr 0-9 } wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz gzip -d news.2013.en.shuffled.gz normalize_text news.2013.en.shuffled data.txt {code} {code:none} import org.apache.spark.mllib.feature.Word2Vec val text = sc.textFile(dataPath).map { t = t.split( ).toIterable } val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(100) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = res1: Float = 375059.84 val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(5) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = res3: Float = 1661285.2 {code} The average absolute value of the word's vector representation is 60731.8 {code} val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). 
setNumPartitions(1) {code} The average absolute value of the word's vector representation is 0.13889 In some cases ,The value of word's vector representation is too big --- Key: SPARK-5261 URL: https://issues.apache.org/jira/browse/SPARK-5261 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Guoqiang Li Get data: {code:none} normalize_text() { awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e s/'/ ' /g -e s/“/\/g -e s/”/\/g \ -e 's// /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \ -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \ -e 's/«/ /g' | tr 0-9 } wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz gzip -d news.2013.en.shuffled.gz normalize_text news.2013.en.shuffled data.txt {code} {code:none} import org.apache.spark.mllib.feature.Word2Vec val text = sc.textFile(dataPath).map { t = t.split( ).toIterable } val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(100) val model = word2Vec.fit(text) model.getVectors.map { t = t._2.map(_.abs).sum }.sum / 100 / model.getVectors.size = res1: Float = 375059.84 val word2Vec = new Word2Vec() word2Vec. setVectorSize(100). setSeed(42L). setNumIterations(5). setNumPartitions(36). setMinCount(5) val model = word2Vec.fit(text)