[jira] [Assigned] (SPARK-8374) Job frequently hangs after YARN preemption
[ https://issues.apache.org/jira/browse/SPARK-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8374: --- Assignee: (was: Apache Spark) Job frequently hangs after YARN preemption -- Key: SPARK-8374 URL: https://issues.apache.org/jira/browse/SPARK-8374 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.0 Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04 Reporter: Shay Rojansky Priority: Critical After upgrading to Spark 1.4.0, jobs that get preempted very frequently will not reacquire executors and will therefore hang. To reproduce: 1. I run Spark job A that acquires all grid resources 2. I run Spark job B in a higher-priority queue that acquires all grid resources. Job A is fully preempted. 3. Kill job B, releasing all resources 4. Job A should at this point reacquire all grid resources, but occasionally doesn't. Repeating the preemption scenario makes the bad behavior occur within a few attempts (see logs at bottom). Note issue SPARK-7451, which was supposed to fix some Spark YARN preemption issues; the work there may be related to these new issues. The 1.4.0 preemption situation is considerably worse than 1.3.1 (we've downgraded to 1.3.1 just because of this issue). Logs -- When job B (the preemptor) first acquires an application master, the following is logged by job A (the preemptee): {noformat} ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc client disassociated INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 
WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost) INFO DAGScheduler: Executor lost: 447 (epoch 0) INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from BlockManagerMaster. INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, g023.grid.eaglerd.local, 41406) INFO BlockManagerMaster: Removed 447 successfully in removeExecutor {noformat} (It's strange for errors/warnings to be logged for preemption) Later, when job B's AM starts requesting its resources, I get lots of the following in job A: {noformat} ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc client disassociated INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0 WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost) WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. {noformat} Finally, when I kill job B, job A emits lots of the following: {noformat} INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31 WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist! {noformat} And finally after some time: {noformat} WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 165964 ms exceeds timeout 12 ms ERROR YarnScheduler: Lost an executor 466 (already removed): Executor heartbeat timed out after 165964 ms {noformat} At this point the job never requests/acquires more resources and hangs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8374) Job frequently hangs after YARN preemption
[ https://issues.apache.org/jira/browse/SPARK-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8374: --- Assignee: Apache Spark Job frequently hangs after YARN preemption -- Key: SPARK-8374 URL: https://issues.apache.org/jira/browse/SPARK-8374 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.0 Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04 Reporter: Shay Rojansky Assignee: Apache Spark Priority: Critical After upgrading to Spark 1.4.0, jobs that get preempted very frequently will not reacquire executors and will therefore hang. To reproduce: 1. I run Spark job A that acquires all grid resources 2. I run Spark job B in a higher-priority queue that acquires all grid resources. Job A is fully preempted. 3. Kill job B, releasing all resources 4. Job A should at this point reacquire all grid resources, but occasionally doesn't. Repeating the preemption scenario makes the bad behavior occur within a few attempts (see logs at bottom). Note issue SPARK-7451, which was supposed to fix some Spark YARN preemption issues; the work there may be related to these new issues. The 1.4.0 preemption situation is considerably worse than 1.3.1 (we've downgraded to 1.3.1 just because of this issue). Logs -- When job B (the preemptor) first acquires an application master, the following is logged by job A (the preemptee): {noformat} ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc client disassociated INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 
WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost) INFO DAGScheduler: Executor lost: 447 (epoch 0) INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from BlockManagerMaster. INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, g023.grid.eaglerd.local, 41406) INFO BlockManagerMaster: Removed 447 successfully in removeExecutor {noformat} (It's strange for errors/warnings to be logged for preemption) Later, when job B's AM starts requesting its resources, I get lots of the following in job A: {noformat} ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc client disassociated INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0 WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost) WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. {noformat} Finally, when I kill job B, job A emits lots of the following: {noformat} INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31 WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist! {noformat} And finally after some time: {noformat} WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 165964 ms exceeds timeout 12 ms ERROR YarnScheduler: Lost an executor 466 (already removed): Executor heartbeat timed out after 165964 ms {noformat} At this point the job never requests/acquires more resources and hangs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8704) Add additional methods to wrappers in ml.pyspark.feature
[ https://issues.apache.org/jira/browse/SPARK-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8704: --- Assignee: Apache Spark Add additional methods to wrappers in ml.pyspark.feature Key: SPARK-8704 URL: https://issues.apache.org/jira/browse/SPARK-8704 Project: Spark Issue Type: New Feature Components: ML, PySpark Reporter: Manoj Kumar Assignee: Apache Spark Add std and mean to StandardScalerModel; getVectors and findSynonyms to Word2VecModel; setFeatures and getFeatures to HashingTF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8704) Add additional methods to wrappers in ml.pyspark.feature
[ https://issues.apache.org/jira/browse/SPARK-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8704: --- Assignee: (was: Apache Spark) Add additional methods to wrappers in ml.pyspark.feature Key: SPARK-8704 URL: https://issues.apache.org/jira/browse/SPARK-8704 Project: Spark Issue Type: New Feature Components: ML, PySpark Reporter: Manoj Kumar Add std and mean to StandardScalerModel; getVectors and findSynonyms to Word2VecModel; setFeatures and getFeatures to HashingTF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8337) KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version
[ https://issues.apache.org/jira/browse/SPARK-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605588#comment-14605588 ] Juan Rodríguez Hortalá commented on SPARK-8337: --- Hi [~jerryshao], That is a good idea; I should have paid more attention to the discussion in the duplicated issue. I will try it that way and tell you how it goes. Greetings, Juan KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version -- Key: SPARK-8337 URL: https://issues.apache.org/jira/browse/SPARK-8337 Project: Spark Issue Type: Bug Components: PySpark, Streaming Affects Versions: 1.4.0 Reporter: Amit Ramesh Priority: Critical See the following thread for context. http://apache-spark-developers-list.1001551.n3.nabble.com/Re-Spark-1-4-Python-API-for-getting-Kafka-offsets-in-direct-mode-tt12714.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6599) Improve reliability and usability of Kinesis-based Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-6599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605591#comment-14605591 ] Przemyslaw Pastuszka commented on SPARK-6599: - Is there any work being done on this? Can I help somehow? Improve reliability and usability of Kinesis-based Spark Streaming -- Key: SPARK-6599 URL: https://issues.apache.org/jira/browse/SPARK-6599 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Currently, the KinesisReceiver can lose some data in the case of certain failures (receiver and driver failures). Using the write ahead logs can mitigate some of the problem, but it is not ideal because WALs don't work with S3 (eventual consistency, etc.), which is the most likely file system to be used in the EC2 environment. Hence, we have to take a different approach to improving reliability for Kinesis. A detailed design doc on how this can be achieved will be added later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605657#comment-14605657 ] Lars Francke commented on SPARK-2447: - Hey Ted et al., thanks for the work on this. SparkOnHBase is super useful and clients are happily using it. I wonder, however, what the future direction will be. Any progress on the question of whether it's going to be integrated into Spark or not? I don't have a strong opinion either way, but I also don't feel that it would be _wrong_ to put it into core... Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support python? (python may be a different Jira; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate Jira; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8693) profiles and goals are not printed in a nice way
[ https://issues.apache.org/jira/browse/SPARK-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8693: --- Assignee: (was: Apache Spark) profiles and goals are not printed in a nice way Key: SPARK-8693 URL: https://issues.apache.org/jira/browse/SPARK-8693 Project: Spark Issue Type: Sub-task Components: Build, Project Infra Reporter: Yin Huai Priority: Minor In our master build, I see {code} -Phadoop-1[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Dhadoop.version=1.0.4[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Pkinesis-asl[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Phive-thriftserver[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Phive[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: package[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: assembly/assembly[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: streaming-kafka-assembly/assembly {code} Seems we format the string in a wrong way? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8693) profiles and goals are not printed in a nice way
[ https://issues.apache.org/jira/browse/SPARK-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8693: --- Assignee: Apache Spark profiles and goals are not printed in a nice way Key: SPARK-8693 URL: https://issues.apache.org/jira/browse/SPARK-8693 Project: Spark Issue Type: Sub-task Components: Build, Project Infra Reporter: Yin Huai Assignee: Apache Spark Priority: Minor In our master build, I see {code} -Phadoop-1[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Dhadoop.version=1.0.4[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Pkinesis-asl[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Phive-thriftserver[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Phive[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: package[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: assembly/assembly[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: streaming-kafka-assembly/assembly {code} Seems we format the string in a wrong way? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8636) CaseKeyWhen has incorrect NULL handling
[ https://issues.apache.org/jira/browse/SPARK-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605557#comment-14605557 ] Santiago M. Mola commented on SPARK-8636: - [~davies], [~animeshbaranawal] In SQL, NULL is never equal to NULL. Any comparison to NULL is UNKNOWN. Most SQL implementations represent UNKNOWN as NULL, too. CaseKeyWhen has incorrect NULL handling --- Key: SPARK-8636 URL: https://issues.apache.org/jira/browse/SPARK-8636 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Santiago M. Mola Labels: starter The CaseKeyWhen implementation in Spark uses the following equals implementation: {code} private def equalNullSafe(l: Any, r: Any) = { if (l == null && r == null) { true } else if (l == null || r == null) { false } else { l == r } } {code} This is not correct, since in SQL, NULL is never equal to NULL (actually, it is not unequal either). In this case, a NULL value in a CASE WHEN expression should never match. For example, you can execute this in MySQL: {code} SELECT CASE NULL WHEN NULL THEN 'NULL MATCHES' ELSE 'NULL DOES NOT MATCH' END FROM DUAL; {code} And the result will be 'NULL DOES NOT MATCH'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
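The matching rule Santiago describes can be sketched in a few lines. This is a minimal Python model of SQL CASE-key semantics (the function name and structure are illustrative, not Spark's implementation): a NULL key never matches any WHEN branch, because any comparison involving NULL is UNKNOWN, so evaluation falls through to ELSE.

```python
def sql_case_key_when(key, branches, default=None):
    """Evaluate CASE key WHEN v THEN r ... ELSE default END with SQL semantics.

    branches is a list of (when_value, result) pairs; None stands in for NULL.
    """
    for when_value, result in branches:
        # NULL = NULL is UNKNOWN in SQL, not TRUE, so any comparison
        # involving None is skipped rather than treated as a match.
        if key is not None and when_value is not None and key == when_value:
            return result
    return default

# CASE NULL WHEN NULL THEN 'NULL MATCHES' ELSE 'NULL DOES NOT MATCH' END
# falls through to ELSE, matching the MySQL behavior quoted above.
result = sql_case_key_when(None, [(None, "NULL MATCHES")], "NULL DOES NOT MATCH")
```

The null-safe equality quoted in the issue would instead return "NULL MATCHES" here, which is exactly the bug being reported.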
[jira] [Commented] (SPARK-8693) profiles and goals are not printed in a nice way
[ https://issues.apache.org/jira/browse/SPARK-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605606#comment-14605606 ] Apache Spark commented on SPARK-8693: - User 'brennonyork' has created a pull request for this issue: https://github.com/apache/spark/pull/7085 profiles and goals are not printed in a nice way Key: SPARK-8693 URL: https://issues.apache.org/jira/browse/SPARK-8693 Project: Spark Issue Type: Sub-task Components: Build, Project Infra Reporter: Yin Huai Priority: Minor In our master build, I see {code} -Phadoop-1[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Dhadoop.version=1.0.4[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Pkinesis-asl[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Phive-thriftserver[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Phive[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: package[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: assembly/assembly[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: streaming-kafka-assembly/assembly {code} Seems we format the string in a wrong way? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
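The mangled build output quoted in SPARK-8693 looks like the whole "[info] Building Spark ... with these arguments:" prefix is re-emitted for every argument instead of being printed once. A hypothetical Python sketch of the intended formatting (names and strings are illustrative, not the actual build-script code):

```python
def format_build_line(prefix, args):
    # Emit the prefix once and join all arguments with spaces,
    # rather than re-applying the prefix per argument.
    return prefix + " " + " ".join(args)

args = ["-Phadoop-1", "-Dhadoop.version=1.0.4", "-Pkinesis-asl", "package"]
line = format_build_line(
    "[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments:",
    args,
)
```

With this shape the prefix appears exactly once per build invocation, and the arguments read as a single flag list.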
[jira] [Commented] (SPARK-8704) Add additional methods to wrappers in ml.pyspark.feature
[ https://issues.apache.org/jira/browse/SPARK-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605611#comment-14605611 ] Apache Spark commented on SPARK-8704: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/7086 Add additional methods to wrappers in ml.pyspark.feature Key: SPARK-8704 URL: https://issues.apache.org/jira/browse/SPARK-8704 Project: Spark Issue Type: New Feature Components: ML, PySpark Reporter: Manoj Kumar Add std and mean to StandardScalerModel; getVectors and findSynonyms to Word2VecModel; setFeatures and getFeatures to HashingTF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8666) checkpointing does not take advantage of persisted/cached RDDs
[ https://issues.apache.org/jira/browse/SPARK-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Glenn Strycker closed SPARK-8666. - checkpointing does not take advantage of persisted/cached RDDs -- Key: SPARK-8666 URL: https://issues.apache.org/jira/browse/SPARK-8666 Project: Spark Issue Type: New Feature Reporter: Glenn Strycker I have been noticing that when checkpointing RDDs, all operations are occurring TWICE. For example, when I run the following code and watch the stages... {noformat} val newRDD = prevRDD.map(a => (a._1, 1L)).distinct.persist(StorageLevel.MEMORY_AND_DISK_SER) newRDD.checkpoint print(newRDD.count()) {noformat} I see distinct and count operations appearing TWICE, and shuffle disk writes and reads (from the distinct) occurring TWICE. My newRDD is persisted to memory, so why can't the checkpoint simply save those partitions to disk when the first operations have completed? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
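The doubled work Glenn observes can be modeled outside Spark. Below is a toy Python model (not Spark code; all names are invented for illustration) of lazy evaluation with a cache: an action plus a checkpoint job runs the lineage twice when the checkpoint job ignores the cache, and once when it reuses the cached partitions, which is the behavior the issue asks for.

```python
class ToyRDD:
    """Toy model of a lazily evaluated, optionally cached dataset."""

    def __init__(self, compute_fn):
        self._compute_fn = compute_fn
        self._persisted = False
        self._cache = None
        self.compute_calls = 0  # how many times the full lineage actually ran

    def persist(self):
        self._persisted = True  # like Spark's persist(), this is lazy
        return self

    def _materialize(self, use_cache):
        if use_cache and self._cache is not None:
            return self._cache
        self.compute_calls += 1
        data = self._compute_fn()
        if self._persisted:
            self._cache = data
        return data

    def count(self):
        return len(self._materialize(use_cache=True))

    def checkpoint_job(self, reads_cache):
        # Checkpointing runs as a separate job after the first action.
        # If that job re-runs the lineage instead of reading the cached
        # partitions, the same work shows up twice in the stages.
        self._materialize(use_cache=reads_cache)

rdd = ToyRDD(lambda: sorted(set(range(5)))).persist()
rdd.count()                            # first action: computes and caches
rdd.checkpoint_job(reads_cache=False)  # naive checkpoint: recomputes
naive_runs = rdd.compute_calls         # lineage ran twice

rdd2 = ToyRDD(lambda: sorted(set(range(5)))).persist()
rdd2.count()
rdd2.checkpoint_job(reads_cache=True)  # desired: reuse cached partitions
cached_runs = rdd2.compute_calls       # lineage ran once
```

The gap between `naive_runs` and `cached_runs` is the duplicated distinct/count work visible in the stage view.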
[jira] [Created] (SPARK-8705) Javascript error in the web console when `totalExecutionTime` of a task is 0
Shixiong Zhu created SPARK-8705: --- Summary: Javascript error in the web console when `totalExecutionTime` of a task is 0 Key: SPARK-8705 URL: https://issues.apache.org/jira/browse/SPARK-8705 Project: Spark Issue Type: Bug Components: Web UI Reporter: Shixiong Zhu Because System.currentTimeMillis() is not accurate for tasks that only take several milliseconds, sometimes totalExecutionTime in makeTimeline will be 0. If totalExecutionTime is 0, there will be the following error in the console. !https://cloud.githubusercontent.com/assets/1000778/8406776/5cd38e04-1e92-11e5-89f2-0c5134fe4b6b.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605663#comment-14605663 ] Ted Malaska commented on SPARK-2447: Yeah, I have talked a lot with TD (Spark), Job H (HBase), Stacks (HBase) about this. Neither thinks HBase or Spark is the right project to put it in. Right now the code is in Cloudera Labs and on GitHub and works for CDH 5.3 and 5.4; we have a number of clients on it. There is talk of making it an Apache project. It is Apache licensed, but it would be nice to put it under Apache totally. The problem is it is so simple that sometimes it feels too small to be its own project. The design is just to have an HBase connection in a static location in the executor. I know other NoSQL systems brag about local gets, but HBase already had that even without SparkOnHBase. The Table input format already gives you local gets. All SparkOnHBase gives you is an active connection that can be accessed in the distributed functions of Spark, which is very important to some use cases, like Spark Streaming and complex graph work that exploits locality. Let me know. We are open to ideas. Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support python? 
(python may be a different Jira; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate Jira; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
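The design Ted describes, "an HBase connection in a static location in the executor", is essentially a lazily initialized per-process singleton that distributed functions share instead of reconnecting per task or per record. A hedged Python sketch of that pattern follows; `FakeConnection`, `get_connection`, and `upsert_partition` are stand-ins invented for illustration, not the HBase or SparkOnHBase API.

```python
_connection = None  # one shared connection per executor process

class FakeConnection:
    """Stand-in for an HBase connection, only here to make the sketch runnable."""
    def put(self, row):
        return "stored:" + row

def get_connection():
    # Lazily create the connection once per process, so every task that
    # runs in this executor reuses it rather than opening a new one.
    global _connection
    if _connection is None:
        _connection = FakeConnection()
    return _connection

def upsert_partition(rows):
    # The kind of function you would hand to foreachPartition: fetch the
    # shared connection once, then write the whole partition through it.
    conn = get_connection()
    return [conn.put(r) for r in rows]
```

Because the connection lives at module scope, repeated calls return the same object, which is what makes the pattern cheap enough for Spark Streaming micro-batches.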
[jira] [Created] (SPARK-8704) Add additional methods to wrappers in ml.pyspark.feature
Manoj Kumar created SPARK-8704: -- Summary: Add additional methods to wrappers in ml.pyspark.feature Key: SPARK-8704 URL: https://issues.apache.org/jira/browse/SPARK-8704 Project: Spark Issue Type: New Feature Components: ML, PySpark Reporter: Manoj Kumar Add std and mean to StandardScalerModel; getVectors and findSynonyms to Word2VecModel; setFeatures and getFeatures to HashingTF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605639#comment-14605639 ] Rakesh Chalasani commented on SPARK-8587: - Hi Sam, computeCost now returns the cumulative cost over a dataset, rather than the cost per sample, which I think is what this JIRA is for. Internally, predict does compute the distance to the nearest point but returns only the predicted center. So, adding a method that returns distances would be doing the job twice, and that is what was pointed out above by Bradley. In Pipelines, on the other hand, this can be handled more gracefully and efficiently by adding a column to the returned DF. If that is good for you, can you close this JIRA? I will create another one for adding distances to the KMeans pipeline once that is merged. Thanks. Return cost and cluster index KMeansModel.predict - Key: SPARK-8587 URL: https://issues.apache.org/jira/browse/SPARK-8587 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Sam Stoelinga Priority: Minor Looking at PySpark's implementation of KMeansModel.predict https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102 : Currently: it calculates the cost of the closest cluster and returns the index only. My expectation: an easy way to let the same function or a new function return the cost with the index. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
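The duplication Rakesh points out (predict already computes the distance, then a cost call recomputes it) disappears if a single pass returns both values. A minimal Python sketch of that idea, not MLlib's API; the function name is hypothetical:

```python
def predict_with_cost(point, centers):
    """Return (nearest_center_index, squared_distance) in one pass, so the
    distance computed while predicting is not thrown away and redone."""
    best_index, best_cost = -1, float("inf")
    for i, center in enumerate(centers):
        # Squared Euclidean distance to this candidate center.
        cost = sum((p - c) ** 2 for p, c in zip(point, center))
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index, best_cost
```

In a Pipelines setting the same pair would surface as an extra column on the output DataFrame, as suggested in the comment.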
[jira] [Commented] (SPARK-8705) Javascript error in the web console when `totalExecutionTime` of a task is 0
[ https://issues.apache.org/jira/browse/SPARK-8705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605646#comment-14605646 ] Shixiong Zhu commented on SPARK-8705: - A simple fix is to not add {{rect}}s to the {{svg}} when {{totalExecutionTime}} is 0 in https://github.com/apache/spark/blob/04ddcd4db7801abefa9c9effe5d88413b29d713b/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala#L599 This conflicts with https://github.com/apache/spark/pull/7082 , so I will send a PR after pr #7082 is merged. Javascript error in the web console when `totalExecutionTime` of a task is 0 Key: SPARK-8705 URL: https://issues.apache.org/jira/browse/SPARK-8705 Project: Spark Issue Type: Bug Components: Web UI Reporter: Shixiong Zhu Because System.currentTimeMillis() is not accurate for tasks that only take several milliseconds, sometimes totalExecutionTime in makeTimeline will be 0. If totalExecutionTime is 0, there will be the following error in the console. !https://cloud.githubusercontent.com/assets/1000778/8406776/5cd38e04-1e92-11e5-89f2-0c5134fe4b6b.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
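The guard proposed in the comment (skip the rects when the total is 0) amounts to never dividing by a zero-length task duration when sizing timeline segments. A hedged Python sketch of that guard; the real fix lives in the Scala code that generates the page's JavaScript, and this function is only an illustration:

```python
def segment_widths(durations, total_execution_time):
    """Compute each segment's share of a task's timeline bar, in percent.

    For sub-millisecond tasks, System.currentTimeMillis()-based timing can
    yield totalExecutionTime == 0; dividing by it would fail, so draw
    nothing for such tasks instead of emitting broken markup.
    """
    if total_execution_time == 0:
        return []  # nothing to draw for a zero-length task
    return [100.0 * d / total_execution_time for d in durations]
```

The empty list corresponds to emitting no `rect` elements at all for the zero-duration task.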
[jira] [Commented] (SPARK-8660) Update comments that contain R statements in ml.logisticRegressionSuite
[ https://issues.apache.org/jira/browse/SPARK-8660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605481#comment-14605481 ] somil deshmukh commented on SPARK-8660: --- In LogisticRegressionSuite, I will replace the /** comment style with /*, like this: /* Using the following R code to load the data and train the model using the glmnet package. library(glmnet) data <- read.csv(path, header=FALSE) label = factor(data$V1) features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5)) weights = coef(glmnet(features, label, family="binomial", alpha = 0, lambda = 0)) weights 5 x 1 sparse Matrix of class dgCMatrix s0 (Intercept) 2.8366423 data.V2 -0.5895848 data.V3 0.8931147 data.V4 -0.3925051 data.V5 -0.7996864 */ Update comments that contain R statements in ml.logisticRegressionSuite --- Key: SPARK-8660 URL: https://issues.apache.org/jira/browse/SPARK-8660 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Priority: Trivial Labels: starter Original Estimate: 20m Remaining Estimate: 20m We put R statements as comments in unit tests. However, there are two issues: 1. JavaDoc style /** ... */ is used instead of a normal multiline comment /* ... */. 2. We put a leading * on each line. It is hard to copy/paste the commands to/from R and verify the result. For example, in https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala#L504 {code} /** * Using the following R code to load the data and train the model using glmnet package. * * library(glmnet) * data <- read.csv(path, header=FALSE) * label = factor(data$V1) * features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5)) * weights = coef(glmnet(features, label, family="binomial", alpha = 1.0, lambda = 6.0)) * weights * 5 x 1 sparse Matrix of class dgCMatrix * s0 * (Intercept) -0.2480643 * data.V2 0.000 * data.V3 . * data.V4 . * data.V5 . 
*/ {code} should change to {code} /* Using the following R code to load the data and train the model using glmnet package. library(glmnet) data <- read.csv(path, header=FALSE) label = factor(data$V1) features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5)) weights = coef(glmnet(features, label, family="binomial", alpha = 1.0, lambda = 6.0)) weights 5 x 1 sparse Matrix of class dgCMatrix s0 (Intercept) -0.2480643 data.V2 0.000 data.V3 . data.V4 . data.V5 . */ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8703) Add CountVectorizer as a ml transformer to convert document to words count vector
yuhao yang created SPARK-8703: - Summary: Add CountVectorizer as a ml transformer to convert document to words count vector Key: SPARK-8703 URL: https://issues.apache.org/jira/browse/SPARK-8703 Project: Spark Issue Type: New Feature Components: ML Reporter: yuhao yang Converts a text document to a sparse vector of token counts. I can further add an estimator to extract the vocabulary from the corpus if that's appropriate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
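The transformer proposed above maps a tokenized document to a sparse vector of token counts over a fixed vocabulary. A minimal Python sketch of that idea (names are illustrative; this is not the eventual ml.feature API, which would operate on DataFrame columns):

```python
def count_vectorize(tokens, vocabulary):
    """Convert a tokenized document into a sparse token-index -> count map,
    given a fixed vocabulary mapping each known token to an index.
    Tokens outside the vocabulary are ignored."""
    counts = {}
    for t in tokens:
        i = vocabulary.get(t)
        if i is not None:
            counts[i] = counts.get(i, 0) + 1
    return counts

# The estimator mentioned in the issue would learn this vocabulary from a
# corpus; here it is simply given.
vocab = {"spark": 0, "fast": 1}
vec = count_vectorize(["spark", "is", "fast", "spark"], vocab)
```

The sparse map representation only stores nonzero counts, which is what makes the output a sparse vector.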
[jira] [Created] (SPARK-8702) Avoid massive concating strings in Javascript
Shixiong Zhu created SPARK-8702: --- Summary: Avoid massive concating strings in Javascript Key: SPARK-8702 URL: https://issues.apache.org/jira/browse/SPARK-8702 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Shixiong Zhu When there are massive tasks, such as {{sc.parallelize(1 to 10, 1).count()}}, the generated JS code has a lot of string concatenations in the stage page, nearly 40 string concatenations for one task. We can generate the whole string for a task instead of executing string concatenations in the browser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
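The "generate the whole string per task" idea is the familiar build-then-join pattern: format each task's markup once and join the results, instead of growing one string through dozens of incremental concatenations. A hedged Python sketch of the pattern (the markup and field names are invented for illustration, not Spark's actual timeline output):

```python
def task_markup(task):
    # Produce the complete per-task string in one formatting call.
    return "<rect id='task-{id}' width='{w}' height='{h}'/>".format(**task)

def timeline_markup(tasks):
    # A single join over pre-built pieces avoids the quadratic cost of
    # repeated `+=` on an immutable string, one copy per concatenation.
    return "".join(task_markup(t) for t in tasks)

page = timeline_markup([
    {"id": 1, "w": 10, "h": 4},
    {"id": 2, "w": 20, "h": 4},
])
```

The same reasoning applies in the browser: emitting one prebuilt string per task keeps the JS engine from re-copying ever-longer intermediate strings.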
[jira] [Updated] (SPARK-7398) Add back-pressure to Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Iulian Dragos updated SPARK-7398: - Description: Spark Streaming has trouble dealing with situations where batch processing time > batch interval, meaning a high throughput of input data w.r.t. Spark's ability to remove data from the queue. If this throughput is sustained for long enough, it leads to an unstable situation where the memory of the Receiver's Executor is overflowed. This aims at transmitting a back-pressure signal back to data ingestion to help with dealing with that high throughput, in a backwards-compatible way. The original design doc can be found here: https://docs.google.com/document/d/1ZhiP_yBHcbjifz8nJEyPJpHqxB1FT6s8-Zk7sAfayQw/edit?usp=sharing The second design doc (without all the background info, and more centered on the implementation) can be found here: https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing was: Spark Streaming has trouble dealing with situations where batch processing time > batch interval, meaning a high throughput of input data w.r.t. Spark's ability to remove data from the queue. If this throughput is sustained for long enough, it leads to an unstable situation where the memory of the Receiver's Executor is overflowed. This aims at transmitting a back-pressure signal back to data ingestion to help with dealing with that high throughput, in a backwards-compatible way. 
The design doc can be found here: https://docs.google.com/document/d/1ZhiP_yBHcbjifz8nJEyPJpHqxB1FT6s8-Zk7sAfayQw/edit?usp=sharing Add back-pressure to Spark Streaming Key: SPARK-7398 URL: https://issues.apache.org/jira/browse/SPARK-7398 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.1 Reporter: François Garillot Priority: Critical Labels: streams Spark Streaming has trouble dealing with situations where batch processing time > batch interval, meaning a high throughput of input data w.r.t. Spark's ability to remove data from the queue. If this throughput is sustained for long enough, it leads to an unstable situation where the memory of the Receiver's Executor is overflowed. This aims at transmitting a back-pressure signal back to data ingestion to help with dealing with that high throughput, in a backwards-compatible way. The original design doc can be found here: https://docs.google.com/document/d/1ZhiP_yBHcbjifz8nJEyPJpHqxB1FT6s8-Zk7sAfayQw/edit?usp=sharing The second design doc (without all the background info, and more centered on the implementation) can be found here: https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
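The core idea of back-pressure can be sketched as a simple rate controller (a toy model in plain Python, not the algorithm from the design docs): when the last batch took longer than the batch interval, the ingestion rate is scaled down proportionally so the receiver's queue stops growing.

```python
def adjust_rate(current_rate, processing_time, batch_interval, min_rate=10):
    """Scale the ingestion rate by how much the last batch over- or under-ran."""
    if processing_time <= 0:
        return current_rate
    new_rate = current_rate * (batch_interval / processing_time)
    return max(min_rate, new_rate)  # never throttle to zero

# Batch took 2s but the interval is 1s -> halve the rate.
assert adjust_rate(1000, processing_time=2.0, batch_interval=1.0) == 500
# Batch finished early -> the rate is allowed to grow back.
assert adjust_rate(1000, processing_time=0.5, batch_interval=1.0) == 2000
```

Feeding this signal back to the receiver keeps the system stable under sustained high throughput instead of overflowing the Receiver's Executor memory.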
[jira] [Commented] (SPARK-7398) Add back-pressure to Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605338#comment-14605338 ] Iulian Dragos commented on SPARK-7398: -- [~tdas] here it is: https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing Add back-pressure to Spark Streaming Key: SPARK-7398 URL: https://issues.apache.org/jira/browse/SPARK-7398 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.1 Reporter: François Garillot Priority: Critical Labels: streams Spark Streaming has trouble dealing with situations where batch processing time > batch interval, meaning a high throughput of input data w.r.t. Spark's ability to remove data from the queue. If this throughput is sustained for long enough, it leads to an unstable situation where the memory of the Receiver's Executor is overflowed. This aims at transmitting a back-pressure signal back to data ingestion to help with dealing with that high throughput, in a backwards-compatible way. The design doc can be found here: https://docs.google.com/document/d/1ZhiP_yBHcbjifz8nJEyPJpHqxB1FT6s8-Zk7sAfayQw/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8374) Job frequently hangs after YARN preemption
[ https://issues.apache.org/jira/browse/SPARK-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605457#comment-14605457 ] Apache Spark commented on SPARK-8374: - User 'xuchenCN' has created a pull request for this issue: https://github.com/apache/spark/pull/7083 Job frequently hangs after YARN preemption -- Key: SPARK-8374 URL: https://issues.apache.org/jira/browse/SPARK-8374 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.0 Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04 Reporter: Shay Rojansky Priority: Critical After upgrading to Spark 1.4.0, jobs that get preempted very frequently will not reacquire executors and will therefore hang. To reproduce: 1. I run Spark job A that acquires all grid resources 2. I run Spark job B in a higher-priority queue that acquires all grid resources. Job A is fully preempted. 3. Kill job B, releasing all resources 4. Job A should at this point reacquire all grid resources, but occasionally doesn't. Repeating the preemption scenario makes the bad behavior occur within a few attempts. (see logs at bottom). Note issue SPARK-7451 that was supposed to fix some Spark YARN preemption issues; the work there may be related to the new issues. The 1.4.0 preemption situation is considerably worse than 1.3.1 (we've downgraded to 1.3.1 just because of this issue). Logs -- When job B (the preemptor) first acquires an application master, the following is logged by job A (the preemptee): {noformat} ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc client disassociated INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 
WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost) INFO DAGScheduler: Executor lost: 447 (epoch 0) INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from BlockManagerMaster. INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, g023.grid.eaglerd.local, 41406) INFO BlockManagerMaster: Removed 447 successfully in removeExecutor {noformat} (It's strange for errors/warnings to be logged for preemption) Later, when job B's AM starts requesting its resources, I get lots of the following in job A: {noformat} ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc client disassociated INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0 WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost) WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. {noformat} Finally, when I kill job B, job A emits lots of the following: {noformat} INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31 WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist! {noformat} And finally after some time: {noformat} WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 165964 ms exceeds timeout 12 ms ERROR YarnScheduler: Lost an executor 466 (already removed): Executor heartbeat timed out after 165964 ms {noformat} At this point the job never requests/acquires more resources and hangs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8310) Spark EC2 branch in 1.4 is wrong
[ https://issues.apache.org/jira/browse/SPARK-8310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605411#comment-14605411 ] Daniel Darabos commented on SPARK-8310: --- It's an easy mistake to make, and one of the few things that are not covered by the release candidate process. We tested the release candidate on EC2, but we had to specifically override the version, since at that point there was no released 1.4.0. I have no idea how this could be avoided for future releases. Spark EC2 branch in 1.4 is wrong Key: SPARK-8310 URL: https://issues.apache.org/jira/browse/SPARK-8310 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.0 Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Priority: Critical Fix For: 1.4.1, 1.5.0 It points to `branch-1.3` of spark-ec2 right now while it should point to `branch-1.4` cc [~brdwrd] [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8374) Job frequently hangs after YARN preemption
[ https://issues.apache.org/jira/browse/SPARK-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605422#comment-14605422 ] Xu Chen commented on SPARK-8374: It seems the AM didn't add a ContainerRequest after resources were preempted. I can provide a patch for this issue; could you help me test it? Job frequently hangs after YARN preemption -- Key: SPARK-8374 URL: https://issues.apache.org/jira/browse/SPARK-8374 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.0 Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04 Reporter: Shay Rojansky Priority: Critical After upgrading to Spark 1.4.0, jobs that get preempted very frequently will not reacquire executors and will therefore hang. To reproduce: 1. I run Spark job A that acquires all grid resources 2. I run Spark job B in a higher-priority queue that acquires all grid resources. Job A is fully preempted. 3. Kill job B, releasing all resources 4. Job A should at this point reacquire all grid resources, but occasionally doesn't. Repeating the preemption scenario makes the bad behavior occur within a few attempts. (see logs at bottom). Note issue SPARK-7451 that was supposed to fix some Spark YARN preemption issues; the work there may be related to the new issues. The 1.4.0 preemption situation is considerably worse than 1.3.1 (we've downgraded to 1.3.1 just because of this issue). Logs -- When job B (the preemptor) first acquires an application master, the following is logged by job A (the preemptee): {noformat} ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc client disassociated INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 
WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost) INFO DAGScheduler: Executor lost: 447 (epoch 0) INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from BlockManagerMaster. INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, g023.grid.eaglerd.local, 41406) INFO BlockManagerMaster: Removed 447 successfully in removeExecutor {noformat} (It's strange for errors/warnings to be logged for preemption) Later, when job B's AM starts requesting its resources, I get lots of the following in job A: {noformat} ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc client disassociated INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0 WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost) WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. {noformat} Finally, when I kill job B, job A emits lots of the following: {noformat} INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31 WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist! {noformat} And finally after some time: {noformat} WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 165964 ms exceeds timeout 12 ms ERROR YarnScheduler: Lost an executor 466 (already removed): Executor heartbeat timed out after 165964 ms {noformat} At this point the job never requests/acquires more resources and hangs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
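The suspected bug—after preemption the AM holds fewer executors than its target but files no new container requests—amounts to the following bookkeeping, sketched in plain Python (hypothetical names; the real logic lives in Spark's YARN allocator):

```python
def missing_requests(target_executors, running, pending_requests):
    """How many new ContainerRequests the AM should file to get back to target."""
    return max(0, target_executors - running - pending_requests)

# Steady state: 10 running, nothing pending -> nothing to request.
assert missing_requests(10, running=10, pending_requests=0) == 0
# All 10 executors preempted; if pending stays 0 the AM must file 10 new requests.
# If this recomputation never happens after a preemption, the job hangs.
assert missing_requests(10, running=0, pending_requests=0) == 10
```

The hang described above is consistent with this count never being re-evaluated once containers are lost to preemption.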
[jira] [Commented] (SPARK-8661) Update comments that contain R statements in ml.LinearRegressionSuite
[ https://issues.apache.org/jira/browse/SPARK-8661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605483#comment-14605483 ] somil deshmukh commented on SPARK-8661: --- In LinearRegressionSuite.class, I can replace /** with /*, like this /* Using the following R code to load the data and train the model using glmnet package. library(glmnet) data <- read.csv(path, header=FALSE, stringsAsFactors=FALSE) features <- as.matrix(data.frame(as.numeric(data$V2), as.numeric(data$V3))) label <- as.numeric(data$V1) weights <- coef(glmnet(features, label, family="gaussian", alpha = 0, lambda = 0)) weights 3 x 1 sparse Matrix of class dgCMatrix s0 (Intercept) 6.300528 as.numeric.data.V2. 4.701024 as.numeric.data.V3. 7.198257 */ Do you want me to remove /** for each method, or only for this specific method? Update comments that contain R statements in ml.LinearRegressionSuite - Key: SPARK-8661 URL: https://issues.apache.org/jira/browse/SPARK-8661 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Labels: starter Original Estimate: 20m Remaining Estimate: 20m Similar to SPARK-8660, but for ml.LinearRegressionSuite: https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8703) Add CountVectorizer as a ml transformer to convert document to words count vector
[ https://issues.apache.org/jira/browse/SPARK-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605503#comment-14605503 ] Apache Spark commented on SPARK-8703: - User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/7084 Add CountVectorizer as a ml transformer to convert document to words count vector - Key: SPARK-8703 URL: https://issues.apache.org/jira/browse/SPARK-8703 Project: Spark Issue Type: New Feature Components: ML Reporter: yuhao yang Original Estimate: 24h Remaining Estimate: 24h Converts a text document to a sparse vector of token counts. I can further add an estimator to extract the vocabulary from a corpus if that's appropriate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8702) Avoid massive concating strings in Javascript
[ https://issues.apache.org/jira/browse/SPARK-8702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-8702: -- Assignee: Shixiong Zhu Avoid massive concating strings in Javascript - Key: SPARK-8702 URL: https://issues.apache.org/jira/browse/SPARK-8702 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Shixiong Zhu Assignee: Shixiong Zhu When there are massive tasks, such as {{sc.parallelize(1 to 10, 1).count()}}, the generated JS code has a lot of string concatenations in the stage page, nearly 40 string concatenations for one task. We can generate the whole string for a task instead of executing string concatenations in the browser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8385) java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation
[ https://issues.apache.org/jira/browse/SPARK-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605712#comment-14605712 ] Ángel Álvarez commented on SPARK-8385: -- A simple WordCount test worked fine in my Eclipse environment with Spark 1.4 (in both local and yarn-cluster modes). Make sure you don't have any reference to the previous 1.3 version in your project and launch configuration. java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation - Key: SPARK-8385 URL: https://issues.apache.org/jira/browse/SPARK-8385 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.4.0 Environment: RHEL 7.1 Reporter: Peter Haumer I used to be able to debug my Spark apps in Eclipse. With Spark 1.3.1 I created a launch and just set the vm var -Dspark.master=local[4]. With 1.4 this stopped working when reading files from the OS filesystem. Running the same apps with spark-submit works fine. Losing the ability to debug that way has a major impact on the usability of Spark. 
The following exception is thrown: Exception in thread main java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:213) at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2401) at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2411) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:166) at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:653) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:389) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362) at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762) at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at 
scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1535) at org.apache.spark.rdd.RDD.reduce(RDD.scala:900) at org.apache.spark.api.java.JavaRDDLike$class.reduce(JavaRDDLike.scala:357) at org.apache.spark.api.java.AbstractJavaRDDLike.reduce(JavaRDDLike.scala:46) at com.databricks.apps.logs.LogAnalyzer.main(LogAnalyzer.java:60) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7894) Graph Union Operator
[ https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605739#comment-14605739 ] Arnab commented on SPARK-7894: -- Short description of changes: Introduced union functionality in EdgeRDD, VertexRDD and Graph classes (there is no union functionality in EdgeRdd and VertexRdd directly as pointed out by shijinkui) Added code for merging partitions in Edge and Vertex partitions Added test case for graph union (as in Jira), also unit tests for union of edges and vertices Graph Union Operator Key: SPARK-7894 URL: https://issues.apache.org/jira/browse/SPARK-7894 Project: Spark Issue Type: Sub-task Components: GraphX Reporter: Andy Huang Labels: graph, union Attachments: union_operator.png This operator aims to union two graphs and generate a new graph directly. The union of two graphs is the union of their vertex sets and their edge families. Vertices and edges which are included in either graph will be part of the new graph. bq. G ∪ H = (VG ∪ VH, EG ∪ EH). The below image shows a union of graph G and graph H !union_operator.png|width=600px,align=center! A simple interface would be: bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED] However, inevitably vertex and edge overlapping will happen between the borders of the graphs. For vertices, it's quite natural to just make a union and remove the duplicate ones. But for edges, a mergeEdges function seems to be more reasonable. bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: (ED, ED) => ED): Graph[VD, ED] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
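The proposed semantics (union of vertex sets with de-duplication, duplicate edges combined via mergeEdges) can be sketched on plain Python dictionaries (illustrative only; the real operator would work on GraphX's VertexRDD/EdgeRDD):

```python
def graph_union(g, h, merge_edges):
    """g, h: (vertices: set, edges: {(src, dst): attr}). Duplicate edges are merged."""
    vertices = g[0] | h[0]          # set union de-duplicates vertices
    edges = dict(g[1])
    for key, attr in h[1].items():
        # An edge present in both graphs goes through merge_edges.
        edges[key] = merge_edges(edges[key], attr) if key in edges else attr
    return vertices, edges

g = ({1, 2}, {(1, 2): 5})
h = ({2, 3}, {(1, 2): 7, (2, 3): 1})
v, e = graph_union(g, h, merge_edges=lambda a, b: a + b)
print(v)  # {1, 2, 3}
print(e)  # {(1, 2): 12, (2, 3): 1}
```

Passing a different merge_edges (e.g. min, max, or "keep left") gives the caller control over overlapping edges, which is exactly why the second interface above takes it as a parameter.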
[jira] [Resolved] (SPARK-8385) java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation
[ https://issues.apache.org/jira/browse/SPARK-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8385. -- Resolution: Cannot Reproduce java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation - Key: SPARK-8385 URL: https://issues.apache.org/jira/browse/SPARK-8385 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.4.0 Environment: RHEL 7.1 Reporter: Peter Haumer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8385) java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation
[ https://issues.apache.org/jira/browse/SPARK-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605744#comment-14605744 ] Ángel Álvarez edited comment on SPARK-8385 at 6/29/15 3:15 PM: --- I could finally reproduce this same error in Eclipse (yarn-cluster mode) and it was due to a reference to the spark assembly 1.3 in my launch configuration. was (Author: angel2014): I could finally reproduce this same error in Eclipse (yarn-cluster mode) and it was due a reference to the spark assembly 1.3 in my launch configuration. java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation - Key: SPARK-8385 URL: https://issues.apache.org/jira/browse/SPARK-8385 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.4.0 Environment: RHEL 7.1 Reporter: Peter Haumer I used to be able to debug my Spark apps in Eclipse. With Spark 1.3.1 I created a launch and just set the vm var -Dspark.master=local[4]. With 1.4 this stopped working when reading files from the OS filesystem. Running the same apps with spark-submit works fine. Losing the ability to debug that way has a major impact on the usability of Spark. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8680) PropagateTypes is very slow when there are lots of columns
[ https://issues.apache.org/jira/browse/SPARK-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605747#comment-14605747 ] Apache Spark commented on SPARK-8680: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/7087 PropagateTypes is very slow when there are lots of columns -- Key: SPARK-8680 URL: https://issues.apache.org/jira/browse/SPARK-8680 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1, 1.4.0 Reporter: Davies Liu The time for PropagateTypes is O(N*N), where N is the number of columns, which is very slow if there are many columns (1000). The easiest optimization would be to move `q.inputSet` outside of transformExpressions, which could give about a 4x improvement for N=3000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
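The suggested optimization is a classic loop-invariant hoist: compute the input set once per plan node instead of once per expression. A plain-Python analogue (hypothetical names, not Catalyst's actual API):

```python
def propagate_slow(columns, expressions):
    # O(N*N)-flavored: the "input set" is rebuilt for every expression.
    out = []
    for expr in expressions:
        input_set = frozenset(columns)  # recomputed on every iteration
        out.append(expr in input_set)
    return out

def propagate_fast(columns, expressions):
    # Hoisted: compute the input set once, before iterating the expressions.
    input_set = frozenset(columns)
    return [expr in input_set for expr in expressions]

cols = [f"c{i}" for i in range(1000)]
assert propagate_slow(cols, ["c1", "x"]) == propagate_fast(cols, ["c1", "x"]) == [True, False]
```

Both variants return identical results; only the cost of rebuilding the set N times is removed, which matches the ~4x improvement the report cites for N=3000.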
[jira] [Assigned] (SPARK-8680) PropagateTypes is very slow when there are lots of columns
[ https://issues.apache.org/jira/browse/SPARK-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8680: --- Assignee: (was: Apache Spark) PropagateTypes is very slow when there are lots of columns -- Key: SPARK-8680 URL: https://issues.apache.org/jira/browse/SPARK-8680 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1, 1.4.0 Reporter: Davies Liu The time for PropagateTypes is O(N*N), where N is the number of columns, which is very slow when there are many columns (1000). The easiest optimization would be to move `q.inputSet` outside of transformExpressions, which could give about a 4x improvement for N=3000
[jira] [Assigned] (SPARK-8680) PropagateTypes is very slow when there are lots of columns
[ https://issues.apache.org/jira/browse/SPARK-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8680: --- Assignee: Apache Spark PropagateTypes is very slow when there are lots of columns -- Key: SPARK-8680 URL: https://issues.apache.org/jira/browse/SPARK-8680 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1, 1.4.0 Reporter: Davies Liu Assignee: Apache Spark The time for PropagateTypes is O(N*N), where N is the number of columns, which is very slow when there are many columns (1000). The easiest optimization would be to move `q.inputSet` outside of transformExpressions, which could give about a 4x improvement for N=3000
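The optimization described above is plain loop-invariant hoisting: `q.inputSet` does not change per column, so it should be computed once instead of once per expression. A minimal language-agnostic sketch of the idea in Python, where `input_set_of` is a hypothetical stand-in for `q.inputSet` (this is only the shape of the fix, not Spark's code):

```python
# Sketch of the O(N*N) -> O(N) fix: compute an invariant set once
# instead of once per column. All names here are illustrative.

def propagate_types_slow(columns, input_set_of):
    # Recomputes the invariant input set for every column: O(N*N).
    out = []
    for c in columns:
        inputs = input_set_of(columns)  # O(N) work, repeated N times
        out.append((c, c in inputs))
    return out

def propagate_types_fast(columns, input_set_of):
    # Hoist the invariant computation out of the loop: O(N) total lookups.
    inputs = input_set_of(columns)  # computed once
    return [(c, c in inputs) for c in columns]

def input_set_of(columns):
    # Toy stand-in for q.inputSet: an O(N) scan producing a set.
    return set(columns)

cols = ["c%d" % i for i in range(1000)]
assert propagate_types_slow(cols, input_set_of) == propagate_types_fast(cols, input_set_of)
```

Both versions return the same result; only the number of times the O(N) set construction runs changes.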
[jira] [Commented] (SPARK-8385) java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation
[ https://issues.apache.org/jira/browse/SPARK-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605744#comment-14605744 ] Ángel Álvarez commented on SPARK-8385: -- I could finally reproduce this same error in Eclipse (yarn-cluster mode) and it was due to a reference to the spark assembly 1.3 in my launch configuration. java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation - Key: SPARK-8385 URL: https://issues.apache.org/jira/browse/SPARK-8385 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.4.0 Environment: RHEL 7.1 Reporter: Peter Haumer I used to be able to debug my Spark apps in Eclipse. With Spark 1.3.1 I created a launch and just set the vm var -Dspark.master=local[4]. With 1.4 this stopped working when reading files from the OS filesystem. Running the same apps with spark-submit works fine. Losing the ability to debug that way has a major impact on the usability of Spark. The following exception is thrown: Exception in thread "main" java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:213) at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2401) at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2411) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:166) at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:653) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:389) at 
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362) at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762) at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1535) at org.apache.spark.rdd.RDD.reduce(RDD.scala:900) at org.apache.spark.api.java.JavaRDDLike$class.reduce(JavaRDDLike.scala:357) at 
org.apache.spark.api.java.AbstractJavaRDDLike.reduce(JavaRDDLike.scala:46) at com.databricks.apps.logs.LogAnalyzer.main(LogAnalyzer.java:60)
[jira] [Commented] (SPARK-8599) Use a Random operator to handle Random distribution generating expressions
[ https://issues.apache.org/jira/browse/SPARK-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605777#comment-14605777 ] Burak Yavuz commented on SPARK-8599: It would be great if it works for this case as well. I think [~mengxr] was hitting the bug during the filter phase for sampleBy. Use a Random operator to handle Random distribution generating expressions -- Key: SPARK-8599 URL: https://issues.apache.org/jira/browse/SPARK-8599 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Priority: Critical Right now, we are using expressions for Random distribution generating expressions. But, we have to track them in lots of places in the optimizer to handle them carefully. Otherwise, these expressions will be treated as stateless expressions and have unexpected behaviors (e.g. SPARK-8023).
[jira] [Resolved] (SPARK-8702) Avoid massive concating strings in Javascript
[ https://issues.apache.org/jira/browse/SPARK-8702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-8702. --- Resolution: Fixed Fix Version/s: 1.5.0 Target Version/s: 1.5.0 Avoid massive concating strings in Javascript - Key: SPARK-8702 URL: https://issues.apache.org/jira/browse/SPARK-8702 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Shixiong Zhu Assignee: Shixiong Zhu Fix For: 1.5.0 When there are massive tasks, such as {{sc.parallelize(1 to 10, 1).count()}}, the generated JS code has a lot of string concatenations in the stage page, nearly 40 string concatenations for one task. We can generate the whole string for a task instead of executing string concatenations in the browser.
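The fix described above amounts to the standard "build all the pieces, join once" pattern. A small Python sketch of the same idea (the HTML snippets and function names are illustrative, not Spark's actual timeline markup):

```python
# Repeated concatenation copies the growing string on every step;
# building the pieces and joining once allocates the result a single time.

def render_tasks_concat(tasks):
    html = ""
    for t in tasks:
        # Each += copies everything built so far: O(n^2) total work.
        html = html + "<div class='task'>" + str(t) + "</div>"
    return html

def render_tasks_join(tasks):
    # Generate the whole string in one pass, analogous to emitting the
    # complete per-task string server-side instead of concatenating in JS.
    return "".join("<div class='task'>%s</div>" % t for t in tasks)

assert render_tasks_concat(range(5)) == render_tasks_join(range(5))
assert render_tasks_join([1]) == "<div class='task'>1</div>"
```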
[jira] [Created] (SPARK-8706) Implement Pylint / Prospector checks for PySpark
Josh Rosen created SPARK-8706: - Summary: Implement Pylint / Prospector checks for PySpark Key: SPARK-8706 URL: https://issues.apache.org/jira/browse/SPARK-8706 Project: Spark Issue Type: New Feature Components: Project Infra, PySpark Reporter: Josh Rosen It would be nice to implement Pylint / Prospector (https://github.com/landscapeio/prospector) checks for PySpark. As with the style checker rules, I imagine that we'll want to roll out new rules gradually in order to avoid a mass refactoring commit. For starters, we should create a pull request that introduces the harness for running the linters, add a configuration file which enables only the lint checks that currently pass, and install the required dependencies on Jenkins. Once we've done this, we can open a series of smaller followup PRs to gradually enable more linting checks and to fix existing violations.
[jira] [Commented] (SPARK-8703) Add CountVectorizer as a ml transformer to convert document to words count vector
[ https://issues.apache.org/jira/browse/SPARK-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605902#comment-14605902 ] Liang-Chi Hsieh commented on SPARK-8703: Does org.apache.spark.mllib.feature.HashingTF already provide a similar function? If so, can this ml transformer reuse it? Add CountVectorizer as a ml transformer to convert document to words count vector - Key: SPARK-8703 URL: https://issues.apache.org/jira/browse/SPARK-8703 Project: Spark Issue Type: New Feature Components: ML Reporter: yuhao yang Original Estimate: 24h Remaining Estimate: 24h Converts a text document to a sparse vector of token counts. Similar to http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html I can further add an estimator to extract vocabulary from corpus if that's appropriate.
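For readers weighing the comment above, a minimal Python sketch of the difference between the two approaches, under the assumption that a CountVectorizer keeps an exact learned vocabulary while HashingTF hashes tokens into a fixed-size vector with possible collisions (function names are illustrative, not Spark's API):

```python
# Toy comparison: exact vocabulary-based count vectors vs. hashed counts.

def count_vectorize(docs):
    # Learn an explicit vocabulary from the corpus, then count per document.
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for doc in docs:
        v = [0] * len(vocab)
        for w in doc:
            v[index[w]] += 1
        vectors.append(v)
    return vocab, vectors

def hashing_tf(doc, num_features=16):
    # No vocabulary is kept; tokens are bucketed by hash, so distinct
    # tokens can collide and counts are not reversible to words.
    v = [0] * num_features
    for w in doc:
        v[hash(w) % num_features] += 1
    return v

docs = [["a", "b", "a"], ["b", "c"]]
vocab, vecs = count_vectorize(docs)
assert vocab == ["a", "b", "c"]
assert vecs == [[2, 1, 0], [0, 1, 1]]
```

The hashing variant avoids a vocabulary-building pass but cannot map counts back to words, which is the main functional gap a CountVectorizer would fill.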
[jira] [Updated] (SPARK-8693) profiles and goals are not printed in a nice way
[ https://issues.apache.org/jira/browse/SPARK-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-8693: -- Assignee: Brennon York profiles and goals are not printed in a nice way Key: SPARK-8693 URL: https://issues.apache.org/jira/browse/SPARK-8693 Project: Spark Issue Type: Sub-task Components: Build, Project Infra Affects Versions: 1.5.0 Reporter: Yin Huai Assignee: Brennon York Priority: Minor Fix For: 1.5.0 In our master build, I see {code} -Phadoop-1[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Dhadoop.version=1.0.4[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Pkinesis-asl[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Phive-thriftserver[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Phive[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: package[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: assembly/assembly[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: streaming-kafka-assembly/assembly {code} Seems we format the string in a wrong way?
[jira] [Updated] (SPARK-8686) DataFrame should support `where` with expression represented by String
[ https://issues.apache.org/jira/browse/SPARK-8686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8686: - Assignee: Kousuke Saruta DataFrame should support `where` with expression represented by String -- Key: SPARK-8686 URL: https://issues.apache.org/jira/browse/SPARK-8686 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Priority: Minor Fix For: 1.5.0 DataFrame supports the `filter` function with two argument types, `Column` and `String`, but `where` doesn't.
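The requested change is essentially an overload of `where` that forwards to the existing `filter`. A toy Python sketch of that delegation (the `MiniFrame` class and its predicate handling are hypothetical, not Spark's DataFrame):

```python
# Toy frame showing `where` forwarding to a `filter` that already
# accepts both a predicate ("Column"-like) and a string expression.

class MiniFrame:
    def __init__(self, rows):
        self.rows = rows

    def filter(self, condition):
        # Accepts either a callable predicate or a string expression.
        if isinstance(condition, str):
            # Evaluate the string against each row's fields (toy resolver).
            pred = lambda row: eval(condition, {}, row)
        else:
            pred = condition
        return MiniFrame([r for r in self.rows if pred(r)])

    def where(self, condition):
        # The fix: `where` simply delegates, so it supports both types too.
        return self.filter(condition)

df = MiniFrame([{"age": 28}, {"age": 31}])
assert df.where("age > 30").rows == [{"age": 31}]
assert df.filter(lambda r: r["age"] > 30).rows == [{"age": 31}]
```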
[jira] [Updated] (SPARK-8554) Add the SparkR document files to `.rat-excludes` for `./dev/check-license`
[ https://issues.apache.org/jira/browse/SPARK-8554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-8554: -- Assignee: Yu Ishikawa Add the SparkR document files to `.rat-excludes` for `./dev/check-license` -- Key: SPARK-8554 URL: https://issues.apache.org/jira/browse/SPARK-8554 Project: Spark Issue Type: Bug Components: SparkR, Tests Reporter: Yu Ishikawa Assignee: Yu Ishikawa Fix For: 1.5.0 {noformat} ./dev/check-license | grep -v boto Could not find Apache license headers in the following files: !? /Users/01004981/local/src/spark/myspark/R/lib/SparkR/INDEX !? /Users/01004981/local/src/spark/myspark/R/lib/SparkR/help/AnIndex !? /Users/01004981/local/src/spark/myspark/R/lib/SparkR/html/00Index.html !? /Users/01004981/local/src/spark/myspark/R/lib/SparkR/html/R.css !? /Users/01004981/local/src/spark/myspark/R/pkg/man/DataFrame.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/GroupedData.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/agg.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/arrange.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/cache-methods.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/cacheTable.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/cancelJobGroup.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/clearCache.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/clearJobGroup.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/collect-methods.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/column.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/columns.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/count.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/createDataFrame.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/createExternalTable.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/describe.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/distinct.Rd !? 
/Users/01004981/local/src/spark/myspark/R/pkg/man/dropTempTable.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/dtypes.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/except.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/explain.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/filter.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/first.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/groupBy.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/hashCode.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/head.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/infer_type.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/insertInto.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/intersect.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/isLocal.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/join.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/jsonFile.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/limit.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/nafunctions.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/parquetFile.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/persist.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/print.jobj.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/print.structField.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/print.structType.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/printSchema.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/read.df.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/registerTempTable.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/repartition.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/sample.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/saveAsParquetFile.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/saveAsTable.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/schema.Rd !? 
/Users/01004981/local/src/spark/myspark/R/pkg/man/select.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/selectExpr.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/setJobGroup.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/show.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/showDF.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/sparkR.init.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/sparkR.stop.Rd !?
[jira] [Resolved] (SPARK-8554) Add the SparkR document files to `.rat-excludes` for `./dev/check-license`
[ https://issues.apache.org/jira/browse/SPARK-8554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-8554. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6947 [https://github.com/apache/spark/pull/6947] Add the SparkR document files to `.rat-excludes` for `./dev/check-license` -- Key: SPARK-8554 URL: https://issues.apache.org/jira/browse/SPARK-8554 Project: Spark Issue Type: Bug Components: SparkR, Tests Reporter: Yu Ishikawa Fix For: 1.5.0 {noformat} ./dev/check-license | grep -v boto Could not find Apache license headers in the following files: !? /Users/01004981/local/src/spark/myspark/R/lib/SparkR/INDEX !? /Users/01004981/local/src/spark/myspark/R/lib/SparkR/help/AnIndex !? /Users/01004981/local/src/spark/myspark/R/lib/SparkR/html/00Index.html !? /Users/01004981/local/src/spark/myspark/R/lib/SparkR/html/R.css !? /Users/01004981/local/src/spark/myspark/R/pkg/man/DataFrame.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/GroupedData.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/agg.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/arrange.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/cache-methods.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/cacheTable.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/cancelJobGroup.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/clearCache.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/clearJobGroup.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/collect-methods.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/column.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/columns.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/count.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/createDataFrame.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/createExternalTable.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/describe.Rd !? 
/Users/01004981/local/src/spark/myspark/R/pkg/man/distinct.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/dropTempTable.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/dtypes.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/except.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/explain.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/filter.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/first.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/groupBy.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/hashCode.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/head.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/infer_type.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/insertInto.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/intersect.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/isLocal.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/join.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/jsonFile.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/limit.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/nafunctions.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/parquetFile.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/persist.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/print.jobj.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/print.structField.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/print.structType.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/printSchema.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/read.df.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/registerTempTable.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/repartition.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/sample.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/saveAsParquetFile.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/saveAsTable.Rd !? 
/Users/01004981/local/src/spark/myspark/R/pkg/man/schema.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/select.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/selectExpr.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/setJobGroup.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/show.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/showDF.Rd !? /Users/01004981/local/src/spark/myspark/R/pkg/man/sparkR.init.Rd !?
[jira] [Assigned] (SPARK-8705) Javascript error in the web console when `totalExecutionTime` of a task is 0
[ https://issues.apache.org/jira/browse/SPARK-8705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8705: --- Assignee: (was: Apache Spark) Javascript error in the web console when `totalExecutionTime` of a task is 0 Key: SPARK-8705 URL: https://issues.apache.org/jira/browse/SPARK-8705 Project: Spark Issue Type: Bug Components: Web UI Reporter: Shixiong Zhu Because System.currentTimeMillis() is not accurate for tasks that only need several milliseconds, sometimes totalExecutionTime in makeTimeline will be 0. If totalExecutionTime is 0, there will be the following error in the console. !https://cloud.githubusercontent.com/assets/1000778/8406776/5cd38e04-1e92-11e5-89f2-0c5134fe4b6b.png!
[jira] [Commented] (SPARK-8705) Javascript error in the web console when `totalExecutionTime` of a task is 0
[ https://issues.apache.org/jira/browse/SPARK-8705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605870#comment-14605870 ] Apache Spark commented on SPARK-8705: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/7088 Javascript error in the web console when `totalExecutionTime` of a task is 0 Key: SPARK-8705 URL: https://issues.apache.org/jira/browse/SPARK-8705 Project: Spark Issue Type: Bug Components: Web UI Reporter: Shixiong Zhu Because System.currentTimeMillis() is not accurate for tasks that only need several milliseconds, sometimes totalExecutionTime in makeTimeline will be 0. If totalExecutionTime is 0, there will be the following error in the console. !https://cloud.githubusercontent.com/assets/1000778/8406776/5cd38e04-1e92-11e5-89f2-0c5134fe4b6b.png!
[jira] [Updated] (SPARK-8693) profiles and goals are not printed in a nice way
[ https://issues.apache.org/jira/browse/SPARK-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-8693: -- Affects Version/s: 1.5.0 profiles and goals are not printed in a nice way Key: SPARK-8693 URL: https://issues.apache.org/jira/browse/SPARK-8693 Project: Spark Issue Type: Sub-task Components: Build, Project Infra Affects Versions: 1.5.0 Reporter: Yin Huai Priority: Minor Fix For: 1.5.0 In our master build, I see {code} -Phadoop-1[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Dhadoop.version=1.0.4[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Pkinesis-asl[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Phive-thriftserver[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Phive[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: package[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: assembly/assembly[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: streaming-kafka-assembly/assembly {code} Seems we format the string in a wrong way?
[jira] [Resolved] (SPARK-8693) profiles and goals are not printed in a nice way
[ https://issues.apache.org/jira/browse/SPARK-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-8693. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7085 [https://github.com/apache/spark/pull/7085] profiles and goals are not printed in a nice way Key: SPARK-8693 URL: https://issues.apache.org/jira/browse/SPARK-8693 Project: Spark Issue Type: Sub-task Components: Build, Project Infra Reporter: Yin Huai Priority: Minor Fix For: 1.5.0 In our master build, I see {code} -Phadoop-1[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Dhadoop.version=1.0.4[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Pkinesis-asl[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Phive-thriftserver[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: -Phive[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: package[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: assembly/assembly[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: streaming-kafka-assembly/assembly {code} Seems we format the string in a wrong way?
[jira] [Commented] (SPARK-8670) Nested columns can't be referenced (but they can be selected)
[ https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605822#comment-14605822 ] Nicholas Chammas commented on SPARK-8670: - Not sure. Does Scala offer the same flexibility in syntax as Python? Nested columns can't be referenced (but they can be selected) - Key: SPARK-8670 URL: https://issues.apache.org/jira/browse/SPARK-8670 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.4.0 Reporter: Nicholas Chammas This is strange and looks like a regression from 1.3.
{code}
import json

daterz = [
    {'name': 'Nick', 'stats': {'age': 28}},
    {'name': 'George', 'stats': {'age': 31}}
]

df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))

df.select('stats.age').show()
df['stats.age']  # 1.4 fails on this line
{code}
On 1.3 this works and yields:
{code}
age
28
31

Out[1]: Column<stats.age AS age#2958L>
{code}
On 1.4, however, this gives an error on the last line:
{code}
+---+
|age|
+---+
| 28|
| 31|
+---+

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-04bd990e94c6> in <module>()
     19
     20 df.select('stats.age').show()
---> 21 df['stats.age']

/path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
    678         if isinstance(item, basestring):
    679             if item not in self.columns:
--> 680                 raise IndexError("no such column: %s" % item)
    681             jc = self._jdf.apply(item)
    682             return Column(jc)

IndexError: no such column: stats.age
{code}
This means, among other things, that you can't join DataFrames on nested columns.
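A plain-Python sketch of the mechanism behind the regression, assuming (as the traceback suggests) that 1.4's `__getitem__` validates the name against the list of top-level columns before handing it to the resolver; the `ToyFrame` class is hypothetical, not PySpark's code:

```python
# Why 'stats.age' can be selected but not referenced: the 1.4-style
# __getitem__ checks membership in the *top-level* column names, so a
# dotted reference never passes, even though the resolver handles it.

class ToyFrame:
    columns = ["name", "stats"]          # top-level columns only

    def resolve(self, ref):
        return "resolved:" + ref         # stands in for the JVM resolver

    def getitem_1_4(self, item):
        if item not in self.columns:     # rejects 'stats.age'
            raise IndexError("no such column: %s" % item)
        return self.resolve(item)

    def getitem_1_3(self, item):
        return self.resolve(item)        # defers to the resolver

df = ToyFrame()
assert df.getitem_1_3("stats.age") == "resolved:stats.age"
try:
    df.getitem_1_4("stats.age")
    raise AssertionError("expected IndexError")
except IndexError:
    pass
```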
[jira] [Assigned] (SPARK-8705) Javascript error in the web console when `totalExecutionTime` of a task is 0
[ https://issues.apache.org/jira/browse/SPARK-8705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8705: --- Assignee: Apache Spark Javascript error in the web console when `totalExecutionTime` of a task is 0 Key: SPARK-8705 URL: https://issues.apache.org/jira/browse/SPARK-8705 Project: Spark Issue Type: Bug Components: Web UI Reporter: Shixiong Zhu Assignee: Apache Spark Because System.currentTimeMillis() is not accurate for tasks that only need several milliseconds, sometimes totalExecutionTime in makeTimeline will be 0. If totalExecutionTime is 0, there will be the following error in the console. !https://cloud.githubusercontent.com/assets/1000778/8406776/5cd38e04-1e92-11e5-89f2-0c5134fe4b6b.png!
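The usual guard for this zero-duration case is to clamp the measured time to a minimum before using it as a divisor. A Python sketch of the pattern (the function name is illustrative; the real fix would live in Spark's timeline JavaScript):

```python
# Clamp a possibly-zero measured duration before dividing by it.
# System.currentTimeMillis() can report 0 ms for very short tasks,
# which would otherwise divide by zero when computing proportions.

def scheduler_delay_proportion(scheduler_delay_ms, total_execution_ms):
    total = max(total_execution_ms, 1)   # treat 0 ms as at least 1 ms
    return scheduler_delay_ms / float(total)

assert scheduler_delay_proportion(0, 0) == 0.0
assert scheduler_delay_proportion(1, 0) == 1.0
assert scheduler_delay_proportion(5, 10) == 0.5
```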
[jira] [Updated] (SPARK-8707) RDD#toDebugString fails if any cached RDD has invalid partitions
[ https://issues.apache.org/jira/browse/SPARK-8707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-8707: -- Summary: RDD#toDebugString fails if any cached RDD has invalid partitions (was: RDD#toDebugString fails if any cached RDD is invalid) RDD#toDebugString fails if any cached RDD has invalid partitions Key: SPARK-8707 URL: https://issues.apache.org/jira/browse/SPARK-8707 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0, 1.4.1 Reporter: Aaron Davidson Labels: starter Repro: {code} sc.parallelize(0 until 100).toDebugString sc.textFile(/ThisFileDoesNotExist).cache() sc.parallelize(0 until 100).toDebugString {code} Output: {code} java.io.IOException: Not a file: /ThisFileDoesNotExist at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:215) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:59) at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455) at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at 
scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.SparkContext.getRDDStorageInfo(SparkContext.scala:1455) at org.apache.spark.rdd.RDD.debugSelf$1(RDD.scala:1573) at org.apache.spark.rdd.RDD.firstDebugString$1(RDD.scala:1607) at org.apache.spark.rdd.RDD.toDebugString(RDD.scala:1637 {code} This is because toDebugString gets all the partitions from all RDDs, which fails (via SparkContext#getRDDStorageInfo). This pathway should definitely be resilient to other RDDs being invalid (and getRDDStorageInfo should probably also be). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
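The resilience asked for above can be illustrated with a small, self-contained sketch. This is plain Python, not the actual Spark internals; `FakeRDD` and `rdd_storage_info` are illustrative stand-ins for cached RDDs and `SparkContext#getRDDStorageInfo` — the point is only that enumerating storage info should tolerate an RDD whose partition computation throws, instead of letting one bad RDD break `toDebugString` for every other RDD:

```python
class FakeRDD:
    """Stand-in for an RDD; get_partitions may raise, like HadoopRDD.getPartitions."""
    def __init__(self, name, partitions=None, error=None):
        self.name = name
        self._partitions = partitions
        self._error = error

    def get_partitions(self):
        if self._error is not None:
            raise IOError(self._error)
        return self._partitions

def rdd_storage_info(rdds):
    """Collect (name, num_partitions) per RDD, skipping invalid ones."""
    info = []
    for rdd in rdds:
        try:
            info.append((rdd.name, len(rdd.get_partitions())))
        except IOError:
            # An invalid RDD (e.g. a cached textFile over a missing path) is
            # skipped rather than aborting the whole listing.
            continue
    return info

rdds = [
    FakeRDD("good", partitions=[0, 1, 2]),
    FakeRDD("bad", error="Not a file: /ThisFileDoesNotExist"),
]
print(rdd_storage_info(rdds))  # [('good', 3)]
```

With this shape, the invalid cached RDD simply drops out of the storage listing, so unrelated `toDebugString` calls keep working.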
[jira] [Updated] (SPARK-8707) RDD#toDebugString fails if any cached RDD has invalid partitions
[ https://issues.apache.org/jira/browse/SPARK-8707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-8707: -- Description: Repro: {code} sc.textFile(/ThisFileDoesNotExist).cache() sc.parallelize(0 until 100).toDebugString {code} Output: {code} java.io.IOException: Not a file: /ThisFileDoesNotExist at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:215) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:59) at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455) at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.SparkContext.getRDDStorageInfo(SparkContext.scala:1455) at org.apache.spark.rdd.RDD.debugSelf$1(RDD.scala:1573) at org.apache.spark.rdd.RDD.firstDebugString$1(RDD.scala:1607) at 
org.apache.spark.rdd.RDD.toDebugString(RDD.scala:1637 {code} This is because toDebugString gets all the partitions from all RDDs, which fails (via SparkContext#getRDDStorageInfo). This pathway should definitely be resilient to other RDDs being invalid (and getRDDStorageInfo should probably also be). was: Repro: {code} sc.parallelize(0 until 100).toDebugString sc.textFile(/ThisFileDoesNotExist).cache() sc.parallelize(0 until 100).toDebugString {code} Output: {code} java.io.IOException: Not a file: /ThisFileDoesNotExist at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:215) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:59) at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455) at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at 
org.apache.spark.SparkContext.getRDDStorageInfo(SparkContext.scala:1455) at org.apache.spark.rdd.RDD.debugSelf$1(RDD.scala:1573) at org.apache.spark.rdd.RDD.firstDebugString$1(RDD.scala:1607) at org.apache.spark.rdd.RDD.toDebugString(RDD.scala:1637 {code} This is because toDebugString gets all the partitions from all RDDs, which fails (via SparkContext#getRDDStorageInfo). This pathway should definitely be resilient to other RDDs being invalid (and getRDDStorageInfo should probably also be). RDD#toDebugString fails if any cached RDD has invalid partitions
[jira] [Created] (SPARK-8707) RDD#toDebugString fails if any cached RDD is invalid
Aaron Davidson created SPARK-8707: - Summary: RDD#toDebugString fails if any cached RDD is invalid Key: SPARK-8707 URL: https://issues.apache.org/jira/browse/SPARK-8707 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0, 1.4.1 Reporter: Aaron Davidson Repro:
{code}
sc.parallelize(0 until 100).toDebugString
sc.textFile("/ThisFileDoesNotExist").cache()
sc.parallelize(0 until 100).toDebugString
{code}
Output:
{code}
java.io.IOException: Not a file: /ThisFileDoesNotExist
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:215)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
  at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:59)
  at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
  at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
  at org.apache.spark.SparkContext.getRDDStorageInfo(SparkContext.scala:1455)
  at org.apache.spark.rdd.RDD.debugSelf$1(RDD.scala:1573)
  at org.apache.spark.rdd.RDD.firstDebugString$1(RDD.scala:1607)
  at org.apache.spark.rdd.RDD.toDebugString(RDD.scala:1637)
{code}
This is because toDebugString gets all the partitions from all RDDs, which fails (via SparkContext#getRDDStorageInfo). This pathway should definitely be resilient to other RDDs being invalid (and getRDDStorageInfo should probably also be). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8708) MatrixFactorizationModel.predictAll() populates single partition only
Antony Mayi created SPARK-8708: -- Summary: MatrixFactorizationModel.predictAll() populates single partition only Key: SPARK-8708 URL: https://issues.apache.org/jira/browse/SPARK-8708 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Antony Mayi When using mllib.recommendation.ALS, the RDD returned by .predictAll() has all values pushed into a single partition despite using quite high parallelism. This degrades the performance of further processing. (I can obviously run .partitionBy() to balance it, but that's still too costly, e.g. if running .predictAll() in a loop for thousands of products; it should rather be possible to do this somehow on the model, automatically.) Below is an example on a tiny sample (same on a large dataset):
{code:title=pyspark}
r1 = (1, 1, 1.0)
r2 = (1, 2, 2.0)
r3 = (2, 1, 2.0)
r4 = (2, 2, 2.0)
r5 = (3, 1, 1.0)
ratings = sc.parallelize([r1, r2, r3, r4, r5], 5)
ratings.getNumPartitions()
5
users = ratings.map(itemgetter(0)).distinct()
model = ALS.trainImplicit(ratings, 1, seed=10)
predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2)))
predictions_for_2.glom().map(len).collect()
[0, 0, 3, 0, 0]
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
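The .partitionBy() workaround mentioned in the report can be sketched without a cluster. This is a hedged, plain-Python stand-in for what hash partitioning does to the skewed `[0, 0, 3, 0, 0]` output (the `partition_by` helper is illustrative, not the pyspark API):

```python
def partition_by(records, num_partitions):
    """Distribute keyed records across partitions by hash of the key,
    which is essentially what RDD.partitionBy(n) does on a real cluster."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

# All predictions land in one partition, as in the report: [0, 0, 3, 0, 0].
skewed = [[], [], [((1, 2), 0.9), ((2, 2), 0.8), ((3, 2), 0.1)], [], []]

# Flatten and re-distribute by (user, product) key.
flat = [rec for part in skewed for rec in part]
balanced = partition_by(flat, 5)
print([len(p) for p in balanced])
```

The records survive intact; only their placement changes — which is also why doing this per .predictAll() call in a loop is costly, as each rebalance is a full shuffle.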
[jira] [Commented] (SPARK-8372) History server shows incorrect information for application not started
[ https://issues.apache.org/jira/browse/SPARK-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605996#comment-14605996 ] Andrew Or commented on SPARK-8372: -- OK, per the discussion on #6827 I reverted this. History server shows incorrect information for application not started -- Key: SPARK-8372 URL: https://issues.apache.org/jira/browse/SPARK-8372 Project: Spark Issue Type: Bug Components: Deploy, Web UI Affects Versions: 1.4.0 Reporter: Carson Wang Assignee: Marcelo Vanzin Priority: Minor Attachments: IncorrectAppInfo.png The history server may show an incorrect App ID for an incomplete application, like App ID.inprogress. This app info will never disappear, even after the app is completed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-8372) History server shows incorrect information for application not started
[ https://issues.apache.org/jira/browse/SPARK-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reopened SPARK-8372: -- Assignee: Marcelo Vanzin (was: Carson Wang) History server shows incorrect information for application not started -- Key: SPARK-8372 URL: https://issues.apache.org/jira/browse/SPARK-8372 Project: Spark Issue Type: Bug Components: Deploy, Web UI Affects Versions: 1.4.0 Reporter: Carson Wang Assignee: Marcelo Vanzin Priority: Minor Attachments: IncorrectAppInfo.png The history server may show an incorrect App ID for an incomplete application like App ID.inprogress. This app info will never disappear even after the app is completed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8705) Javascript error in the web console when `totalExecutionTime` of a task is 0
[ https://issues.apache.org/jira/browse/SPARK-8705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8705: - Affects Version/s: 1.4.0 Javascript error in the web console when `totalExecutionTime` of a task is 0 Key: SPARK-8705 URL: https://issues.apache.org/jira/browse/SPARK-8705 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.4.0 Reporter: Shixiong Zhu Because System.currentTimeMillis() is not accurate for tasks that take only a few milliseconds, sometimes totalExecutionTime in makeTimeline will be 0. If totalExecutionTime is 0, there will be the following error in the console. !https://cloud.githubusercontent.com/assets/1000778/8406776/5cd38e04-1e92-11e5-89f2-0c5134fe4b6b.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
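The shape of the fix can be sketched as a clamp on the duration. The real makeTimeline code is JavaScript in the Spark UI; this is a hedged Python illustration (function and parameter names are made up) of the idea that a task finishing within the clock's resolution must not produce a zero divisor when computing the timeline's percentage widths:

```python
def timeline_proportions(scheduler_delay_ms, run_time_ms, total_ms):
    """Percentage widths for two timeline segments of a task.

    total_ms can come out as 0 for sub-millisecond tasks because
    System.currentTimeMillis() has ~1 ms resolution; clamping it to at
    least 1 avoids the division by zero.
    """
    total = max(total_ms, 1)
    return (100.0 * scheduler_delay_ms / total,
            100.0 * run_time_ms / total)

print(timeline_proportions(0, 0, 0))  # (0.0, 0.0) -- no crash for a 0 ms task
```

An unclamped version would raise ZeroDivisionError on the same input, which is the Python analogue of the JavaScript console error in the screenshot.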
[jira] [Updated] (SPARK-8705) Javascript error in the web console when `totalExecutionTime` of a task is 0
[ https://issues.apache.org/jira/browse/SPARK-8705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8705: - Target Version/s: 1.5.0, 1.4.2 Javascript error in the web console when `totalExecutionTime` of a task is 0 Key: SPARK-8705 URL: https://issues.apache.org/jira/browse/SPARK-8705 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.4.0 Reporter: Shixiong Zhu Because System.currentTimeMillis() is not accurate for tasks that take only a few milliseconds, sometimes totalExecutionTime in makeTimeline will be 0. If totalExecutionTime is 0, there will be the following error in the console. !https://cloud.githubusercontent.com/assets/1000778/8406776/5cd38e04-1e92-11e5-89f2-0c5134fe4b6b.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8622) Spark 1.3.1 and 1.4.0 doesn't put executor working directory on executor classpath
[ https://issues.apache.org/jira/browse/SPARK-8622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605921#comment-14605921 ] Baswaraj commented on SPARK-8622: - That's what I mean. Jars specified by --jars are not put on the classpath, but are in the working directory of the executor. I am expecting either the jars to be on the classpath or the working directory to be on the classpath. In 1.3.0, the working directory is on the classpath. In 1.3.1+, neither the jars nor the working directory is on the classpath. Spark 1.3.1 and 1.4.0 doesn't put executor working directory on executor classpath -- Key: SPARK-8622 URL: https://issues.apache.org/jira/browse/SPARK-8622 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.3.1, 1.4.0 Reporter: Baswaraj I ran into an issue where the executor is not able to pick up my configs/functions from my custom jar in standalone (client/cluster) deploy mode. I have used the spark-submit --jars option to specify all the jars and configs to be used by executors. All these files are placed in the working directory of the executor, but not on the executor classpath. Also, the executor working directory is not on the executor classpath. I am expecting the executor to find all files specified in the spark-submit --jars option. In Spark 1.3.0 the executor working directory is on the executor classpath, so the app runs successfully. To successfully run my application with Spark 1.3.1+, I have to set the following option (conf/spark-defaults.conf): spark.executor.extraClassPath . Please advise. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3692) RBF Kernel implementation to SVM
[ https://issues.apache.org/jira/browse/SPARK-3692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605930#comment-14605930 ] Seth Hendrickson commented on SPARK-3692: - It looks like this JIRA will be taken care of by [SPARK-4638|https://issues.apache.org/jira/browse/SPARK-4638]. I suspect this should be closed as SPARK-4638 contains significant work in progress. RBF Kernel implementation to SVM Key: SPARK-3692 URL: https://issues.apache.org/jira/browse/SPARK-3692 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Ekrem Aksoy Priority: Minor Radial Basis Function is another type of kernel that can be used instead of linear kernel in SVM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8567) Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars
[ https://issues.apache.org/jira/browse/SPARK-8567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8567: - Fix Version/s: 1.4.1 Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars -- Key: SPARK-8567 URL: https://issues.apache.org/jira/browse/SPARK-8567 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 1.4.1 Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Labels: flaky-test Fix For: 1.4.1, 1.5.0 Seems tests in HiveSparkSubmitSuite fail with timeout pretty frequently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8410) Hive VersionsSuite RuntimeException
[ https://issues.apache.org/jira/browse/SPARK-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605977#comment-14605977 ] Burak Yavuz commented on SPARK-8410: Hi Joe, Is it possible to delete those files (~/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml) from the faulty servers? Maybe it would be better to have Spark delete it beforehand. That would however mean that the resolution phase will always take a while, because the whereabouts of the artifacts are never cached. Hive VersionsSuite RuntimeException --- Key: SPARK-8410 URL: https://issues.apache.org/jira/browse/SPARK-8410 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1, 1.4.0 Environment: IBM Power system - P7 running Ubuntu 14.04LE Reporter: Josiah Samuel Sathiadass Assignee: Burak Yavuz Priority: Minor While testing Spark Project Hive, there are RuntimeExceptions as follows, VersionsSuite: - success sanity check *** FAILED *** java.lang.RuntimeException: [download failed: org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed: org.codehaus.groovy#groovy-all;2.1.6!groovy-all.jar, download failed: asm#asm;3.2!asm.jar] at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978) at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62) at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62) at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:38) at org.apache.spark.sql.hive.client.IsolatedClientLoader$.org$apache$spark$sql$hive$client$IsolatedClientLoader$$downloadVersion(IsolatedClientLoader.scala:61) at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$1.apply(IsolatedClientLoader.scala:44) at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$1.apply(IsolatedClientLoader.scala:44) at 
scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189) at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91) at org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:44) ... The tests are executed with the following set of options: build/mvn --pl sql/hive --fail-never -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 test Adding the following dependencies to the spark/sql/hive/pom.xml file solves this issue:
<dependency>
  <groupId>org.jboss.netty</groupId>
  <artifactId>netty</artifactId>
  <version>3.2.2.Final</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.codehaus.groovy</groupId>
  <artifactId>groovy-all</artifactId>
  <version>2.1.6</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>asm</groupId>
  <artifactId>asm</artifactId>
  <version>3.2</version>
  <scope>test</scope>
</dependency>
The question is: is this the correct way to fix this RuntimeException? If yes, can a pull request fix this issue permanently? If not, suggestions please. Updates: The above-mentioned quick fix does not work with the latest 1.4 because of this pull request: [SPARK-8095] Resolve dependencies of --packages in local ivy cache #6788 https://github.com/apache/spark/pull/6788 Due to the above commit, the lookup directories during the testing phase have now changed as follows:
:: problems summary ::
WARNINGS
[NOT FOUND ] org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle) (2ms)
local-m2-cache: tried file:/home/joe/sparkibmsoe/spark/sql/hive/dummy/.m2/repository/org/jboss/netty/netty/3.2.2.Final/netty-3.2.2.Final.jar
[NOT FOUND ] org.codehaus.groovy#groovy-all;2.1.6!groovy-all.jar (0ms)
local-m2-cache: tried file:/home/joe/sparkibmsoe/spark/sql/hive/dummy/.m2/repository/org/codehaus/groovy/groovy-all/2.1.6/groovy-all-2.1.6.jar
[NOT FOUND ] asm#asm;3.2!asm.jar (0ms)
local-m2-cache: tried file:/home/joe/sparkibmsoe/spark/sql/hive/dummy/.m2/repository/asm/asm/3.2/asm-3.2.jar
:: :: FAILED DOWNLOADS :: :: ^ see resolution messages for details ^ :: :: ::
org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle) :: org.codehaus.groovy#groovy-all;2.1.6!groovy-all.jar :: asm#asm;3.2!asm.jar :: -- This message was sent by Atlassian
[jira] [Updated] (SPARK-8372) History server shows incorrect information for application not started
[ https://issues.apache.org/jira/browse/SPARK-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8372: - Fix Version/s: (was: 1.4.1) (was: 1.5.0) History server shows incorrect information for application not started -- Key: SPARK-8372 URL: https://issues.apache.org/jira/browse/SPARK-8372 Project: Spark Issue Type: Bug Components: Deploy, Web UI Affects Versions: 1.4.0 Reporter: Carson Wang Assignee: Marcelo Vanzin Priority: Minor Attachments: IncorrectAppInfo.png The history server may show an incorrect App ID for an incomplete application like App ID.inprogress. This app info will never disappear even after the app is completed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8567) Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars
[ https://issues.apache.org/jira/browse/SPARK-8567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8567. Resolution: Fixed Fix Version/s: 1.5.0 Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars -- Key: SPARK-8567 URL: https://issues.apache.org/jira/browse/SPARK-8567 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 1.4.1 Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Labels: flaky-test Fix For: 1.5.0 Seems tests in HiveSparkSubmitSuite fail with timeout pretty frequently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8567) Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars
[ https://issues.apache.org/jira/browse/SPARK-8567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8567: - Target Version/s: 1.4.1, 1.5.0 (was: 1.5.0) Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars -- Key: SPARK-8567 URL: https://issues.apache.org/jira/browse/SPARK-8567 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 1.4.1 Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Labels: flaky-test Fix For: 1.4.1, 1.5.0 Seems tests in HiveSparkSubmitSuite fail with timeout pretty frequently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8475) SparkSubmit with Ivy jars is very slow to load with no internet access
[ https://issues.apache.org/jira/browse/SPARK-8475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605968#comment-14605968 ] Burak Yavuz commented on SPARK-8475: ping. I think you can go ahead with a PR for option 1. If you're too busy, I can submit one! SparkSubmit with Ivy jars is very slow to load with no internet access -- Key: SPARK-8475 URL: https://issues.apache.org/jira/browse/SPARK-8475 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.4.0 Reporter: Nathan McCarthy Priority: Minor Spark Submit adds Maven Central and the Spark bintray repo to the ChainResolver before it adds any external resolvers. https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L821 When running on a cluster without internet access, this means the spark shell takes forever to launch as it tries these two remote repos before the ones specified in the --repositories list. In our case we have a proxy which the cluster can access, and we supply it via --repositories. This is also a problem for users who maintain a proxy for maven/ivy repos with something like Nexus/Artifactory. Having a repo proxy is popular at many organisations, so I'd say this would be a useful change for these users as well. In the current state, even if a maven central proxy is supplied, it will still try and hit central. I see two options for a fix: * Change the order repos are added to the ChainResolver, making the --repositories supplied repos come before anything else. https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L843 * Have a config option (like spark.jars.ivy.useDefaultRemoteRepos, default true) which when false won't add Maven Central/bintray to the ChainResolver. Happy to do a PR for this fix.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
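Option 1 above (reorder the resolver chain) can be sketched in a few lines. This is a hedged illustration in plain Python, not the actual SparkSubmit/Ivy API; `DEFAULT_REPOS`, `build_resolver_chain`, and the internal proxy URL are all made-up names:

```python
# Default remote repos that SparkSubmit consults; with no internet access,
# trying these first is what makes launch slow.
DEFAULT_REPOS = ["https://repo1.maven.org/maven2/",
                 "https://dl.bintray.com/spark-packages/maven/"]

def build_resolver_chain(user_repos):
    """User-supplied --repositories first; defaults only as a fallback."""
    return list(user_repos) + DEFAULT_REPOS

chain = build_resolver_chain(["https://nexus.internal.example/maven/"])
print(chain[0])  # the internal proxy is consulted first
```

With the order flipped this way, an artifact available from the internal proxy resolves immediately, and the unreachable default repos are only attempted for artifacts the proxy cannot serve.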
[jira] [Commented] (SPARK-701) Wrong SPARK_MEM setting with different EC2 master and worker machine types
[ https://issues.apache.org/jira/browse/SPARK-701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606622#comment-14606622 ] Shivaram Venkataraman commented on SPARK-701: - Yeah, so SPARK_MEM used to be used for both the master and the executors before. Right now we have two separate variables, spark.executor.memory and spark.driver.memory, that we can set. Let's open a new issue for this. Wrong SPARK_MEM setting with different EC2 master and worker machine types -- Key: SPARK-701 URL: https://issues.apache.org/jira/browse/SPARK-701 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 0.7.0 Reporter: Josh Rosen Assignee: Shivaram Venkataraman Fix For: 0.7.0 When launching a spark-ec2 cluster using different worker and master machine types, SPARK_MEM in spark-env.sh is set based on the master's memory instead of the worker's. This causes jobs to hang if the master has more memory than the workers (because jobs will request too much memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8588) Could not use concat with UDF in where clause
[ https://issues.apache.org/jira/browse/SPARK-8588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606650#comment-14606650 ] Apache Spark commented on SPARK-8588: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/7103 Could not use concat with UDF in where clause - Key: SPARK-8588 URL: https://issues.apache.org/jira/browse/SPARK-8588 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: Centos 7, java 1.7.0_67, scala 2.10.5, run in a spark standalone cluster(or local). Reporter: StanZhai Assignee: Wenchen Fan Priority: Critical After upgraded the cluster from spark 1.3.1 to 1.4.0(rc4), I encountered the following exception when use concat with UDF in where clause: {code} org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'concat(HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(date#1776),年) at org.apache.spark.sql.catalyst.analysis.UnresolvedFunction.dataType(unresolved.scala:82) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299) at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80) at scala.collection.immutable.List.exists(List.scala:84) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:299) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:75) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:85) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:94) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:64) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:136) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:135) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at
[jira] [Commented] (SPARK-8716) Remove executor shared cache feature
[ https://issues.apache.org/jira/browse/SPARK-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606649#comment-14606649 ] Marcelo Vanzin commented on SPARK-8716: --- bq. AFAIK this feature doesn't work under YARN or Mesos. I haven't checked recently but I believe it works on YARN. YARN behaves similarly in that there is a shared app dir (or dirs depending on YARN's config). But off the top of my head I don't remember whether Spark points at the app dir or the container dir for its own temp files. Remove executor shared cache feature Key: SPARK-8716 URL: https://issues.apache.org/jira/browse/SPARK-8716 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Josh Rosen Priority: Minor More specifically, this is the feature that is currently flagged by `spark.files.useFetchCache`. There are several reasons why we should remove it. (1) It doesn't even work. Recently, each executor gets its own unique temp directory for security reasons. (2) There is no way to fix it. The constraints in (1) are fundamentally opposed to sharing resources across executors. (3) It is very complex. The method Utils.fetchFile would be greatly simplified without this feature that is not even used. (4) There are no tests for it and it is difficult to test. Note that we can't just revert the respective patches because they were merged a long time ago. Related issues: SPARK-8130, SPARK-6313, SPARK-2713 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8716) Remove executor shared cache feature
[ https://issues.apache.org/jira/browse/SPARK-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8716: - Priority: Major (was: Minor) Remove executor shared cache feature Key: SPARK-8716 URL: https://issues.apache.org/jira/browse/SPARK-8716 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Josh Rosen More specifically, this is the feature that is currently flagged by `spark.files.useFetchCache`. There are several reasons why we should remove it. (1) It doesn't even work. Recently, each executor gets its own unique temp directory for security reasons. (2) There is no way to fix it. The constraints in (1) are fundamentally opposed to sharing resources across executors. (3) It is very complex. The method Utils.fetchFile would be greatly simplified without this feature that is not even used. (4) There are no tests for it and it is difficult to test. Note that we can't just revert the respective patches because they were merged a long time ago. Related issues: SPARK-8130, SPARK-6313, SPARK-2713 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8669) Parquet 1.7 files that store binary enums crash when inferring schema
[ https://issues.apache.org/jira/browse/SPARK-8669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8669: -- Target Version/s: 1.5.0 Parquet 1.7 files that store binary enums crash when inferring schema - Key: SPARK-8669 URL: https://issues.apache.org/jira/browse/SPARK-8669 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Steven She Assignee: Steven She Loading a Parquet 1.7 file that contains a binary ENUM field in Spark 1.5-SNAPSHOT crashes with the following exception: {noformat} org.apache.spark.sql.AnalysisException: Illegal Parquet type: BINARY (ENUM); at org.apache.spark.sql.parquet.CatalystSchemaConverter.illegalType$1(CatalystSchemaConverter.scala:129) at org.apache.spark.sql.parquet.CatalystSchemaConverter.convertPrimitiveField(CatalystSchemaConverter.scala:184) at org.apache.spark.sql.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:114) ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
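The crash comes from the schema converter treating BINARY (ENUM) as an illegal combination. One plausible fix is to map enum-annotated binaries to strings, the same way UTF8-annotated binaries are handled. A Python sketch of that decision (function and type names here are illustrative, not Spark's actual CatalystSchemaConverter API):

```python
# Hypothetical sketch of the convertPrimitiveField decision for Parquet
# primitive types; names are illustrative, not Spark's real API.
def convert_primitive(primitive_type, original_type):
    if primitive_type == "BINARY":
        # Parquet writes enums as BINARY (ENUM); mapping them to a
        # string type avoids the AnalysisException quoted above.
        if original_type in ("UTF8", "ENUM"):
            return "StringType"
        return "BinaryType"
    if primitive_type == "INT32":
        return "IntegerType"
    if primitive_type == "INT64":
        return "LongType"
    raise ValueError(f"Illegal Parquet type: {primitive_type} ({original_type})")

print(convert_primitive("BINARY", "ENUM"))  # StringType rather than a crash
```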
[jira] [Closed] (SPARK-4069) [SPARK-YARN] ApplicationMaster should release all executors' containers before unregistering itself from Yarn RM
[ https://issues.apache.org/jira/browse/SPARK-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4069. Resolution: Won't Fix [SPARK-YARN] ApplicationMaster should release all executors' containers before unregistering itself from Yarn RM Key: SPARK-4069 URL: https://issues.apache.org/jira/browse/SPARK-4069 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Min Zhou Currently, the ApplicationMaster in YARN mode simply unregisters itself from the YARN master, a.k.a. the ResourceManager. It never releases the executors' containers before that. The ResourceManager will then decide to kill all the executors' containers when it faces such a scenario, so the ResourceManager log looks like the following: {noformat} 2014-10-22 23:39:09,903 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Processing event for appattempt_1414003182949_0004_01 of type UNREGISTERED 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1414003182949_0004_01 State change from RUNNING to FINAL_SAVING 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating application application_1414003182949_0004 with final state: FINISHING 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1414003182949_0004 State change from RUNNING to FINAL_SAVING 2014-10-22 23:39:09,903 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Processing event for appattempt_1414003182949_0004_01 of type ATTEMPT_UPDATE_SAVED 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1414003182949_0004 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1414003182949_0004_01 State change from FINAL_SAVING to FINISHING 2014-10-22 
23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1414003182949_0004 State change from FINAL_SAVING to FINISHING 2014-10-22 23:39:10,485 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Processing event for appattempt_1414003182949_0004_01 of type CONTAINER_FINISHED 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1414003182949_0004_01_01 Container Transitioned from RUNNING to COMPLETED 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1414003182949_0004_01 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: Completed container: container_1414003182949_0004_01_01 in state: COMPLETED event:FINISHED 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore: Finish information of container container_1414003182949_0004_01_01 is written 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1414003182949_0004_01 State change from FINISHING to FINISHED 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=akim OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1414003182949_0004 CONTAINERID=container_1414003182949_0004_01_01 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: Stored the finish data of container container_1414003182949_0004_01_01 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Released container container_1414003182949_0004_01_01 of capacity memory:3072, vCores:1 on host host1, which currently has 0 containers, memory:0, vCores:0 used and memory:241901, vCores:32 
available, release resources=true 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1414003182949_0004 State change from FINISHING to FINISHED 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore: Finish information of application attempt appattempt_1414003182949_0004_01 is written 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=akim OPERATION=Application Finished - Succeeded
[jira] [Commented] (SPARK-4069) [SPARK-YARN] ApplicationMaster should release all executors' containers before unregistering itself from Yarn RM
[ https://issues.apache.org/jira/browse/SPARK-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606701#comment-14606701 ] Andrew Or commented on SPARK-4069: -- Fixed in YARN-3415. [SPARK-YARN] ApplicationMaster should release all executors' containers before unregistering itself from Yarn RM Key: SPARK-4069 URL: https://issues.apache.org/jira/browse/SPARK-4069 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Min Zhou Currently, the ApplicationMaster in YARN mode simply unregisters itself from the YARN master, a.k.a. the ResourceManager. It never releases the executors' containers before that. The ResourceManager will then decide to kill all the executors' containers when it faces such a scenario, so the ResourceManager log looks like the following: {noformat} 2014-10-22 23:39:09,903 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Processing event for appattempt_1414003182949_0004_01 of type UNREGISTERED 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1414003182949_0004_01 State change from RUNNING to FINAL_SAVING 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating application application_1414003182949_0004 with final state: FINISHING 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1414003182949_0004 State change from RUNNING to FINAL_SAVING 2014-10-22 23:39:09,903 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Processing event for appattempt_1414003182949_0004_01 of type ATTEMPT_UPDATE_SAVED 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1414003182949_0004 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1414003182949_0004_01 State 
change from FINAL_SAVING to FINISHING 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1414003182949_0004 State change from FINAL_SAVING to FINISHING 2014-10-22 23:39:10,485 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Processing event for appattempt_1414003182949_0004_01 of type CONTAINER_FINISHED 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1414003182949_0004_01_01 Container Transitioned from RUNNING to COMPLETED 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1414003182949_0004_01 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: Completed container: container_1414003182949_0004_01_01 in state: COMPLETED event:FINISHED 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore: Finish information of container container_1414003182949_0004_01_01 is written 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1414003182949_0004_01 State change from FINISHING to FINISHED 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=akim OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1414003182949_0004 CONTAINERID=container_1414003182949_0004_01_01 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: Stored the finish data of container container_1414003182949_0004_01_01 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Released container container_1414003182949_0004_01_01 of capacity memory:3072, vCores:1 on host host1, which currently has 0 containers, 
memory:0, vCores:0 used and memory:241901, vCores:32 available, release resources=true 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1414003182949_0004 State change from FINISHING to FINISHED 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore: Finish information of application attempt appattempt_1414003182949_0004_01 is written 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=akim
[jira] [Closed] (SPARK-8634) Fix flaky test StreamingListenerSuite receiver info reporting
[ https://issues.apache.org/jira/browse/SPARK-8634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8634. Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 1.4.2 1.5.0 Target Version/s: 1.5.0, 1.4.2 Fix flaky test StreamingListenerSuite receiver info reporting --- Key: SPARK-8634 URL: https://issues.apache.org/jira/browse/SPARK-8634 Project: Spark Issue Type: Test Components: Streaming, Tests Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Critical Labels: flaky-test Fix For: 1.5.0, 1.4.2 As per the unit test log in https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35754/ {code} 15/06/24 23:09:10.210 Thread-3495 INFO ReceiverTracker: Starting 1 receivers 15/06/24 23:09:10.270 Thread-3495 INFO SparkContext: Starting job: apply at Transformer.scala:22 ... 15/06/24 23:09:14.259 ForkJoinPool-4-worker-29 INFO StreamingListenerSuiteReceiver: Started receiver and sleeping 15/06/24 23:09:14.270 ForkJoinPool-4-worker-29 INFO StreamingListenerSuiteReceiver: Reporting error and sleeping {code} it takes at least 4 seconds to receive all receiver events on this slow machine, but the `timeout` for `eventually` is only 2 seconds. We can increase the `timeout` to make this test stable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
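The fix is mechanical: raise the `eventually` timeout so slow machines have time to deliver every listener event. A minimal Python model of ScalaTest's `eventually` helper (illustrative only; the real suite is Scala):

```python
import time

def eventually(condition, timeout=2.0, interval=0.01):
    """Retry `condition` until it stops raising AssertionError or
    `timeout` seconds elapse, mirroring ScalaTest's eventually."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return condition()
        except AssertionError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)

# An event that arrives after ~0.2 s passes with a 2 s budget, but a
# machine needing 4 s would fail; the proposed fix grows the budget itself.
start = time.monotonic()

def receiver_event_arrived():
    assert time.monotonic() - start > 0.2
    return True

assert eventually(receiver_event_arrived, timeout=2.0)
```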
[jira] [Updated] (SPARK-8119) HeartbeatReceiver should not call sc.killExecutor
[ https://issues.apache.org/jira/browse/SPARK-8119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8119: - Summary: HeartbeatReceiver should not call sc.killExecutor (was: Spark will set total executor when some executors fail.) HeartbeatReceiver should not call sc.killExecutor - Key: SPARK-8119 URL: https://issues.apache.org/jira/browse/SPARK-8119 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.4.0 Reporter: SaintBacchus Dynamic allocation sets the executor total to a small number when it wants to kill some executors. But even without dynamic allocation, Spark will also set the executor total. This causes the following problem: when an executor fails, no new executor is ever brought up by Spark to replace it. === EDIT by andrewor14 === The issue is that the AM forgets about the original number of executors it wants after calling sc.killExecutor. Even if dynamic allocation is not enabled, this is still possible because of heartbeat timeouts. I think the problem is that sc.killExecutor is used incorrectly in HeartbeatReceiver. The intention of the method is to permanently adjust the number of executors the application will get. In HeartbeatReceiver, however, it is used as a best-effort mechanism to ensure that the timed-out executor is dead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
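Andrew's edit is easiest to see as a bookkeeping problem: `sc.killExecutor` permanently lowers the application's executor target, while HeartbeatReceiver only wants a best-effort kill that leaves the target alone. A toy model (the class and `replace` flag are hypothetical, not Spark's scheduler code):

```python
class ExecutorTarget:
    """Toy model of the AM's desired-executor bookkeeping."""
    def __init__(self, target):
        self.target = target

    def kill_executor(self, replace=False):
        # sc.killExecutor semantics: permanently shrink the target.
        # A heartbeat-timeout kill should instead preserve the target
        # so the application keeps asking YARN for the same number.
        if not replace:
            self.target -= 1

app = ExecutorTarget(target=100)
app.kill_executor()              # HeartbeatReceiver misusing killExecutor
print(app.target)                # 99: the AM now wants one fewer executor

app2 = ExecutorTarget(target=100)
app2.kill_executor(replace=True) # best-effort kill, target preserved
print(app2.target)               # 100
```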
[jira] [Commented] (SPARK-8374) Job frequently hangs after YARN preemption
[ https://issues.apache.org/jira/browse/SPARK-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605425#comment-14605425 ] Shay Rojansky commented on SPARK-8374: -- Thanks for your comment and sure, I can help test. I may need a bit of hand-holding since I haven't built Spark yet. Job frequently hangs after YARN preemption -- Key: SPARK-8374 URL: https://issues.apache.org/jira/browse/SPARK-8374 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.0 Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04 Reporter: Shay Rojansky Priority: Critical After upgrading to Spark 1.4.0, jobs that get preempted very frequently will not reacquire executors and will therefore hang. To reproduce: 1. I run Spark job A that acquires all grid resources 2. I run Spark job B in a higher-priority queue that acquires all grid resources. Job A is fully preempted. 3. Kill job B, releasing all resources 4. Job A should at this point reacquire all grid resources, but occasionally doesn't. Repeating the preemption scenario makes the bad behavior occur within a few attempts. (see logs at bottom). Note issue SPARK-7451, which was supposed to fix some Spark YARN preemption issues; the work there may be related to these new issues. The 1.4.0 preemption situation is considerably worse than 1.3.1 (we've downgraded to 1.3.1 just because of this issue). Logs -- When job B (the preemptor) first acquires an application master, the following is logged by job A (the preemptee): {noformat} ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc client disassociated INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 
WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost) INFO DAGScheduler: Executor lost: 447 (epoch 0) INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from BlockManagerMaster. INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, g023.grid.eaglerd.local, 41406) INFO BlockManagerMaster: Removed 447 successfully in removeExecutor {noformat} (It's strange for errors/warnings to be logged for preemption) Later, when job B's AM starts requesting its resources, I get lots of the following in job A: {noformat} ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc client disassociated INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0 WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost) WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. {noformat} Finally, when I kill job B, job A emits lots of the following: {noformat} INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31 WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist! {noformat} And finally after some time: {noformat} WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 165964 ms exceeds timeout 12 ms ERROR YarnScheduler: Lost an executor 466 (already removed): Executor heartbeat timed out after 165964 ms {noformat} At this point the job never requests/acquires more resources and hangs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8702) Avoid massive concating strings in Javascript
[ https://issues.apache.org/jira/browse/SPARK-8702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8702: --- Assignee: Apache Spark Avoid massive concating strings in Javascript - Key: SPARK-8702 URL: https://issues.apache.org/jira/browse/SPARK-8702 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Shixiong Zhu Assignee: Apache Spark When there are massive tasks, such as {{sc.parallelize(1 to 10, 1).count()}}, the generated JS code has a lot of string concatenations in the stage page, nearly 40 string concatenations for one task. We can generate the whole string for a task instead of executing string concatenations in the browser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8702) Avoid massive concating strings in Javascript
[ https://issues.apache.org/jira/browse/SPARK-8702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8702: --- Assignee: (was: Apache Spark) Avoid massive concating strings in Javascript - Key: SPARK-8702 URL: https://issues.apache.org/jira/browse/SPARK-8702 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Shixiong Zhu When there are massive tasks, such as {{sc.parallelize(1 to 10, 1).count()}}, the generated JS code has a lot of string concatenations in the stage page, nearly 40 string concatenations for one task. We can generate the whole string for a task instead of executing string concatenations in the browser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8702) Avoid massive concating strings in Javascript
[ https://issues.apache.org/jira/browse/SPARK-8702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605452#comment-14605452 ] Apache Spark commented on SPARK-8702: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/7082 Avoid massive concating strings in Javascript - Key: SPARK-8702 URL: https://issues.apache.org/jira/browse/SPARK-8702 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Shixiong Zhu When there are massive tasks, such as {{sc.parallelize(1 to 10, 1).count()}}, the generated JS code has a lot of string concatenations in the stage page, nearly 40 string concatenations for one task. We can generate the whole string for a task instead of executing string concatenations in the browser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
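The proposed change, emitting one pre-built string per task instead of roughly 40 browser-side concatenations, can be illustrated outside JavaScript (Python here; the row format is made up for the example):

```python
def row_concat(task):
    # Mimics the old pattern: many incremental += concatenations,
    # one per cell, executed for every task row.
    s = "<tr>"
    s += "<td>" + str(task["id"]) + "</td>"
    s += "<td>" + task["status"] + "</td>"
    s += "<td>" + str(task["duration"]) + "</td>"
    s += "</tr>"
    return s

def row_prebuilt(task):
    # The fix: generate the whole row string in one formatting step.
    return "<tr><td>{id}</td><td>{status}</td><td>{duration}</td></tr>".format(**task)

task = {"id": 0, "status": "SUCCESS", "duration": 12}
assert row_concat(task) == row_prebuilt(task)
```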
[jira] [Assigned] (SPARK-8703) Add CountVectorizer as a ml transformer to convert document to words count vector
[ https://issues.apache.org/jira/browse/SPARK-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8703: --- Assignee: Apache Spark Add CountVectorizer as a ml transformer to convert document to words count vector - Key: SPARK-8703 URL: https://issues.apache.org/jira/browse/SPARK-8703 Project: Spark Issue Type: New Feature Components: ML Reporter: yuhao yang Assignee: Apache Spark Original Estimate: 24h Remaining Estimate: 24h Converts a text document to a sparse vector of token counts. I can further add an estimator to extract vocabulary from corpus if that's appropriate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8703) Add CountVectorizer as a ml transformer to convert document to words count vector
[ https://issues.apache.org/jira/browse/SPARK-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-8703: -- Description: Converts a text document to a sparse vector of token counts. Similar to http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html I can further add an estimator to extract vocabulary from corpus if that's appropriate. was: Converts a text document to a sparse vector of token counts. I can further add an estimator to extract vocabulary from corpus if that's appropriate. Add CountVectorizer as a ml transformer to convert document to words count vector - Key: SPARK-8703 URL: https://issues.apache.org/jira/browse/SPARK-8703 Project: Spark Issue Type: New Feature Components: ML Reporter: yuhao yang Original Estimate: 24h Remaining Estimate: 24h Converts a text document to a sparse vector of token counts. Similar to http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html I can further add an estimator to extract vocabulary from corpus if that's appropriate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
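What the proposed transformer computes is a vocabulary index plus per-document token counts, along the lines of scikit-learn's CountVectorizer linked above. A dependency-free sketch under assumed names (neither function is Spark API):

```python
from collections import Counter

def fit_vocabulary(docs):
    """Extract a sorted term index (the 'estimator' step mentioned above)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    return {tok: i for i, tok in enumerate(vocab)}

def transform(doc, vocab):
    """Convert one tokenized document into a sparse (index -> count) map."""
    counts = Counter(tok for tok in doc if tok in vocab)
    return {vocab[tok]: n for tok, n in counts.items()}

docs = [["a", "b", "a"], ["b", "c"]]
vocab = fit_vocabulary(docs)         # {'a': 0, 'b': 1, 'c': 2}
print(transform(docs[0], vocab))     # {0: 2, 1: 1}
```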
[jira] [Assigned] (SPARK-8703) Add CountVectorizer as a ml transformer to convert document to words count vector
[ https://issues.apache.org/jira/browse/SPARK-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8703: --- Assignee: (was: Apache Spark) Add CountVectorizer as a ml transformer to convert document to words count vector - Key: SPARK-8703 URL: https://issues.apache.org/jira/browse/SPARK-8703 Project: Spark Issue Type: New Feature Components: ML Reporter: yuhao yang Original Estimate: 24h Remaining Estimate: 24h Converts a text document to a sparse vector of token counts. I can further add an estimator to extract vocabulary from corpus if that's appropriate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8235) misc function: sha1 / sha
[ https://issues.apache.org/jira/browse/SPARK-8235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8235. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6963 [https://github.com/apache/spark/pull/6963] misc function: sha1 / sha - Key: SPARK-8235 URL: https://issues.apache.org/jira/browse/SPARK-8235 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Fix For: 1.5.0 sha1(string/binary): string sha(string/binary): string Calculates the SHA-1 digest for string or binary and returns the value as a hex string (as of Hive 1.3.0). Example: sha1('ABC') = '3c01bdbb26f358bab27f267924aa2c9a03fcfdb8'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
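The expected semantics are easy to check against the example in the description using Python's standard hashlib (hex digest of the UTF-8 bytes; the helper name is ours, not Spark's):

```python
import hashlib

def sha1_hex(data):
    """SHA-1 digest as a lowercase hex string, for str or bytes input."""
    if isinstance(data, str):
        data = data.encode("utf-8")
    return hashlib.sha1(data).hexdigest()

print(sha1_hex("ABC"))  # 3c01bdbb26f358bab27f267924aa2c9a03fcfdb8
```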
[jira] [Commented] (SPARK-6830) Memoize frequently queried vals in RDD, such as numPartitions, count etc.
[ https://issues.apache.org/jira/browse/SPARK-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606157#comment-14606157 ] Sean Owen commented on SPARK-6830: -- Is this valid? For example, consider an RDD from a file that's being written to. count() would return larger values each time it is called. Caching it would change this behavior. Of course, caching the RDD would also mean the count was then fixed, but these are semantically different. Memoize frequently queried vals in RDD, such as numPartitions, count etc. - Key: SPARK-6830 URL: https://issues.apache.org/jira/browse/SPARK-6830 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman Priority: Minor Labels: Starter We should memoize frequently queried vals in RDD, such as numPartitions, count etc. While using SparkR in RStudio, the `count` function seems to be called frequently by the IDE – I think this is to show some stats about variables in the workspace etc. but this is not great in SparkR as we trigger a job every time count is called. Memoization would help in this case, but we should also see if there is some better way to interact with RStudio. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
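Sean's objection can be made concrete: once a count is memoized, a growing source no longer changes the answer. A minimal sketch (toy class, not SparkR code):

```python
class MemoCountRDD:
    """Toy RDD over a mutable source, with an optional memoized count."""
    def __init__(self, source):
        self.source = source
        self._count = None

    def count(self, memoize=True):
        if not memoize:
            return len(self.source)
        if self._count is None:
            self._count = len(self.source)  # first call caches the answer
        return self._count

lines = ["a", "b"]
rdd = MemoCountRDD(lines)
assert rdd.count() == 2
lines.append("c")                  # the underlying file keeps growing
assert rdd.count() == 2            # memoized: stale, as Sean warns
assert rdd.count(memoize=False) == 3
```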
[jira] [Comment Edited] (SPARK-5571) LDA should handle text as well
[ https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606161#comment-14606161 ] Alok Singh edited comment on SPARK-5571 at 6/29/15 7:00 PM: I would like to work on it. was (Author: aloknsingh): I would like to work on it if everyone is OK. LDA should handle text as well -- Key: SPARK-5571 URL: https://issues.apache.org/jira/browse/SPARK-5571 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently operates only on vectors of word counts. It should also support training and prediction using text (Strings). This plan is sketched in the [original LDA design doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing]. There should be: * runWithText() method which takes an RDD with a collection of Strings (bags of words). This will also index terms and compute a dictionary. * dictionary parameter for when LDA is run with word count vectors * prediction/feedback methods returning Strings (such as describeTopicsAsStrings, which is commented out in LDA currently) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5571) LDA should handle text as well
[ https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606159#comment-14606159 ] Alok Singh commented on SPARK-5571: --- Since there is already a Tokenizer class, we can assume the other classes will be made, so I can assume that the input is already tokenized, stemmed, and stopword-removed. LDA should handle text as well -- Key: SPARK-5571 URL: https://issues.apache.org/jira/browse/SPARK-5571 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently operates only on vectors of word counts. It should also support training and prediction using text (Strings). This plan is sketched in the [original LDA design doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing]. There should be: * runWithText() method which takes an RDD with a collection of Strings (bags of words). This will also index terms and compute a dictionary. * dictionary parameter for when LDA is run with word count vectors * prediction/feedback methods returning Strings (such as describeTopicsAsStrings, which is commented out in LDA currently) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5571) LDA should handle text as well
[ https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606161#comment-14606161 ] Alok Singh commented on SPARK-5571: --- I would like to work on it if everyone is OK. LDA should handle text as well -- Key: SPARK-5571 URL: https://issues.apache.org/jira/browse/SPARK-5571 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently operates only on vectors of word counts. It should also support training and prediction using text (Strings). This plan is sketched in the [original LDA design doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing]. There should be: * runWithText() method which takes an RDD with a collection of Strings (bags of words). This will also index terms and compute a dictionary. * dictionary parameter for when LDA is run with word count vectors * prediction/feedback methods returning Strings (such as describeTopicsAsStrings, which is commented out in LDA currently) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
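One of the planned methods, describeTopicsAsStrings, maps topic term weights back through the dictionary to readable terms. A Spark-free sketch of that direction (function name from the issue; signature and data are hypothetical):

```python
def describe_topics_as_strings(topic_weights, dictionary, k=3):
    """Map each topic's term weights back to its top-k terms,
    sketching the describeTopicsAsStrings method mentioned above."""
    inverse = {i: term for term, i in dictionary.items()}
    topics = []
    for weights in topic_weights:
        top = sorted(range(len(weights)), key=lambda i: -weights[i])[:k]
        topics.append([inverse[i] for i in top])
    return topics

dictionary = {"spark": 0, "yarn": 1, "hbase": 2}
weights = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
print(describe_topics_as_strings(weights, dictionary, k=2))
# [['spark', 'yarn'], ['hbase', 'yarn']]
```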
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606166#comment-14606166 ] Ted Malaska commented on SPARK-2447: Hey Andrew, https://issues.apache.org/jira/browse/HBASE-13992 Let me know if there is anything else I can do. I would love this to get into HBase. Let me know if you want to chat offline. Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto-flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6129) Add a section in user guide for model evaluation
[ https://issues.apache.org/jira/browse/SPARK-6129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606165#comment-14606165 ] Seth Hendrickson commented on SPARK-6129: - If no one else has started on this, I'd like to give it a go. Add a section in user guide for model evaluation Key: SPARK-6129 URL: https://issues.apache.org/jira/browse/SPARK-6129 Project: Spark Issue Type: New Feature Components: Documentation, MLlib Reporter: Xiangrui Meng We now have evaluation metrics for binary, multiclass, ranking, and multilabel in MLlib. It would be nice to have a section in the user guide to summarize them.
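For the binary case, the metrics such a user-guide section would summarize boil down to a few counts. A toy sketch (plain Python, not MLlib's BinaryClassificationMetrics API) of precision, recall, and F1:

```python
def binary_metrics(predictions, labels):
    """Compute precision, recall and F1 from (prediction, label) pairs,
    with labels/predictions encoded as 1 (positive) and 0 (negative)."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p == 1 and l == 1)  # true positives
    fp = sum(1 for p, l in pairs if p == 1 and l == 0)  # false positives
    fn = sum(1 for p, l in pairs if p == 0 and l == 1)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = binary_metrics([1, 1, 0, 1], [1, 0, 0, 1])
# p == 2/3, r == 1.0, f1 == 0.8
```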
[jira] [Resolved] (SPARK-8528) Add applicationId to SparkContext object in pyspark
[ https://issues.apache.org/jira/browse/SPARK-8528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-8528. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6936 [https://github.com/apache/spark/pull/6936] Add applicationId to SparkContext object in pyspark --- Key: SPARK-8528 URL: https://issues.apache.org/jira/browse/SPARK-8528 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.4.0 Reporter: Vladimir Vladimirov Priority: Minor Fix For: 1.5.0 It is available in the Scala API. Our use case: we want to log the applicationId (YARN in our case) to request help with troubleshooting from DevOps if our app has failed.
[jira] [Updated] (SPARK-8528) Add applicationId to SparkContext object in pyspark
[ https://issues.apache.org/jira/browse/SPARK-8528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-8528: -- Assignee: Vladimir Vladimirov Add applicationId to SparkContext object in pyspark --- Key: SPARK-8528 URL: https://issues.apache.org/jira/browse/SPARK-8528 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.4.0 Reporter: Vladimir Vladimirov Assignee: Vladimir Vladimirov Priority: Minor Fix For: 1.5.0 It is available in the Scala API. Our use case: we want to log the applicationId (YARN in our case) to request help with troubleshooting from DevOps if our app has failed.
[jira] [Commented] (SPARK-6830) Memoize frequently queried vals in RDD, such as numPartitions, count etc.
[ https://issues.apache.org/jira/browse/SPARK-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606176#comment-14606176 ] Perinkulam I Ganesh commented on SPARK-6830: This thought crossed our mind as well earlier. So we were debating whether the caching should be implemented within the cacheManager, so that the count is cached only if the underlying RDD is cached. Memoize frequently queried vals in RDD, such as numPartitions, count etc. - Key: SPARK-6830 URL: https://issues.apache.org/jira/browse/SPARK-6830 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman Priority: Minor Labels: Starter We should memoize frequently queried vals in RDD, such as numPartitions, count etc. While using SparkR in RStudio, the `count` function seems to be called frequently by the IDE – I think this is to show some stats about variables in the workspace etc. but this is not great in SparkR as we trigger a job every time count is called. Memoization would help in this case, but we should also see if there is some better way to interact with RStudio.
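The memoization being proposed is straightforward because RDDs are immutable: the first count() pays for a job, later calls return the cached value. A minimal sketch (MemoizedRDDWrapper and its compute callback are invented for illustration, not SparkR's actual implementation):

```python
class MemoizedRDDWrapper:
    """Hypothetical wrapper caching expensive vals like count."""
    def __init__(self, compute_count):
        self._compute_count = compute_count  # would trigger a Spark job
        self._count = None
    def count(self):
        if self._count is None:   # only the first call runs the job
            self._count = self._compute_count()
        return self._count        # subsequent calls are free

calls = []
rdd = MemoizedRDDWrapper(lambda: calls.append(1) or 42)
assert rdd.count() == 42
assert rdd.count() == 42
assert len(calls) == 1  # the expensive computation ran exactly once
```

This is also why the cacheManager question above matters: a wrapper like this caches unconditionally, whereas hooking into the cacheManager would tie the cached count to the lifetime of the cached RDD data.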
[jira] [Commented] (SPARK-8670) Nested columns can't be referenced (but they can be selected)
[ https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606178#comment-14606178 ] Nicholas Chammas commented on SPARK-8670: - FYI: `df.stats.age` works neither on 1.3 nor on 1.4. In both cases it yields this: {code} AttributeError: 'Column' object has no attribute 'age' {code} `df.selectExpr('stats.age')` does work, though. Nested columns can't be referenced (but they can be selected) - Key: SPARK-8670 URL: https://issues.apache.org/jira/browse/SPARK-8670 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.4.0 Reporter: Nicholas Chammas This is strange and looks like a regression from 1.3. {code} import json daterz = [ { 'name': 'Nick', 'stats': { 'age': 28 } }, { 'name': 'George', 'stats': { 'age': 31 } } ] df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x))) df.select('stats.age').show() df['stats.age'] # 1.4 fails on this line {code} On 1.3 this works and yields: {code} age 28 31 Out[1]: Column<stats.age AS age#2958L> {code} On 1.4, however, this gives an error on the last line: {code} +---+ |age| +---+ | 28| | 31| +---+ --------------------------------------------------------------------------- IndexError Traceback (most recent call last) <ipython-input-1-04bd990e94c6> in <module>() 19 20 df.select('stats.age').show() ---> 21 df['stats.age'] /path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item) 678 if isinstance(item, basestring): 679 if item not in self.columns: --> 680 raise IndexError("no such column: %s" % item) 681 jc = self._jdf.apply(item) 682 return Column(jc) IndexError: no such column: stats.age {code} This means, among other things, that you can't join DataFrames on nested columns.
[jira] [Commented] (SPARK-8621) crosstab exception when one of the value is empty
[ https://issues.apache.org/jira/browse/SPARK-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606186#comment-14606186 ] Michael Armbrust commented on SPARK-8621: - Will you ever want to access the columns by name? Having to write {{df("name")}} is kind of verbose. I think I would just special case empty string as {{empty string}}, but I don't have a strong opinion here. crosstab exception when one of the value is empty - Key: SPARK-8621 URL: https://issues.apache.org/jira/browse/SPARK-8621 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Critical I think this happened because some value is empty. {code} scala> df1.stat.crosstab("role", "lang") org.apache.spark.sql.AnalysisException: syntax error in attribute name: ; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.parseAttributeName(LogicalPlan.scala:145) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:135) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:157) at org.apache.spark.sql.DataFrame.col(DataFrame.scala:603) at org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:394) at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:160) at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:157) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:157) at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:147) at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:132) at org.apache.spark.sql.execution.stat.StatFunctions$.crossTabulate(StatFunctions.scala:132) at org.apache.spark.sql.DataFrameStatFunctions.crosstab(DataFrameStatFunctions.scala:91) {code}
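The failure mode is easy to picture: crosstab turns the distinct values of the second column into result-column names, so an empty value becomes an empty column name. A plain-Python sketch of that naming step (not Spark's actual code):

```python
def crosstab_columns(values):
    """Sketch: distinct values of the second column become the names
    of the crosstab result columns (sorted here for determinism)."""
    return [str(v) for v in sorted(values)]

cols = crosstab_columns({"scala", "python", ""})
# -> ['', 'python', 'scala']; the empty value yields an empty column
# name, which DataFrame.col() then refuses to parse, matching the
# "syntax error in attribute name" in the stack trace above.
```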
[jira] [Created] (SPARK-8709) Exclude hadoop-client's mockito-all dependency to fix test compilation break for Hadoop 2
Josh Rosen created SPARK-8709: - Summary: Exclude hadoop-client's mockito-all dependency to fix test compilation break for Hadoop 2 Key: SPARK-8709 URL: https://issues.apache.org/jira/browse/SPARK-8709 Project: Spark Issue Type: Bug Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen {{build/sbt -Phadoop-1 -Dhadoop.version=2.0.0-mr1-cdh4.1.1 -Phive -Pkinesis-asl -Phive-thriftserver core/test:compile}} currently fails to compile: {code} [error] /Users/joshrosen/Documents/spark/core/src/test/java/org/apache/spark/shuffle/unsafe/UnsafeShuffleWriterSuite.java:117: error: cannot find symbol [error] when(shuffleMemoryManager.tryToAcquire(anyLong())).then(returnsFirstArg()); [error] ^ [error] symbol: method then(Answer<Object>) [error] location: interface OngoingStubbing<Long> [error] /Users/joshrosen/Documents/spark/core/src/test/java/org/apache/spark/shuffle/unsafe/UnsafeShuffleWriterSuite.java:408: error: cannot find symbol [error] .then(returnsFirstArg()) // Allocate initial sort buffer [error] ^ [error] symbol: method then(Answer<Object>) [error] location: interface OngoingStubbing<Long> [error] /Users/joshrosen/Documents/spark/core/src/test/java/org/apache/spark/shuffle/unsafe/UnsafeShuffleWriterSuite.java:435: error: cannot find symbol [error] .then(returnsFirstArg()) // Allocate initial sort buffer [error] ^ [error] symbol: method then(Answer<Object>) [error] location: interface OngoingStubbing<Long> [error] 3 errors [error] (core/test:compile) javac returned nonzero exit code [error] Total time: 60 s, completed Jun 29, 2015 11:03:13 AM {code} This is because {{hadoop-client}} pulls in a dependency on {{mockito-all}}, but I recently changed Spark to depend on {{mockito-core}} instead, which caused Hadoop's earlier Mockito version to take precedence over our newer version.
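The exclusion named in the summary would look roughly like this in the Maven POM; the surrounding coordinates and placement are assumptions based on the description, not the actual patch:

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <exclusions>
    <!-- hadoop-client pulls in mockito-all, which shadows the newer
         mockito-core that Spark's tests now depend on -->
    <exclusion>
      <groupId>org.mockito</groupId>
      <artifactId>mockito-all</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```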
[jira] [Commented] (SPARK-8668) expr function to convert SQL expression into a Column
[ https://issues.apache.org/jira/browse/SPARK-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606067#comment-14606067 ] Tarek Auel commented on SPARK-8668: --- Hi, just to get it right: selectExpr of the DataFrame API at the moment takes varargs as arguments. This should be enhanced in order to parse ONE string argument that contains multiple expressions, shouldn't it? Or do I get it wrong? expr function to convert SQL expression into a Column - Key: SPARK-8668 URL: https://issues.apache.org/jira/browse/SPARK-8668 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin selectExpr uses the expression parser to parse string expressions. It would be great to create an expr function in functions.scala/functions.py that converts a string into an expression (or a list of expressions separated by commas).
[jira] [Commented] (SPARK-8668) expr function to convert SQL expression into a Column
[ https://issues.apache.org/jira/browse/SPARK-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606073#comment-14606073 ] Reynold Xin commented on SPARK-8668: This is not about selectExpr, but about adding a new expr function that takes in a single string and returns an expression. Once we do that, we can have expr and selectExpr support taking in one string and returning multiple expressions (wrapped in a wrapper expression that the analyzer can expand). expr function to convert SQL expression into a Column - Key: SPARK-8668 URL: https://issues.apache.org/jira/browse/SPARK-8668 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin selectExpr uses the expression parser to parse string expressions. It would be great to create an expr function in functions.scala/functions.py that converts a string into an expression (or a list of expressions separated by commas).
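Supporting one string that yields multiple expressions means the split cannot be a naive comma split, because commas also appear inside function calls. A minimal sketch of that top-level tokenization (plain Python for illustration, not the actual Catalyst parser):

```python
def split_expressions(s):
    """Split "a, substr(b, 1, 2), c" on top-level commas only,
    tracking parenthesis depth so nested commas are left alone."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        elif ch == ',' and depth == 0:
            parts.append(s[start:i].strip())
            start = i + 1
    parts.append(s[start:].strip())
    return parts

parts = split_expressions("colA, substr(colB, 1, 2), colC + 1")
# -> ['colA', 'substr(colB, 1, 2)', 'colC + 1']
```

Each resulting fragment would then go through the expression parser individually, matching the wrapper-expression idea described above.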
[jira] [Assigned] (SPARK-8410) Hive VersionsSuite RuntimeException
[ https://issues.apache.org/jira/browse/SPARK-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8410: --- Assignee: Burak Yavuz (was: Apache Spark) Hive VersionsSuite RuntimeException --- Key: SPARK-8410 URL: https://issues.apache.org/jira/browse/SPARK-8410 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1, 1.4.0 Environment: IBM Power system - P7 running Ubuntu 14.04LE Reporter: Josiah Samuel Sathiadass Assignee: Burak Yavuz Priority: Minor While testing Spark Project Hive, there are RuntimeExceptions as follows, VersionsSuite: - success sanity check *** FAILED *** java.lang.RuntimeException: [download failed: org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed: org.codehaus.groovy#groovy-all;2.1.6!groovy-all.jar, download failed: asm#asm;3.2!asm.jar] at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978) at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62) at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62) at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:38) at org.apache.spark.sql.hive.client.IsolatedClientLoader$.org$apache$spark$sql$hive$client$IsolatedClientLoader$$downloadVersion(IsolatedClientLoader.scala:61) at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$1.apply(IsolatedClientLoader.scala:44) at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$1.apply(IsolatedClientLoader.scala:44) at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189) at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91) at org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:44) ... 
The tests are executed with the following set of options: build/mvn -pl sql/hive --fail-never -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 test Adding the following dependencies in the spark/sql/hive/pom.xml file solves this issue:
<dependency>
  <groupId>org.jboss.netty</groupId>
  <artifactId>netty</artifactId>
  <version>3.2.2.Final</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.codehaus.groovy</groupId>
  <artifactId>groovy-all</artifactId>
  <version>2.1.6</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>asm</groupId>
  <artifactId>asm</artifactId>
  <version>3.2</version>
  <scope>test</scope>
</dependency>
The question is: is this the correct way to fix this RuntimeException? If yes, can a pull request fix this issue permanently? If not, suggestions please. Updates: The above-mentioned quick fix is not working with the latest 1.4 because of this pull request: [SPARK-8095] Resolve dependencies of --packages in local ivy cache #6788 https://github.com/apache/spark/pull/6788 Due to the above commit, the lookup directories during the testing phase have now changed as follows: :: problems summary :: WARNINGS [NOT FOUND ] org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle) (2ms) local-m2-cache: tried file:/home/joe/sparkibmsoe/spark/sql/hive/dummy/.m2/repository/org/jboss/netty/netty/3.2.2.Final/netty-3.2.2.Final.jar [NOT FOUND ] org.codehaus.groovy#groovy-all;2.1.6!groovy-all.jar (0ms) local-m2-cache: tried file:/home/joe/sparkibmsoe/spark/sql/hive/dummy/.m2/repository/org/codehaus/groovy/groovy-all/2.1.6/groovy-all-2.1.6.jar [NOT FOUND ] asm#asm;3.2!asm.jar (0ms) local-m2-cache: tried file:/home/joe/sparkibmsoe/spark/sql/hive/dummy/.m2/repository/asm/asm/3.2/asm-3.2.jar :: :: FAILED DOWNLOADS :: :: ^ see resolution messages for details ^ :: :: :: org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle) :: org.codehaus.groovy#groovy-all;2.1.6!groovy-all.jar :: asm#asm;3.2!asm.jar ::
[jira] [Assigned] (SPARK-8475) SparkSubmit with Ivy jars is very slow to load with no internet access
[ https://issues.apache.org/jira/browse/SPARK-8475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8475: --- Assignee: (was: Apache Spark) SparkSubmit with Ivy jars is very slow to load with no internet access -- Key: SPARK-8475 URL: https://issues.apache.org/jira/browse/SPARK-8475 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 1.4.0 Reporter: Nathan McCarthy Priority: Minor Spark Submit adds maven central and the spark bintray repo to the ChainResolver before it adds any external resolvers. https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L821 When running on a cluster without internet access, this means the spark shell takes forever to launch as it tries these two remote repos before the ones specified in the --repositories list. In our case we have a proxy which the cluster can access, and we supply it via --repositories. This is also a problem for users who maintain a proxy for maven/ivy repos with something like Nexus/Artifactory. Having a repo proxy is popular at many organisations, so I'd say this would be a useful change for these users as well. In the current state, even if a maven central proxy is supplied, it will still try and hit central. I see two options for a fix: * Change the order repos are added to the ChainResolver, making the --repositories supplied repos come before anything else. https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L843 * Have a config option (like spark.jars.ivy.useDefaultRemoteRepos, default true) which, when false, won't add maven central and bintray to the ChainResolver. Happy to do a PR for this fix.
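Sketched abstractly, the two proposed fixes amount to an ordering change plus an opt-out flag. Plain Python standing in for the ChainResolver setup in SparkSubmit.scala; the flag name is the one floated above, everything else is invented:

```python
DEFAULT_REPOS = ["central", "spark-packages-bintray"]  # hit the network

def build_resolver_chain(user_repos, use_default_remote_repos=True):
    """Put --repositories entries first so an internal proxy is tried
    before (or, with the flag off, instead of) the built-in repos."""
    chain = list(user_repos)          # fix 1: user repos come first
    if use_default_remote_repos:      # fix 2: proposed config option
        chain += DEFAULT_REPOS
    return chain

chain = build_resolver_chain(["http://nexus.internal/repo"])
# -> ['http://nexus.internal/repo', 'central', 'spark-packages-bintray']
```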