[jira] [Assigned] (SPARK-8374) Job frequently hangs after YARN preemption

2015-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8374:
---

Assignee: (was: Apache Spark)

 Job frequently hangs after YARN preemption
 --

 Key: SPARK-8374
 URL: https://issues.apache.org/jira/browse/SPARK-8374
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.4.0
 Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04
Reporter: Shay Rojansky
Priority: Critical

 After upgrading to Spark 1.4.0, jobs that get preempted very frequently will 
 not reacquire executors and will therefore hang. To reproduce:
 1. I run Spark job A that acquires all grid resources
 2. I run Spark job B in a higher-priority queue that acquires all grid 
 resources. Job A is fully preempted.
 3. Kill job B, releasing all resources
 4. Job A should at this point reacquire all grid resources, but occasionally 
 doesn't. Repeating the preemption scenario makes the bad behavior occur 
 within a few attempts.
 (see logs at bottom).
 Note issue SPARK-7451, which was supposed to fix some Spark YARN preemption 
 issues; the work there may be related to these new issues.
 The 1.4.0 preemption situation is considerably worse than 1.3.1 (we've 
 downgraded to 1.3.1 just because of this issue).
 Logs
 --
 When job B (the preemptor) first acquires an application master, the following 
 is logged by job A (the preemptee):
 {noformat}
 ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc 
 client disassociated
 INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0
 WARN ReliableDeliverySupervisor: Association with remote system 
 [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address 
 is now gated for [5000] ms. Reason is: [Disassociated].
 WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, 
 g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost)
 INFO DAGScheduler: Executor lost: 447 (epoch 0)
 INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from 
 BlockManagerMaster.
 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, 
 g023.grid.eaglerd.local, 41406)
 INFO BlockManagerMaster: Removed 447 successfully in removeExecutor
 {noformat}
 (It's strange for errors/warnings to be logged for preemption)
 Later, when job B's AM starts requesting its resources, I get lots of the 
 following in job A:
 {noformat}
 ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc 
 client disassociated
 INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0
 WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, 
 g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost)
 WARN ReliableDeliverySupervisor: Association with remote system 
 [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address 
 is now gated for [5000] ms. Reason is: [Disassociated].
 {noformat}
 Finally, when I kill job B, job A emits lots of the following:
 {noformat}
 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31
 WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist!
 {noformat}
 And finally after some time:
 {noformat}
 WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 
 165964 ms exceeds timeout 120000 ms
 ERROR YarnScheduler: Lost an executor 466 (already removed): Executor 
 heartbeat timed out after 165964 ms
 {noformat}
 At this point the job never requests/acquires more resources and hangs.






[jira] [Assigned] (SPARK-8374) Job frequently hangs after YARN preemption

2015-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8374:
---

Assignee: Apache Spark

 Job frequently hangs after YARN preemption
 --

 Key: SPARK-8374
 URL: https://issues.apache.org/jira/browse/SPARK-8374
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.4.0
 Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04
Reporter: Shay Rojansky
Assignee: Apache Spark
Priority: Critical

 After upgrading to Spark 1.4.0, jobs that get preempted very frequently will 
 not reacquire executors and will therefore hang. To reproduce:
 1. I run Spark job A that acquires all grid resources
 2. I run Spark job B in a higher-priority queue that acquires all grid 
 resources. Job A is fully preempted.
 3. Kill job B, releasing all resources
 4. Job A should at this point reacquire all grid resources, but occasionally 
 doesn't. Repeating the preemption scenario makes the bad behavior occur 
 within a few attempts.
 (see logs at bottom).
 Note issue SPARK-7451, which was supposed to fix some Spark YARN preemption 
 issues; the work there may be related to these new issues.
 The 1.4.0 preemption situation is considerably worse than 1.3.1 (we've 
 downgraded to 1.3.1 just because of this issue).
 Logs
 --
 When job B (the preemptor) first acquires an application master, the following 
 is logged by job A (the preemptee):
 {noformat}
 ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc 
 client disassociated
 INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0
 WARN ReliableDeliverySupervisor: Association with remote system 
 [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address 
 is now gated for [5000] ms. Reason is: [Disassociated].
 WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, 
 g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost)
 INFO DAGScheduler: Executor lost: 447 (epoch 0)
 INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from 
 BlockManagerMaster.
 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, 
 g023.grid.eaglerd.local, 41406)
 INFO BlockManagerMaster: Removed 447 successfully in removeExecutor
 {noformat}
 (It's strange for errors/warnings to be logged for preemption)
 Later, when job B's AM starts requesting its resources, I get lots of the 
 following in job A:
 {noformat}
 ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc 
 client disassociated
 INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0
 WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, 
 g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost)
 WARN ReliableDeliverySupervisor: Association with remote system 
 [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address 
 is now gated for [5000] ms. Reason is: [Disassociated].
 {noformat}
 Finally, when I kill job B, job A emits lots of the following:
 {noformat}
 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31
 WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist!
 {noformat}
 And finally after some time:
 {noformat}
 WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 
 165964 ms exceeds timeout 120000 ms
 ERROR YarnScheduler: Lost an executor 466 (already removed): Executor 
 heartbeat timed out after 165964 ms
 {noformat}
 At this point the job never requests/acquires more resources and hangs.






[jira] [Assigned] (SPARK-8704) Add additional methods to wrappers in ml.pyspark.feature

2015-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8704:
---

Assignee: Apache Spark

 Add additional methods to wrappers in ml.pyspark.feature
 

 Key: SPARK-8704
 URL: https://issues.apache.org/jira/browse/SPARK-8704
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Reporter: Manoj Kumar
Assignee: Apache Spark

 std, mean to StandardScalerModel
 getVectors, findSynonyms to Word2Vec Model
 setFeatures and getFeatures to hashingTF






[jira] [Assigned] (SPARK-8704) Add additional methods to wrappers in ml.pyspark.feature

2015-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8704:
---

Assignee: (was: Apache Spark)

 Add additional methods to wrappers in ml.pyspark.feature
 

 Key: SPARK-8704
 URL: https://issues.apache.org/jira/browse/SPARK-8704
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Reporter: Manoj Kumar

 std, mean to StandardScalerModel
 getVectors, findSynonyms to Word2Vec Model
 setFeatures and getFeatures to hashingTF






[jira] [Commented] (SPARK-8337) KafkaUtils.createDirectStream for python is lacking API/feature parity with the Scala/Java version

2015-06-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605588#comment-14605588
 ] 

Juan Rodríguez Hortalá commented on SPARK-8337:
---

Hi [~jerryshao], 

That is a good idea; I should have paid more attention to the discussion in the 
duplicated issue. I will try that approach and tell you how it goes. 

Greetings, 

Juan

 KafkaUtils.createDirectStream for python is lacking API/feature parity with 
 the Scala/Java version
 --

 Key: SPARK-8337
 URL: https://issues.apache.org/jira/browse/SPARK-8337
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Streaming
Affects Versions: 1.4.0
Reporter: Amit Ramesh
Priority: Critical

 See the following thread for context.
 http://apache-spark-developers-list.1001551.n3.nabble.com/Re-Spark-1-4-Python-API-for-getting-Kafka-offsets-in-direct-mode-tt12714.html






[jira] [Commented] (SPARK-6599) Improve reliability and usability of Kinesis-based Spark Streaming

2015-06-29 Thread Przemyslaw Pastuszka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605591#comment-14605591
 ] 

Przemyslaw Pastuszka commented on SPARK-6599:
-

Is there any work being done on this? Can I help somehow?

 Improve reliability and usability of Kinesis-based Spark Streaming
 --

 Key: SPARK-6599
 URL: https://issues.apache.org/jira/browse/SPARK-6599
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das

 Currently, the KinesisReceiver can lose some data in the case of certain 
 failures (receiver and driver failures). Using write-ahead logs can 
 mitigate some of the problem, but it is not ideal because WALs don't work with 
 S3 (eventual consistency, etc.), which is the most likely file system to be 
 used in the EC2 environment. Hence, we have to take a different approach to 
 improving reliability for Kinesis.
 A detailed design doc on how this can be achieved will be added later.






[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)

2015-06-29 Thread Lars Francke (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605657#comment-14605657
 ] 

Lars Francke commented on SPARK-2447:
-

Hey Ted et al.,

thanks for the work on this. SparkOnHBase is super useful and clients are 
happily using it.

I wonder, however, what the future direction will be. Any progress on the 
question of whether it's going to be integrated into Spark or not? I don't have a 
strong opinion either way, but I also don't feel that it would be _wrong_ to put it 
into core...

 Add common solution for sending upsert actions to HBase (put, deletes, and 
 increment)
 -

 Key: SPARK-2447
 URL: https://issues.apache.org/jira/browse/SPARK-2447
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska

 Going to review the design with Tdas today.  
 But the first thought is to have an extension of VoidFunction that handles the 
 connection to HBase and allows for options such as turning auto-flush off for 
 higher throughput.
 Need to answer the following questions first.
 - Can it be written in Java or should it be written in Scala?
 - What is the best way to add the HBase dependency? (will review how Flume 
 does this as the first option)
 - What is the best way to do testing? (will review how Flume does this as the 
 first option)
 - How to support Python? (Python may be a different JIRA; it is unknown at 
 this time)
 Goals:
 - Simple to use
 - Stable
 - Supports high load
 - Documented (May be in a separate Jira need to ask Tdas)
 - Supports Java, Scala, and hopefully Python
 - Supports Streaming and normal Spark
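
A minimal sketch of the kind of helper described above, assuming the HBase 1.x client API (a BufferedMutator plays the role of an auto-flush-off HTable); the table name, column family, and record layout are hypothetical, and this is not SparkOnHBase's actual API:
{code}
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

// Write (rowKey, value) pairs to HBase with one connection and one
// BufferedMutator per partition, so puts are buffered and flushed in
// batches instead of being auto-flushed one at a time.
def bulkPut(rdd: RDD[(String, String)], table: String): Unit = {
  rdd.foreachPartition { rows =>
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val mutator = connection.getBufferedMutator(TableName.valueOf(table))
    try {
      rows.foreach { case (rowKey, value) =>
        val put = new Put(Bytes.toBytes(rowKey))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
        mutator.mutate(put) // buffered write
      }
    } finally {
      mutator.close()    // flushes any remaining buffered puts
      connection.close()
    }
  }
}
{code}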






[jira] [Assigned] (SPARK-8693) profiles and goals are not printed in a nice way

2015-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8693:
---

Assignee: (was: Apache Spark)

 profiles and goals are not printed in a nice way
 

 Key: SPARK-8693
 URL: https://issues.apache.org/jira/browse/SPARK-8693
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Project Infra
Reporter: Yin Huai
Priority: Minor

 In our master build, I see
 {code}
 -Phadoop-1[info] Building Spark (w/Hive 0.13.1) using SBT with these 
 arguments:  -Dhadoop.version=1.0.4[info] Building Spark (w/Hive 0.13.1) using 
 SBT with these arguments:  -Pkinesis-asl[info] Building Spark (w/Hive 0.13.1) 
 using SBT with these arguments:  -Phive-thriftserver[info] Building Spark 
 (w/Hive 0.13.1) using SBT with these arguments:  -Phive[info] Building Spark 
 (w/Hive 0.13.1) using SBT with these arguments:  package[info] Building Spark 
 (w/Hive 0.13.1) using SBT with these arguments:  assembly/assembly[info] 
 Building Spark (w/Hive 0.13.1) using SBT with these arguments:  
 streaming-kafka-assembly/assembly
 {code}
 Seems we format the string in a wrong way?
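
The output above looks like the "[info] Building Spark ..." banner is being printed once per flag instead of once for the whole argument list. A minimal Scala illustration of that shape (the real logic lives in the build scripts, so everything here is only illustrative):
{code}
// Flags taken from the output above.
val sbtArgs = Seq("-Phadoop-1", "-Dhadoop.version=1.0.4", "-Pkinesis-asl",
  "-Phive-thriftserver", "-Phive", "package", "assembly/assembly",
  "streaming-kafka-assembly/assembly")

// Buggy shape: the banner is repeated for every argument.
sbtArgs.foreach { arg =>
  print(s"[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments:  $arg")
}

// Intended shape: one banner with the arguments joined.
println(s"[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments:  ${sbtArgs.mkString(" ")}")
{code}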






[jira] [Assigned] (SPARK-8693) profiles and goals are not printed in a nice way

2015-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8693:
---

Assignee: Apache Spark

 profiles and goals are not printed in a nice way
 

 Key: SPARK-8693
 URL: https://issues.apache.org/jira/browse/SPARK-8693
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Project Infra
Reporter: Yin Huai
Assignee: Apache Spark
Priority: Minor

 In our master build, I see
 {code}
 -Phadoop-1[info] Building Spark (w/Hive 0.13.1) using SBT with these 
 arguments:  -Dhadoop.version=1.0.4[info] Building Spark (w/Hive 0.13.1) using 
 SBT with these arguments:  -Pkinesis-asl[info] Building Spark (w/Hive 0.13.1) 
 using SBT with these arguments:  -Phive-thriftserver[info] Building Spark 
 (w/Hive 0.13.1) using SBT with these arguments:  -Phive[info] Building Spark 
 (w/Hive 0.13.1) using SBT with these arguments:  package[info] Building Spark 
 (w/Hive 0.13.1) using SBT with these arguments:  assembly/assembly[info] 
 Building Spark (w/Hive 0.13.1) using SBT with these arguments:  
 streaming-kafka-assembly/assembly
 {code}
 Seems we format the string in a wrong way?






[jira] [Commented] (SPARK-8636) CaseKeyWhen has incorrect NULL handling

2015-06-29 Thread Santiago M. Mola (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605557#comment-14605557
 ] 

Santiago M. Mola commented on SPARK-8636:
-

[~davies], [~animeshbaranawal] In SQL, NULL is never equal to NULL. Any 
comparison to NULL is UNKNOWN. Most SQL implementations represent UNKNOWN as 
NULL, too.

 CaseKeyWhen has incorrect NULL handling
 ---

 Key: SPARK-8636
 URL: https://issues.apache.org/jira/browse/SPARK-8636
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Santiago M. Mola
  Labels: starter

 CaseKeyWhen implementation in Spark uses the following equals implementation:
 {code}
   private def equalNullSafe(l: Any, r: Any) = {
 if (l == null && r == null) {
   true
 } else if (l == null || r == null) {
   false
 } else {
   l == r
 }
   }
 {code}
 This is not correct, since in SQL, NULL is never equal to NULL (actually, it 
 is not unequal either). In this case, a NULL value in a CASE WHEN expression 
 should never match.
 For example, you can execute this in MySQL:
 {code}
 SELECT CASE NULL WHEN NULL THEN 'NULL MATCHES' ELSE 'NULL DOES NOT MATCH' END 
 FROM DUAL;
 {code}
 And the result will be 'NULL DOES NOT MATCH'.
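
A minimal sketch of the SQL-style comparison described here, where a NULL on either side simply never matches (an illustration of the semantics only, not the actual patch):
{code}
// CASE key WHEN branch ... matching with SQL three-valued logic:
// comparing anything with NULL is UNKNOWN, so it is treated as "no match".
def matchesSqlStyle(l: Any, r: Any): Boolean = {
  if (l == null || r == null) false // NULL never matches, not even NULL
  else l == r
}
{code}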






[jira] [Commented] (SPARK-8693) profiles and goals are not printed in a nice way

2015-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605606#comment-14605606
 ] 

Apache Spark commented on SPARK-8693:
-

User 'brennonyork' has created a pull request for this issue:
https://github.com/apache/spark/pull/7085

 profiles and goals are not printed in a nice way
 

 Key: SPARK-8693
 URL: https://issues.apache.org/jira/browse/SPARK-8693
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Project Infra
Reporter: Yin Huai
Priority: Minor

 In our master build, I see
 {code}
 -Phadoop-1[info] Building Spark (w/Hive 0.13.1) using SBT with these 
 arguments:  -Dhadoop.version=1.0.4[info] Building Spark (w/Hive 0.13.1) using 
 SBT with these arguments:  -Pkinesis-asl[info] Building Spark (w/Hive 0.13.1) 
 using SBT with these arguments:  -Phive-thriftserver[info] Building Spark 
 (w/Hive 0.13.1) using SBT with these arguments:  -Phive[info] Building Spark 
 (w/Hive 0.13.1) using SBT with these arguments:  package[info] Building Spark 
 (w/Hive 0.13.1) using SBT with these arguments:  assembly/assembly[info] 
 Building Spark (w/Hive 0.13.1) using SBT with these arguments:  
 streaming-kafka-assembly/assembly
 {code}
 Seems we format the string in a wrong way?






[jira] [Commented] (SPARK-8704) Add additional methods to wrappers in ml.pyspark.feature

2015-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605611#comment-14605611
 ] 

Apache Spark commented on SPARK-8704:
-

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/7086

 Add additional methods to wrappers in ml.pyspark.feature
 

 Key: SPARK-8704
 URL: https://issues.apache.org/jira/browse/SPARK-8704
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Reporter: Manoj Kumar

 std, mean to StandardScalerModel
 getVectors, findSynonyms to Word2Vec Model
 setFeatures and getFeatures to hashingTF






[jira] [Closed] (SPARK-8666) checkpointing does not take advantage of persisted/cached RDDs

2015-06-29 Thread Glenn Strycker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glenn Strycker closed SPARK-8666.
-

 checkpointing does not take advantage of persisted/cached RDDs
 --

 Key: SPARK-8666
 URL: https://issues.apache.org/jira/browse/SPARK-8666
 Project: Spark
  Issue Type: New Feature
Reporter: Glenn Strycker

 I have been noticing that when checkpointing RDDs, all operations are 
 occurring TWICE.
 For example, when I run the following code and watch the stages...
 {noformat}
 val newRDD = prevRDD.map(a => (a._1, 
 1L)).distinct.persist(StorageLevel.MEMORY_AND_DISK_SER)
 newRDD.checkpoint
 print(newRDD.count())
 {noformat}
 I see distinct and count operations appearing TWICE, and shuffle disk writes 
 and reads (from the distinct) occurring TWICE.
 My newRDD is persisted to memory, so why can't the checkpoint simply save those 
 partitions to disk once the first operations have completed?
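
For reference, a sketch of the persist-then-checkpoint pattern, under the assumption (from the RDD.checkpoint scaladoc) that persisting lets the separate checkpointing job read the cached partitions instead of recomputing the lineage; it is only a sketch of that intent, not a claim that it removes the doubled work reported here, and the checkpoint directory path is hypothetical:
{code}
// In spark-shell (sc available); prevRDD stands in for the RDD above.
import org.apache.spark.storage.StorageLevel

sc.setCheckpointDir("hdfs:///tmp/checkpoints") // hypothetical path, required before checkpointing
val prevRDD = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

val newRDD = prevRDD
  .map(a => (a._1, 1L))
  .distinct()
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

newRDD.checkpoint()     // must be marked before the first action on newRDD
println(newRDD.count()) // job 1 computes the RDD and fills the cache;
                        // the follow-up checkpoint job then writes the checkpoint files
{code}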






[jira] [Created] (SPARK-8705) Javascript error in the web console when `totalExecutionTime` of a task is 0

2015-06-29 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-8705:
---

 Summary: Javascript error in the web console when 
`totalExecutionTime` of a task is 0
 Key: SPARK-8705
 URL: https://issues.apache.org/jira/browse/SPARK-8705
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Shixiong Zhu


Because System.currentTimeMillis() is not accurate for tasks that only need 
several milliseconds, sometimes totalExecutionTime in makeTimeline will be 0. 
If totalExecutionTime is 0, there will be the following error in the console.

!https://cloud.githubusercontent.com/assets/1000778/8406776/5cd38e04-1e92-11e5-89f2-0c5134fe4b6b.png!






[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)

2015-06-29 Thread Ted Malaska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605663#comment-14605663
 ] 

Ted Malaska commented on SPARK-2447:


Yeah, I have talked a lot with TD (Spark), Job H (HBase), and Stacks (HBase) about 
this. Neither thinks HBase or Spark is the right project to put it in.

Right now the code is in Cloudera Labs and on GitHub, works for CDH 5.3 and 
5.4, and we have a number of clients on it.

There is talk of making it an Apache project. It is Apache licensed, but it would 
be nice to put it under Apache entirely. The problem is that it is so simple 
that sometimes it feels too small to be its own project.  

The design is just to have an HBase connection in a static location in the 
executor.

I know other NoSQL stores brag about local gets, but HBase already had that even 
without SparkOnHBase.  The table input format already gives you local gets.

All SparkOnHBase gives you is an active connection that can be accessed in 
the distributed functions of Spark, which is very important to some use cases, 
like Spark Streaming and complex graph logic.

Let me know.  We are open to ideas.
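
A minimal sketch of that "connection in a static location" idea, assuming the HBase 1.x client API; the object name is hypothetical and this is not the SparkOnHBase code:
{code}
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

// Per-JVM singleton: the first task that touches it on an executor opens the
// connection; every later task on that executor reuses it.
object HBaseConnectionHolder {
  lazy val connection: Connection =
    ConnectionFactory.createConnection(HBaseConfiguration.create())
}

// Used from inside a distributed function, e.g. rdd.foreachPartition or
// DStream.foreachRDD:
//   val table = HBaseConnectionHolder.connection.getTable(TableName.valueOf("my_table"))
{code}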

 Add common solution for sending upsert actions to HBase (put, deletes, and 
 increment)
 -

 Key: SPARK-2447
 URL: https://issues.apache.org/jira/browse/SPARK-2447
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska

 Going to review the design with Tdas today.  
 But the first thought is to have an extension of VoidFunction that handles the 
 connection to HBase and allows for options such as turning auto-flush off for 
 higher throughput.
 Need to answer the following questions first.
 - Can it be written in Java or should it be written in Scala?
 - What is the best way to add the HBase dependency? (will review how Flume 
 does this as the first option)
 - What is the best way to do testing? (will review how Flume does this as the 
 first option)
 - How to support Python? (Python may be a different JIRA; it is unknown at 
 this time)
 Goals:
 - Simple to use
 - Stable
 - Supports high load
 - Documented (May be in a separate Jira need to ask Tdas)
 - Supports Java, Scala, and hopefully Python
 - Supports Streaming and normal Spark






[jira] [Created] (SPARK-8704) Add additional methods to wrappers in ml.pyspark.feature

2015-06-29 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-8704:
--

 Summary: Add additional methods to wrappers in ml.pyspark.feature
 Key: SPARK-8704
 URL: https://issues.apache.org/jira/browse/SPARK-8704
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Reporter: Manoj Kumar


std, mean to StandardScalerModel
getVectors, findSynonyms to Word2Vec Model
setFeatures and getFeatures to hashingTF






[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict

2015-06-29 Thread Rakesh Chalasani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605639#comment-14605639
 ] 

Rakesh Chalasani commented on SPARK-8587:
-

Hi Sam,

computeCost now returns the cumulative cost over a dataset, rather than the cost 
per sample, which I think is what this JIRA is about. Internally, predict does compute 
the distance to the nearest center but returns only the predicted center. So adding 
a method that returns distances is doing the job twice, which is what was 
pointed out above by Bradley. In Pipelines, on the other hand, this can be handled 
more gracefully and efficiently by adding a column to the returned DataFrame. 

If that is good for you, can you close this JIRA? I will create another one for 
adding distances to the KMeans pipeline once that is merged. Thanks.

 Return cost and cluster index KMeansModel.predict
 -

 Key: SPARK-8587
 URL: https://issues.apache.org/jira/browse/SPARK-8587
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Sam Stoelinga
Priority: Minor

 Looking at the PySpark implementation of KMeansModel.predict 
 (https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102):
 Currently:
 it calculates the cost of the closest cluster and returns the index only.
 My expectation:
 an easy way to let the same function, or a new function, return the cost along with 
 the index.
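
A sketch of one way to get both values today with the public model API (written against the Scala/MLlib KMeansModel rather than the PySpark wrapper; "cost" here means the squared distance to the closest center, which is what computeCost sums):
{code}
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Return (index of the closest cluster, squared distance to that center).
def predictWithCost(model: KMeansModel, point: Vector): (Int, Double) = {
  val costs = model.clusterCenters.map(center => Vectors.sqdist(center, point))
  val best = costs.indices.minBy(i => costs(i))
  (best, costs(best))
}
{code}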






[jira] [Commented] (SPARK-8705) Javascript error in the web console when `totalExecutionTime` of a task is 0

2015-06-29 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605646#comment-14605646
 ] 

Shixiong Zhu commented on SPARK-8705:
-

A simple fix is to not add {{rect}}s to the {{svg}} when {{totalExecutionTime}} is 
0 in 
https://github.com/apache/spark/blob/04ddcd4db7801abefa9c9effe5d88413b29d713b/core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala#L599

This conflicts with https://github.com/apache/spark/pull/7082, so I will send 
a PR after PR #7082 is merged.
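
The guard could look roughly like this (a hypothetical sketch, not the actual StagePage code; the helper and attribute names are made up):
{code}
// Only emit the timeline <rect> when the task's measured duration is non-zero,
// so the proportions derived from totalExecutionTime never divide by zero.
def timelineRect(taskId: Long, launchTime: Long, runTime: Long, totalExecutionTime: Long): String = {
  if (totalExecutionTime <= 0) {
    "" // zero-length task: add nothing to the svg
  } else {
    val runPct = 100.0 * runTime / totalExecutionTime
    s"""<rect class="task" data-task="$taskId" x="$launchTime" data-run-pct="$runPct"/>"""
  }
}
{code}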

 Javascript error in the web console when `totalExecutionTime` of a task is 0
 

 Key: SPARK-8705
 URL: https://issues.apache.org/jira/browse/SPARK-8705
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Shixiong Zhu

 Because System.currentTimeMillis() is not accurate for tasks that only need 
 several milliseconds, sometimes totalExecutionTime in makeTimeline will be 0. 
 If totalExecutionTime is 0, there will be the following error in the console. 
 !https://cloud.githubusercontent.com/assets/1000778/8406776/5cd38e04-1e92-11e5-89f2-0c5134fe4b6b.png!






[jira] [Commented] (SPARK-8660) Update comments that contain R statements in ml.logisticRegressionSuite

2015-06-29 Thread somil deshmukh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605481#comment-14605481
 ] 

somil deshmukh commented on SPARK-8660:
---

In LogisticRegressionSuite, I will replace the /** comment with /*, like 
this:

/*
  Using the following R code to load the data and train the model using 
glmnet package.

  library(glmnet)
  data <- read.csv("path", header=FALSE)
  label = factor(data$V1)
  features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5))
  weights = coef(glmnet(features,label, family="binomial", alpha = 0, 
lambda = 0))
  weights

  5 x 1 sparse Matrix of class "dgCMatrix"
 s0
  (Intercept)  2.8366423
  data.V2 -0.5895848
  data.V3  0.8931147
  data.V4 -0.3925051
  data.V5 -0.7996864
 */

 Update comments that contain R statements in ml.logisticRegressionSuite
 ---

 Key: SPARK-8660
 URL: https://issues.apache.org/jira/browse/SPARK-8660
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Priority: Trivial
  Labels: starter
   Original Estimate: 20m
  Remaining Estimate: 20m

 We put R statements as comments in unit tests. However, there are two issues:
 1. JavaDoc style /** ... */ is used instead of a normal multiline comment /* 
 ... */.
 2. We put a leading * on each line. It is hard to copy & paste the commands 
 to/from R and verify the result.
 For example, in 
 https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala#L504
 {code}
 /**
  * Using the following R code to load the data and train the model using 
 glmnet package.
  *
  *  library(glmnet)
  *  data <- read.csv("path", header=FALSE)
  *  label = factor(data$V1)
  *  features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5))
  *  weights = coef(glmnet(features,label, family="binomial", alpha = 
 1.0, lambda = 6.0))
  *  weights
  * 5 x 1 sparse Matrix of class "dgCMatrix"
  *  s0
  * (Intercept) -0.2480643
  * data.V2  0.000
  * data.V3   .
  * data.V4   .
  * data.V5   .
  */
 {code}
 should change to
 {code}
 /*
   Using the following R code to load the data and train the model using 
 glmnet package.
  
   library(glmnet)
   data <- read.csv("path", header=FALSE)
   label = factor(data$V1)
   features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5))
   weights = coef(glmnet(features,label, family="binomial", alpha = 1.0, 
 lambda = 6.0))
   weights
   5 x 1 sparse Matrix of class "dgCMatrix"
s0
   (Intercept) -0.2480643
   data.V2  0.000
   data.V3   .
   data.V4   .
   data.V5   .
 */
 {code}






[jira] [Created] (SPARK-8703) Add CountVectorizer as a ml transformer to convert document to words count vector

2015-06-29 Thread yuhao yang (JIRA)
yuhao yang created SPARK-8703:
-

 Summary: Add CountVectorizer as a ml transformer to convert 
document to words count vector
 Key: SPARK-8703
 URL: https://issues.apache.org/jira/browse/SPARK-8703
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: yuhao yang


Converts a text document to a sparse vector of token counts.

I can further add an estimator to extract the vocabulary from the corpus if that's 
appropriate.
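
A minimal sketch of the core counting step, given a fixed term-to-index vocabulary (the ml Transformer/Estimator wiring proposed here is omitted):
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Turn a tokenized document into a sparse vector of token counts, given a
// fixed vocabulary mapping each term to a column index.
def countVectorize(tokens: Seq[String], vocabulary: Map[String, Int]): Vector = {
  val counts = scala.collection.mutable.HashMap.empty[Int, Double]
  tokens.foreach { term =>
    vocabulary.get(term).foreach { idx =>          // ignore out-of-vocabulary terms
      counts(idx) = counts.getOrElse(idx, 0.0) + 1.0
    }
  }
  Vectors.sparse(vocabulary.size, counts.toSeq)
}

// countVectorize(Seq("a", "b", "a"), Map("a" -> 0, "b" -> 1))  =>  (2,[0,1],[2.0,1.0])
{code}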








[jira] [Created] (SPARK-8702) Avoid massive concating strings in Javascript

2015-06-29 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-8702:
---

 Summary: Avoid massive concating strings in Javascript
 Key: SPARK-8702
 URL: https://issues.apache.org/jira/browse/SPARK-8702
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Reporter: Shixiong Zhu


When there are massive tasks, such as {{sc.parallelize(1 to 10, 
1).count()}}, the generated JS code has a lot of string concatenations in 
the stage page, nearly 40 string concatenations for one task.

We can generate the whole string for a task instead of executing string 
concatenations in the browser.
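
A hypothetical illustration of "generate the whole string for a task" on the Scala side (the case class and field names are made up; only the shape matters):
{code}
case class TaskTimelineInfo(taskId: Long, executorId: String, launchTime: Long, duration: Long)

// Build the complete timeline fragment for one task as a single Scala string,
// so the page ships one literal per task instead of dozens of `+`
// concatenations executed in the browser.
def taskTimelineJson(t: TaskTimelineInfo): String =
  s"""{"id":${t.taskId},"executor":"${t.executorId}","start":${t.launchTime},"duration":${t.duration}}"""

// val allTasksJson = tasks.map(taskTimelineJson).mkString("[", ",", "]")
{code}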






[jira] [Updated] (SPARK-7398) Add back-pressure to Spark Streaming

2015-06-29 Thread Iulian Dragos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Iulian Dragos updated SPARK-7398:
-
Description: 
Spark Streaming has trouble dealing with situations where 
 batch processing time > batch interval
Meaning a high throughput of input data w.r.t. Spark's ability to remove data 
from the queue.

If this throughput is sustained for long enough, it leads to an unstable 
situation where the memory of the Receiver's Executor is overflowed.

This aims at transmitting a back-pressure signal back to data ingestion to help 
with dealing with that high throughput, in a backwards-compatible way.

The original design doc can be found here:
https://docs.google.com/document/d/1ZhiP_yBHcbjifz8nJEyPJpHqxB1FT6s8-Zk7sAfayQw/edit?usp=sharing

The second design doc (without all the background info, and more centered on 
the implementation) can be found here:
https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing

  was:
Spark Streaming has trouble dealing with situations where 
 batch processing time > batch interval
Meaning a high throughput of input data w.r.t. Spark's ability to remove data 
from the queue.

If this throughput is sustained for long enough, it leads to an unstable 
situation where the memory of the Receiver's Executor is overflowed.

This aims at transmitting a back-pressure signal back to data ingestion to help 
with dealing with that high throughput, in a backwards-compatible way.

The design doc can be found here:
https://docs.google.com/document/d/1ZhiP_yBHcbjifz8nJEyPJpHqxB1FT6s8-Zk7sAfayQw/edit?usp=sharing


 Add back-pressure to Spark Streaming
 

 Key: SPARK-7398
 URL: https://issues.apache.org/jira/browse/SPARK-7398
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.1
Reporter: François Garillot
Priority: Critical
  Labels: streams

 Spark Streaming has trouble dealing with situations where 
  batch processing time > batch interval
 Meaning a high throughput of input data w.r.t. Spark's ability to remove data 
 from the queue.
 If this throughput is sustained for long enough, it leads to an unstable 
 situation where the memory of the Receiver's Executor is overflowed.
 This aims at transmitting a back-pressure signal back to data ingestion to 
 help with dealing with that high throughput, in a backwards-compatible way.
 The original design doc can be found here:
 https://docs.google.com/document/d/1ZhiP_yBHcbjifz8nJEyPJpHqxB1FT6s8-Zk7sAfayQw/edit?usp=sharing
 The second design doc (without all the background info, and more centered on 
 the implementation) can be found here:
 https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing
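
As a purely conceptual sketch of the back-pressure idea (not the design in the linked docs): shrink the ingestion rate when the last batch took longer than the batch interval, and grow it back slowly otherwise.
{code}
// Conceptual only: compute the next per-receiver rate limit from the last batch.
def nextRateLimit(currentRate: Double, processingTimeMs: Long, batchIntervalMs: Long): Double = {
  if (processingTimeMs > batchIntervalMs) {
    currentRate * batchIntervalMs.toDouble / processingTimeMs // back off proportionally
  } else {
    currentRate * 1.1 // cautiously ramp back up
  }
}
{code}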






[jira] [Commented] (SPARK-7398) Add back-pressure to Spark Streaming

2015-06-29 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605338#comment-14605338
 ] 

Iulian Dragos commented on SPARK-7398:
--

[~tdas] here it is: 
https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing

 Add back-pressure to Spark Streaming
 

 Key: SPARK-7398
 URL: https://issues.apache.org/jira/browse/SPARK-7398
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.1
Reporter: François Garillot
Priority: Critical
  Labels: streams

 Spark Streaming has trouble dealing with situations where 
  batch processing time > batch interval
 Meaning a high throughput of input data w.r.t. Spark's ability to remove data 
 from the queue.
 If this throughput is sustained for long enough, it leads to an unstable 
 situation where the memory of the Receiver's Executor is overflowed.
 This aims at transmitting a back-pressure signal back to data ingestion to 
 help with dealing with that high throughput, in a backwards-compatible way.
 The design doc can be found here:
 https://docs.google.com/document/d/1ZhiP_yBHcbjifz8nJEyPJpHqxB1FT6s8-Zk7sAfayQw/edit?usp=sharing






[jira] [Commented] (SPARK-8374) Job frequently hangs after YARN preemption

2015-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605457#comment-14605457
 ] 

Apache Spark commented on SPARK-8374:
-

User 'xuchenCN' has created a pull request for this issue:
https://github.com/apache/spark/pull/7083

 Job frequently hangs after YARN preemption
 --

 Key: SPARK-8374
 URL: https://issues.apache.org/jira/browse/SPARK-8374
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.4.0
 Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04
Reporter: Shay Rojansky
Priority: Critical

 After upgrading to Spark 1.4.0, jobs that get preempted very frequently will 
 not reacquire executors and will therefore hang. To reproduce:
 1. I run Spark job A that acquires all grid resources
 2. I run Spark job B in a higher-priority queue that acquires all grid 
 resources. Job A is fully preempted.
 3. Kill job B, releasing all resources
 4. Job A should at this point reacquire all grid resources, but occasionally 
 doesn't. Repeating the preemption scenario makes the bad behavior occur 
 within a few attempts.
 (see logs at bottom).
 Note issue SPARK-7451, which was supposed to fix some Spark YARN preemption 
 issues; the work there may be related to these new issues.
 The 1.4.0 preemption situation is considerably worse than 1.3.1 (we've 
 downgraded to 1.3.1 just because of this issue).
 Logs
 --
 When job B (the preemptor) first acquires an application master, the following 
 is logged by job A (the preemptee):
 {noformat}
 ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc 
 client disassociated
 INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0
 WARN ReliableDeliverySupervisor: Association with remote system 
 [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address 
 is now gated for [5000] ms. Reason is: [Disassociated].
 WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, 
 g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost)
 INFO DAGScheduler: Executor lost: 447 (epoch 0)
 INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from 
 BlockManagerMaster.
 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, 
 g023.grid.eaglerd.local, 41406)
 INFO BlockManagerMaster: Removed 447 successfully in removeExecutor
 {noformat}
 (It's strange for errors/warnings to be logged for preemption)
 Later, when job B's AM starts requesting its resources, I get lots of the 
 following in job A:
 {noformat}
 ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc 
 client disassociated
 INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0
 WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, 
 g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost)
 WARN ReliableDeliverySupervisor: Association with remote system 
 [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address 
 is now gated for [5000] ms. Reason is: [Disassociated].
 {noformat}
 Finally, when I kill job B, job A emits lots of the following:
 {noformat}
 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31
 WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist!
 {noformat}
 And finally after some time:
 {noformat}
 WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 
 165964 ms exceeds timeout 120000 ms
 ERROR YarnScheduler: Lost an executor 466 (already removed): Executor 
 heartbeat timed out after 165964 ms
 {noformat}
 At this point the job never requests/acquires more resources and hangs.






[jira] [Commented] (SPARK-8310) Spark EC2 branch in 1.4 is wrong

2015-06-29 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605411#comment-14605411
 ] 

Daniel Darabos commented on SPARK-8310:
---

It's an easy mistake to make, and one of the few things that are not covered by 
the release candidate process. We tested the release candidate on EC2, but we 
had to specifically override the version, since at that point there was no 
released 1.4.0. I have no idea how this could be avoided for future releases.

 Spark EC2 branch in 1.4 is wrong
 

 Key: SPARK-8310
 URL: https://issues.apache.org/jira/browse/SPARK-8310
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.4.0
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Critical
 Fix For: 1.4.1, 1.5.0


 It points to `branch-1.3` of spark-ec2 right now while it should point to 
 `branch-1.4`
 cc [~brdwrd] [~pwendell]






[jira] [Commented] (SPARK-8374) Job frequently hangs after YARN preemption

2015-06-29 Thread Xu Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605422#comment-14605422
 ] 

Xu Chen commented on SPARK-8374:


It seems the AM didn't add a ContainerRequest after the resource was preempted.
I can provide a patch for this issue; could you help me test it? A sketch of that idea against the plain Hadoop AMRMClient API follows below (this is not Spark's YarnAllocator code, and the resource/priority plumbing is assumed): when a container exits with the PREEMPTED status, put an equivalent ContainerRequest back so the AM asks for a replacement.
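{code}
import java.util.{List => JList}
import scala.collection.JavaConverters._
import org.apache.hadoop.yarn.api.records.{ContainerExitStatus, ContainerStatus, Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest

// When a container was lost to preemption, re-add a request with the same
// resource profile so the RM eventually hands back a replacement executor.
def handleCompletedContainers(
    amClient: AMRMClient[ContainerRequest],
    completed: JList[ContainerStatus],
    executorResource: Resource,
    priority: Priority): Unit = {
  completed.asScala.foreach { status =>
    if (status.getExitStatus == ContainerExitStatus.PREEMPTED) {
      amClient.addContainerRequest(new ContainerRequest(executorResource, null, null, priority))
    }
  }
}
{code}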



 Job frequently hangs after YARN preemption
 --

 Key: SPARK-8374
 URL: https://issues.apache.org/jira/browse/SPARK-8374
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.4.0
 Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04
Reporter: Shay Rojansky
Priority: Critical

 After upgrading to Spark 1.4.0, jobs that get preempted very frequently will 
 not reacquire executors and will therefore hang. To reproduce:
 1. I run Spark job A that acquires all grid resources
 2. I run Spark job B in a higher-priority queue that acquires all grid 
 resources. Job A is fully preempted.
 3. Kill job B, releasing all resources
 4. Job A should at this point reacquire all grid resources, but occasionally 
 doesn't. Repeating the preemption scenario makes the bad behavior occur 
 within a few attempts.
 (see logs at bottom).
 Note issue SPARK-7451, which was supposed to fix some Spark YARN preemption 
 issues; the work there may be related to these new issues.
 The 1.4.0 preemption situation is considerably worse than 1.3.1 (we've 
 downgraded to 1.3.1 just because of this issue).
 Logs
 --
 When job B (the preemptor) first acquires an application master, the following 
 is logged by job A (the preemptee):
 {noformat}
 ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc 
 client disassociated
 INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0
 WARN ReliableDeliverySupervisor: Association with remote system 
 [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address 
 is now gated for [5000] ms. Reason is: [Disassociated].
 WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, 
 g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost)
 INFO DAGScheduler: Executor lost: 447 (epoch 0)
 INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from 
 BlockManagerMaster.
 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, 
 g023.grid.eaglerd.local, 41406)
 INFO BlockManagerMaster: Removed 447 successfully in removeExecutor
 {noformat}
 (It's strange for errors/warnings to be logged for preemption)
 Later, when job B's AM starts requesting its resources, I get lots of the 
 following in job A:
 {noformat}
 ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc 
 client disassociated
 INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0
 WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, 
 g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost)
 WARN ReliableDeliverySupervisor: Association with remote system 
 [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address 
 is now gated for [5000] ms. Reason is: [Disassociated].
 {noformat}
 Finally, when I kill job B, job A emits lots of the following:
 {noformat}
 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31
 WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist!
 {noformat}
 And finally after some time:
 {noformat}
 WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 
 165964 ms exceeds timeout 120000 ms
 ERROR YarnScheduler: Lost an executor 466 (already removed): Executor 
 heartbeat timed out after 165964 ms
 {noformat}
 At this point the job never requests/acquires more resources and hangs.






[jira] [Commented] (SPARK-8661) Update comments that contain R statements in ml.LinearRegressionSuite

2015-06-29 Thread somil deshmukh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605483#comment-14605483
 ] 

somil deshmukh commented on SPARK-8661:
---

In LinearRegressionSuite, I can replace /** with /*, like this:

/*
  Using the following R code to load the data and train the model using 
glmnet package.

 library(glmnet)
 data <- read.csv("path", header=FALSE, stringsAsFactors=FALSE)
 features <- as.matrix(data.frame(as.numeric(data$V2), as.numeric(data$V3)))
 label <- as.numeric(data$V1)
 weights <- coef(glmnet(features, label, family="gaussian", alpha = 0, 
lambda = 0)) 
 weights

 3 x 1 sparse Matrix of class "dgCMatrix"
 s0
 (Intercept) 6.300528
 as.numeric.data.V2. 4.701024
 as.numeric.data.V3. 7.198257
 */

Do you want me to remove /** for each method, or only for this specific method?

 Update comments that contain R statements in ml.LinearRegressionSuite
 -

 Key: SPARK-8661
 URL: https://issues.apache.org/jira/browse/SPARK-8661
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
  Labels: starter
   Original Estimate: 20m
  Remaining Estimate: 20m

 Similar to SPARK-8660, but for ml.LinearRegressionSuite: 
 https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala.






[jira] [Commented] (SPARK-8703) Add CountVectorizer as a ml transformer to convert document to words count vector

2015-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605503#comment-14605503
 ] 

Apache Spark commented on SPARK-8703:
-

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/7084

 Add CountVectorizer as a ml transformer to convert document to words count 
 vector
 -

 Key: SPARK-8703
 URL: https://issues.apache.org/jira/browse/SPARK-8703
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: yuhao yang
   Original Estimate: 24h
  Remaining Estimate: 24h

 Converts a text document to a sparse vector of token counts.
 I can further add an estimator to extract vocabulary from corpus if that's 
 appropriate.






[jira] [Updated] (SPARK-8702) Avoid massive concating strings in Javascript

2015-06-29 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-8702:
--
Assignee: Shixiong Zhu

 Avoid massive concating strings in Javascript
 -

 Key: SPARK-8702
 URL: https://issues.apache.org/jira/browse/SPARK-8702
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu

 When there are massive tasks, such as {{sc.parallelize(1 to 10, 
 1).count()}}, the generated JS code has a lot of string concatenations 
 in the stage page, nearly 40 string concatenations for one task.
 We can generate the whole string for a task instead of executing string 
 concatenations in the browser.






[jira] [Commented] (SPARK-8385) java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation

2015-06-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605712#comment-14605712
 ] 

Ángel Álvarez commented on SPARK-8385:
--

A simple WordCount test worked fine in my Eclipse environment with Spark 1.4 
(in both local and yarn-cluster modes). Make sure you don't have any reference 
to the previous 1.3 version in your project and launch configuration.



 java.lang.UnsupportedOperationException: Not implemented by the TFS 
 FileSystem implementation
 -

 Key: SPARK-8385
 URL: https://issues.apache.org/jira/browse/SPARK-8385
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.4.0
 Environment: RHEL 7.1
Reporter: Peter Haumer

 I used to be able to debug my Spark apps in Eclipse. With Spark 1.3.1 I 
 created a launch and just set the VM var -Dspark.master=local[4].  
 With 1.4 this stopped working when reading files from the OS filesystem. 
 Running the same apps with spark-submit works fine.  Losing the ability to 
 debug that way has a major impact on the usability of Spark.
 The following exception is thrown:
 Exception in thread "main" java.lang.UnsupportedOperationException: Not 
 implemented by the TFS FileSystem implementation
 at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:213)
 at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2401)
 at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2411)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
 at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:166)
 at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:653)
 at 
 org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:389)
 at 
 org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362)
 at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762)
 at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at scala.Option.map(Option.scala:145)
 at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1535)
 at org.apache.spark.rdd.RDD.reduce(RDD.scala:900)
 at org.apache.spark.api.java.JavaRDDLike$class.reduce(JavaRDDLike.scala:357)
 at org.apache.spark.api.java.AbstractJavaRDDLike.reduce(JavaRDDLike.scala:46)
 at com.databricks.apps.logs.LogAnalyzer.main(LogAnalyzer.java:60)






[jira] [Commented] (SPARK-7894) Graph Union Operator

2015-06-29 Thread Arnab (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605739#comment-14605739
 ] 

Arnab commented on SPARK-7894:
--

Short description of changes:

- Introduced union functionality in the EdgeRDD, VertexRDD and Graph classes (there 
is no union functionality in EdgeRDD and VertexRDD directly, as pointed out by 
shijinkui)
- Added code for merging partitions in edge and vertex partitions
- Added a test case for graph union (as in the JIRA), plus unit tests for union of 
edges and vertices

 Graph Union Operator
 

 Key: SPARK-7894
 URL: https://issues.apache.org/jira/browse/SPARK-7894
 Project: Spark
  Issue Type: Sub-task
  Components: GraphX
Reporter: Andy Huang
  Labels: graph, union
 Attachments: union_operator.png


 This operator aims to union two graphs and generate a new graph directly. The 
 union of two graphs is the union of their vertex sets and their edge 
 families. Vertices and edges which are included in either graph will be part 
 of the new graph.
 bq. G ∪ H = (V_G ∪ V_H, E_G ∪ E_H).
 The image below shows a union of graph G and graph H
 !union_operator.png|width=600px,align=center!
 A simple interface would be:
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED]): Graph[VD, ED]
 However, overlapping vertices and edges will inevitably occur at the borders of 
 the graphs. For vertices, it's quite natural to just take the union and remove 
 the duplicates. But for edges, a mergeEdges function seems more reasonable.
 bq. def union[VD: ClassTag, ED: ClassTag](other: Graph[VD, ED], mergeEdges: 
 (ED, ED) => ED): Graph[VD, ED]
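
For illustration, here is a minimal sketch of such an operator built only from 
public GraphX APIs (this is not the attached patch; duplicate vertex attributes 
are kept arbitrarily from the first graph, and parallel edges are combined with 
mergeEdges):

{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx.{Edge, Graph}

def unionGraphs[VD: ClassTag, ED: ClassTag](
    g: Graph[VD, ED],
    h: Graph[VD, ED],
    mergeEdges: (ED, ED) => ED): Graph[VD, ED] = {
  // Union of the vertex sets: keep a single attribute per vertex id.
  val vertices = (g.vertices ++ h.vertices).reduceByKey((a, _) => a)
  // Union of the edge sets: edges sharing the same (src, dst) pair are merged.
  val edges = (g.edges ++ h.edges)
    .map(e => ((e.srcId, e.dstId), e.attr))
    .reduceByKey(mergeEdges)
    .map { case ((src, dst), attr) => Edge(src, dst, attr) }
  Graph(vertices, edges)
}
{code}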



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8385) java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation

2015-06-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8385.
--
Resolution: Cannot Reproduce

 java.lang.UnsupportedOperationException: Not implemented by the TFS 
 FileSystem implementation
 -

 Key: SPARK-8385
 URL: https://issues.apache.org/jira/browse/SPARK-8385
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.4.0
 Environment: RHEL 7.1
Reporter: Peter Haumer

 I used to be able to debug my Spark apps in Eclipse. With Spark 1.3.1 I 
 created a launch and just set the vm var -Dspark.master=local[4].  
 With 1.4 this stopped working when reading files from the OS filesystem. 
 Running the same apps with spark-submit works fine.  Losing the ability to 
 debug that way has a major impact on the usability of Spark.
 The following exception is thrown:
 Exception in thread "main" java.lang.UnsupportedOperationException: Not 
 implemented by the TFS FileSystem implementation
 at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:213)
 at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2401)
 at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2411)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
 at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:166)
 at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:653)
 at 
 org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:389)
 at 
 org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362)
 at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762)
 at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at scala.Option.map(Option.scala:145)
 at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1535)
 at org.apache.spark.rdd.RDD.reduce(RDD.scala:900)
 at org.apache.spark.api.java.JavaRDDLike$class.reduce(JavaRDDLike.scala:357)
 at org.apache.spark.api.java.AbstractJavaRDDLike.reduce(JavaRDDLike.scala:46)
 at com.databricks.apps.logs.LogAnalyzer.main(LogAnalyzer.java:60)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8385) java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation

2015-06-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605744#comment-14605744
 ] 

Ángel Álvarez edited comment on SPARK-8385 at 6/29/15 3:15 PM:
---

I could finally reproduce this same error in Eclipse (yarn-cluster mode) and it 
was due to a reference to the spark assembly 1.3 in my launch configuration. 


was (Author: angel2014):
I could finally reproduce this same error in Eclipse (yarn-cluster mode) and it 
was due a reference to the spark assembly 1.3 in my launch configuration. 

 java.lang.UnsupportedOperationException: Not implemented by the TFS 
 FileSystem implementation
 -

 Key: SPARK-8385
 URL: https://issues.apache.org/jira/browse/SPARK-8385
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.4.0
 Environment: RHEL 7.1
Reporter: Peter Haumer

 I used to be able to debug my Spark apps in Eclipse. With Spark 1.3.1 I 
 created a launch and just set the vm var -Dspark.master=local[4].  
 With 1.4 this stopped working when reading files from the OS filesystem. 
 Running the same apps with spark-submit works fine.  Losing the ability to 
 debug that way has a major impact on the usability of Spark.
 The following exception is thrown:
 Exception in thread "main" java.lang.UnsupportedOperationException: Not 
 implemented by the TFS FileSystem implementation
 at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:213)
 at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2401)
 at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2411)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
 at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:166)
 at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:653)
 at 
 org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:389)
 at 
 org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362)
 at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762)
 at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at scala.Option.map(Option.scala:145)
 at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1535)
 at org.apache.spark.rdd.RDD.reduce(RDD.scala:900)
 at org.apache.spark.api.java.JavaRDDLike$class.reduce(JavaRDDLike.scala:357)
 at org.apache.spark.api.java.AbstractJavaRDDLike.reduce(JavaRDDLike.scala:46)
 at com.databricks.apps.logs.LogAnalyzer.main(LogAnalyzer.java:60)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8680) PropagateTypes is very slow when there are lots of columns

2015-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605747#comment-14605747
 ] 

Apache Spark commented on SPARK-8680:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/7087

 PropagateTypes is very slow when there are lots of columns
 --

 Key: SPARK-8680
 URL: https://issues.apache.org/jira/browse/SPARK-8680
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1, 1.4.0
Reporter: Davies Liu

 The time for PropagateTypes is O(N*N), where N is the number of columns, which 
 is very slow if there are many columns (1000).
 The easiest optimization could be to put `q.inputSet` outside of 
 transformExpressions, which could give about a 4x improvement for N=3000.
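
A schematic illustration of the proposed change (not the actual Catalyst code; 
the names below are made up): hoist the per-plan-node computation out of the 
per-expression loop, so the rule does O(N) work instead of O(N^2).

{code}
case class PlanNode(columns: Seq[String], expressions: Seq[String])

// Before: the input set is rebuilt for every expression, i.e. O(N) work done N times.
def slowRule(q: PlanNode): Seq[Boolean] =
  q.expressions.map(e => q.columns.toSet.contains(e))

// After: the input set is built once per plan node and reused inside the loop.
def fastRule(q: PlanNode): Seq[Boolean] = {
  val inputSet = q.columns.toSet
  q.expressions.map(e => inputSet.contains(e))
}
{code}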



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8680) PropagateTypes is very slow when there are lots of columns

2015-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8680:
---

Assignee: (was: Apache Spark)

 PropagateTypes is very slow when there are lots of columns
 --

 Key: SPARK-8680
 URL: https://issues.apache.org/jira/browse/SPARK-8680
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1, 1.4.0
Reporter: Davies Liu

 The time for PropagateTypes is O(N*N), where N is the number of columns, which 
 is very slow if there are many columns (1000).
 The easiest optimization could be to put `q.inputSet` outside of 
 transformExpressions, which could give about a 4x improvement for N=3000.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8680) PropagateTypes is very slow when there are lots of columns

2015-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8680:
---

Assignee: Apache Spark

 PropagateTypes is very slow when there are lots of columns
 --

 Key: SPARK-8680
 URL: https://issues.apache.org/jira/browse/SPARK-8680
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1, 1.4.0
Reporter: Davies Liu
Assignee: Apache Spark

 The time for PropagateTypes is O(N*N), where N is the number of columns, which 
 is very slow if there are many columns (1000).
 The easiest optimization could be to put `q.inputSet` outside of 
 transformExpressions, which could give about a 4x improvement for N=3000.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8385) java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation

2015-06-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605744#comment-14605744
 ] 

Ángel Álvarez commented on SPARK-8385:
--

I could finally reproduce this same error in Eclipse (yarn-cluster mode) and it 
was due to a reference to the spark assembly 1.3 in my launch configuration. 

 java.lang.UnsupportedOperationException: Not implemented by the TFS 
 FileSystem implementation
 -

 Key: SPARK-8385
 URL: https://issues.apache.org/jira/browse/SPARK-8385
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.4.0
 Environment: RHEL 7.1
Reporter: Peter Haumer

 I used to be able to debug my Spark apps in Eclipse. With Spark 1.3.1 I 
 created a launch and just set the vm var -Dspark.master=local[4].  
 With 1.4 this stopped working when reading files from the OS filesystem. 
 Running the same apps with spark-submit works fine.  Losing the ability to 
 debug that way has a major impact on the usability of Spark.
 The following exception is thrown:
 Exception in thread "main" java.lang.UnsupportedOperationException: Not 
 implemented by the TFS FileSystem implementation
 at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:213)
 at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2401)
 at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2411)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
 at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:166)
 at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:653)
 at 
 org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:389)
 at 
 org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362)
 at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762)
 at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at 
 org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
 at scala.Option.map(Option.scala:145)
 at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1535)
 at org.apache.spark.rdd.RDD.reduce(RDD.scala:900)
 at org.apache.spark.api.java.JavaRDDLike$class.reduce(JavaRDDLike.scala:357)
 at org.apache.spark.api.java.AbstractJavaRDDLike.reduce(JavaRDDLike.scala:46)
 at com.databricks.apps.logs.LogAnalyzer.main(LogAnalyzer.java:60)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8599) Use a Random operator to handle Random distribution generating expressions

2015-06-29 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605777#comment-14605777
 ] 

Burak Yavuz commented on SPARK-8599:


It would be great if it works for this case as well. I think [~mengxr] was 
hitting the bug during the filter phase for sampleBy.

 Use a Random operator to handle Random distribution generating expressions
 --

 Key: SPARK-8599
 URL: https://issues.apache.org/jira/browse/SPARK-8599
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Priority: Critical

 Right now, we are using expressions for Random distribution generating 
 expressions. But, we have to track them in lots of places in the optimizer to 
 handle them carefully. Otherwise, these expressions will be treated as 
 stateless expressions and have unexpected behaviors (e.g. SPARK-8023). 
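
To make the hazard concrete, here is a small hypothetical example (assuming the 
1.4 DataFrame API, with sqlContext being the usual shell SQLContext): if the 
optimizer treats rand() as a stateless expression and re-evaluates it after a 
filter, the surviving rows can carry values that contradict the predicate.

{code}
import org.apache.spark.sql.functions.rand

val df = sqlContext.range(0, 1000).withColumn("r", rand(42))
// If "r" is recomputed downstream instead of being fixed per row, rows that
// passed this predicate may later show values of "r" that are >= 0.5.
val sampled = df.filter(df("r") < 0.5)
{code}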



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8702) Avoid massive concating strings in Javascript

2015-06-29 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-8702.
---
  Resolution: Fixed
   Fix Version/s: 1.5.0
Target Version/s: 1.5.0

 Avoid massive concating strings in Javascript
 -

 Key: SPARK-8702
 URL: https://issues.apache.org/jira/browse/SPARK-8702
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
 Fix For: 1.5.0


 When there are massive numbers of tasks, such as {{sc.parallelize(1 to 10, 
 1).count()}}, the generated JS code has a lot of string concatenations 
 in the stage page, nearly 40 string concatenations for one task.
 We can generate the whole string for a task instead of executing string 
 concatenations in the browser.
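
As a sketch of the intended direction (an assumed shape, not the merged patch): 
render the whole timeline entry for a task as a single string on the server, 
e.g. with a Scala string interpolator, instead of emitting dozens of JavaScript 
"+" concatenations per task.

{code}
// Hypothetical helper: one fully-formed string per task.
def taskTimelineEntry(taskId: Long, launchTime: Long, duration: Long): String =
  s"""{"id": $taskId, "start": $launchTime, "end": ${launchTime + duration}}"""
{code}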



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8706) Implement Pylint / Prospector checks for PySpark

2015-06-29 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-8706:
-

 Summary: Implement Pylint / Prospector checks for PySpark
 Key: SPARK-8706
 URL: https://issues.apache.org/jira/browse/SPARK-8706
 Project: Spark
  Issue Type: New Feature
  Components: Project Infra, PySpark
Reporter: Josh Rosen


It would be nice to implement Pylint / Prospector 
(https://github.com/landscapeio/prospector) checks for PySpark. As with the 
style checker rules, I imagine that we'll want to roll out new rules 
gradually in order to avoid a mass refactoring commit.

For starters, we should create a pull request that introduces the harness for 
running the linters, add a configuration file which enables only the lint 
checks that currently pass, and install the required dependencies on Jenkins. 
Once we've done this, we can open a series of smaller followup PRs to gradually 
enable more linting checks and to fix existing violations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8703) Add CountVectorizer as a ml transformer to convert document to words count vector

2015-06-29 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605902#comment-14605902
 ] 

Liang-Chi Hsieh commented on SPARK-8703:


Does org.apache.spark.mllib.feature.HashingTF already provide a similar function? 
If so, can this ml transformer reuse it?

 Add CountVectorizer as a ml transformer to convert document to words count 
 vector
 -

 Key: SPARK-8703
 URL: https://issues.apache.org/jira/browse/SPARK-8703
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: yuhao yang
   Original Estimate: 24h
  Remaining Estimate: 24h

 Converts a text document to a sparse vector of token counts. Similar to 
 http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
 I can further add an estimator to extract vocabulary from corpus if that's 
 appropriate.
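
A minimal sketch of the idea (not the proposed ml API; the vocabulary map is 
assumed to be given): map each token to a vocabulary index and count occurrences 
into an MLlib sparse vector.

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

def countVectorize(tokens: Seq[String], vocabulary: Map[String, Int]): Vector = {
  // Drop out-of-vocabulary tokens, then count occurrences per vocabulary index.
  val counts = tokens.flatMap(vocabulary.get)
    .groupBy(identity)
    .map { case (idx, occurrences) => (idx, occurrences.size.toDouble) }
  Vectors.sparse(vocabulary.size, counts.toSeq)
}
{code}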



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8693) profiles and goals are not printed in a nice way

2015-06-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-8693:
--
Assignee: Brennon York

 profiles and goals are not printed in a nice way
 

 Key: SPARK-8693
 URL: https://issues.apache.org/jira/browse/SPARK-8693
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Project Infra
Affects Versions: 1.5.0
Reporter: Yin Huai
Assignee: Brennon York
Priority: Minor
 Fix For: 1.5.0


 In our master build, I see
 {code}
 -Phadoop-1[info] Building Spark (w/Hive 0.13.1) using SBT with these 
 arguments:  -Dhadoop.version=1.0.4[info] Building Spark (w/Hive 0.13.1) using 
 SBT with these arguments:  -Pkinesis-asl[info] Building Spark (w/Hive 0.13.1) 
 using SBT with these arguments:  -Phive-thriftserver[info] Building Spark 
 (w/Hive 0.13.1) using SBT with these arguments:  -Phive[info] Building Spark 
 (w/Hive 0.13.1) using SBT with these arguments:  package[info] Building Spark 
 (w/Hive 0.13.1) using SBT with these arguments:  assembly/assembly[info] 
 Building Spark (w/Hive 0.13.1) using SBT with these arguments:  
 streaming-kafka-assembly/assembly
 {code}
 Seems we format the string in a wrong way?
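
The log suggests the banner is being emitted once per argument rather than once 
per invocation. Purely as an illustration of that symptom (the real logic lives 
in the dev/ scripts, not in this snippet):

{code}
val args = Seq("-Phadoop-1", "-Dhadoop.version=1.0.4", "-Pkinesis-asl", "package")
val banner = "[info] Building Spark (w/Hive 0.13.1) using SBT with these arguments: "

// What the log shows: the banner interleaved with every single argument.
args.foreach(a => print(a + banner))

// What was presumably intended: one banner followed by the whole argument list.
println(banner + args.mkString(" "))
{code}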



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8686) DataFrame should support `where` with expression represented by String

2015-06-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8686:
-
Assignee: Kousuke Saruta

 DataFrame should support `where` with expression represented by String
 --

 Key: SPARK-8686
 URL: https://issues.apache.org/jira/browse/SPARK-8686
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Minor
 Fix For: 1.5.0


 DataFrame supports `filter` function with two types of argument, `Column` and 
 `String`. But `where` doesn't.
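
The change is presumably a small overload along these lines (a sketch of the 
assumed shape, not the actual patch), delegating to the existing String overload 
of `filter`; the enrichment class and method name below are made up.

{code}
import org.apache.spark.sql.DataFrame

implicit class WhereWithString(df: DataFrame) {
  // Same behaviour as df.filter(conditionExpr), exposed under a `where`-like name.
  def whereExpr(conditionExpr: String): DataFrame = df.filter(conditionExpr)
}

// Usage: df.whereExpr("age > 21") behaves like df.filter("age > 21").
{code}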



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8554) Add the SparkR document files to `.rat-excludes` for `./dev/check-license`

2015-06-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-8554:
--
Assignee: Yu Ishikawa

 Add the SparkR document files to `.rat-excludes` for `./dev/check-license`
 --

 Key: SPARK-8554
 URL: https://issues.apache.org/jira/browse/SPARK-8554
 Project: Spark
  Issue Type: Bug
  Components: SparkR, Tests
Reporter: Yu Ishikawa
Assignee: Yu Ishikawa
 Fix For: 1.5.0


 {noformat}
  ./dev/check-license | grep -v boto
 Could not find Apache license headers in the following files:
  !? /Users/01004981/local/src/spark/myspark/R/lib/SparkR/INDEX
  !? /Users/01004981/local/src/spark/myspark/R/lib/SparkR/help/AnIndex
  !? /Users/01004981/local/src/spark/myspark/R/lib/SparkR/html/00Index.html
  !? /Users/01004981/local/src/spark/myspark/R/lib/SparkR/html/R.css
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/DataFrame.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/GroupedData.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/agg.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/arrange.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/cache-methods.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/cacheTable.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/cancelJobGroup.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/clearCache.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/clearJobGroup.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/collect-methods.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/column.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/columns.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/count.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/createDataFrame.Rd
  !? 
 /Users/01004981/local/src/spark/myspark/R/pkg/man/createExternalTable.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/describe.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/distinct.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/dropTempTable.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/dtypes.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/except.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/explain.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/filter.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/first.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/groupBy.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/hashCode.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/head.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/infer_type.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/insertInto.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/intersect.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/isLocal.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/join.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/jsonFile.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/limit.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/nafunctions.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/parquetFile.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/persist.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/print.jobj.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/print.structField.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/print.structType.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/printSchema.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/read.df.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/registerTempTable.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/repartition.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/sample.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/saveAsParquetFile.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/saveAsTable.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/schema.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/select.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/selectExpr.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/setJobGroup.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/show.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/showDF.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/sparkR.init.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/sparkR.stop.Rd
  !? 

[jira] [Resolved] (SPARK-8554) Add the SparkR document files to `.rat-excludes` for `./dev/check-license`

2015-06-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-8554.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6947
[https://github.com/apache/spark/pull/6947]

 Add the SparkR document files to `.rat-excludes` for `./dev/check-license`
 --

 Key: SPARK-8554
 URL: https://issues.apache.org/jira/browse/SPARK-8554
 Project: Spark
  Issue Type: Bug
  Components: SparkR, Tests
Reporter: Yu Ishikawa
 Fix For: 1.5.0


 {noformat}
  ./dev/check-license | grep -v boto
 Could not find Apache license headers in the following files:
  !? /Users/01004981/local/src/spark/myspark/R/lib/SparkR/INDEX
  !? /Users/01004981/local/src/spark/myspark/R/lib/SparkR/help/AnIndex
  !? /Users/01004981/local/src/spark/myspark/R/lib/SparkR/html/00Index.html
  !? /Users/01004981/local/src/spark/myspark/R/lib/SparkR/html/R.css
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/DataFrame.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/GroupedData.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/agg.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/arrange.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/cache-methods.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/cacheTable.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/cancelJobGroup.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/clearCache.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/clearJobGroup.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/collect-methods.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/column.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/columns.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/count.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/createDataFrame.Rd
  !? 
 /Users/01004981/local/src/spark/myspark/R/pkg/man/createExternalTable.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/describe.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/distinct.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/dropTempTable.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/dtypes.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/except.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/explain.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/filter.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/first.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/groupBy.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/hashCode.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/head.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/infer_type.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/insertInto.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/intersect.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/isLocal.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/join.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/jsonFile.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/limit.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/nafunctions.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/parquetFile.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/persist.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/print.jobj.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/print.structField.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/print.structType.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/printSchema.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/read.df.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/registerTempTable.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/repartition.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/sample.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/saveAsParquetFile.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/saveAsTable.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/schema.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/select.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/selectExpr.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/setJobGroup.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/show.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/showDF.Rd
  !? /Users/01004981/local/src/spark/myspark/R/pkg/man/sparkR.init.Rd
  !? 

[jira] [Assigned] (SPARK-8705) Javascript error in the web console when `totalExecutionTime` of a task is 0

2015-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8705:
---

Assignee: (was: Apache Spark)

 Javascript error in the web console when `totalExecutionTime` of a task is 0
 

 Key: SPARK-8705
 URL: https://issues.apache.org/jira/browse/SPARK-8705
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Shixiong Zhu

 Because System.currentTimeMillis() is not accurate for tasks that only take 
 several milliseconds, sometimes totalExecutionTime in makeTimeline will be 0. 
 If totalExecutionTime is 0, there will be the following error in the console.
 !https://cloud.githubusercontent.com/assets/1000778/8406776/5cd38e04-1e92-11e5-89f2-0c5134fe4b6b.png!
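
A plausible shape for the fix (an assumption, not the submitted pull request) is 
simply to guard the proportion calculation against a zero denominator:

{code}
// Hypothetical helper: fraction of the task's wall-clock time spent in one phase.
def proportion(phaseTime: Long, totalExecutionTime: Long): Double =
  if (totalExecutionTime == 0) 0.0 else phaseTime.toDouble / totalExecutionTime
{code}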



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8705) Javascript error in the web console when `totalExecutionTime` of a task is 0

2015-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605870#comment-14605870
 ] 

Apache Spark commented on SPARK-8705:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/7088

 Javascript error in the web console when `totalExecutionTime` of a task is 0
 

 Key: SPARK-8705
 URL: https://issues.apache.org/jira/browse/SPARK-8705
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Shixiong Zhu

 Because System.currentTimeMillis() is not accurate for tasks that only take 
 several milliseconds, sometimes totalExecutionTime in makeTimeline will be 0. 
 If totalExecutionTime is 0, there will be the following error in the console.
 !https://cloud.githubusercontent.com/assets/1000778/8406776/5cd38e04-1e92-11e5-89f2-0c5134fe4b6b.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8693) profiles and goals are not printed in a nice way

2015-06-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-8693:
--
Affects Version/s: 1.5.0

 profiles and goals are not printed in a nice way
 

 Key: SPARK-8693
 URL: https://issues.apache.org/jira/browse/SPARK-8693
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Project Infra
Affects Versions: 1.5.0
Reporter: Yin Huai
Priority: Minor
 Fix For: 1.5.0


 In our master build, I see
 {code}
 -Phadoop-1[info] Building Spark (w/Hive 0.13.1) using SBT with these 
 arguments:  -Dhadoop.version=1.0.4[info] Building Spark (w/Hive 0.13.1) using 
 SBT with these arguments:  -Pkinesis-asl[info] Building Spark (w/Hive 0.13.1) 
 using SBT with these arguments:  -Phive-thriftserver[info] Building Spark 
 (w/Hive 0.13.1) using SBT with these arguments:  -Phive[info] Building Spark 
 (w/Hive 0.13.1) using SBT with these arguments:  package[info] Building Spark 
 (w/Hive 0.13.1) using SBT with these arguments:  assembly/assembly[info] 
 Building Spark (w/Hive 0.13.1) using SBT with these arguments:  
 streaming-kafka-assembly/assembly
 {code}
 Seems we format the string in a wrong way?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8693) profiles and goals are not printed in a nice way

2015-06-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-8693.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7085
[https://github.com/apache/spark/pull/7085]

 profiles and goals are not printed in a nice way
 

 Key: SPARK-8693
 URL: https://issues.apache.org/jira/browse/SPARK-8693
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Project Infra
Reporter: Yin Huai
Priority: Minor
 Fix For: 1.5.0


 In our master build, I see
 {code}
 -Phadoop-1[info] Building Spark (w/Hive 0.13.1) using SBT with these 
 arguments:  -Dhadoop.version=1.0.4[info] Building Spark (w/Hive 0.13.1) using 
 SBT with these arguments:  -Pkinesis-asl[info] Building Spark (w/Hive 0.13.1) 
 using SBT with these arguments:  -Phive-thriftserver[info] Building Spark 
 (w/Hive 0.13.1) using SBT with these arguments:  -Phive[info] Building Spark 
 (w/Hive 0.13.1) using SBT with these arguments:  package[info] Building Spark 
 (w/Hive 0.13.1) using SBT with these arguments:  assembly/assembly[info] 
 Building Spark (w/Hive 0.13.1) using SBT with these arguments:  
 streaming-kafka-assembly/assembly
 {code}
 Seems we format the string in a wrong way?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8670) Nested columns can't be referenced (but they can be selected)

2015-06-29 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605822#comment-14605822
 ] 

Nicholas Chammas commented on SPARK-8670:
-

Not sure. Does Scala offer the same flexibility in syntax as Python?

 Nested columns can't be referenced (but they can be selected)
 -

 Key: SPARK-8670
 URL: https://issues.apache.org/jira/browse/SPARK-8670
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Nicholas Chammas

 This is strange and looks like a regression from 1.3.
 {code}
 import json
 daterz = [
   {
 'name': 'Nick',
 'stats': {
   'age': 28
 }
   },
   {
 'name': 'George',
 'stats': {
   'age': 31
 }
   }
 ]
 df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))
 df.select('stats.age').show()
 df['stats.age']  # 1.4 fails on this line
 {code}
 On 1.3 this works and yields:
 {code}
 age
 28 
 31 
 Out[1]: Column<stats.age AS age#2958L>
 {code}
 On 1.4, however, this gives an error on the last line:
 {code}
 +---+
 |age|
 +---+
 | 28|
 | 31|
 +---+
 ---------------------------------------------------------------------------
 IndexError                                Traceback (most recent call last)
 <ipython-input-1-04bd990e94c6> in <module>()
      19 
      20 df.select('stats.age').show()
 ---> 21 df['stats.age']
 /path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
     678         if isinstance(item, basestring):
     679             if item not in self.columns:
 --> 680                 raise IndexError("no such column: %s" % item)
     681             jc = self._jdf.apply(item)
     682             return Column(jc)
 IndexError: no such column: stats.age
 {code}
 This means, among other things, that you can't join DataFrames on nested 
 columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8705) Javascript error in the web console when `totalExecutionTime` of a task is 0

2015-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8705:
---

Assignee: Apache Spark

 Javascript error in the web console when `totalExecutionTime` of a task is 0
 

 Key: SPARK-8705
 URL: https://issues.apache.org/jira/browse/SPARK-8705
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Shixiong Zhu
Assignee: Apache Spark

 Because System.currentTimeMillis() is not accurate for tasks that only take 
 several milliseconds, sometimes totalExecutionTime in makeTimeline will be 0. 
 If totalExecutionTime is 0, there will be the following error in the console.
 !https://cloud.githubusercontent.com/assets/1000778/8406776/5cd38e04-1e92-11e5-89f2-0c5134fe4b6b.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8707) RDD#toDebugString fails if any cached RDD has invalid partitions

2015-06-29 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-8707:
--
Summary: RDD#toDebugString fails if any cached RDD has invalid partitions  
(was: RDD#toDebugString fails if any cached RDD is invalid)

 RDD#toDebugString fails if any cached RDD has invalid partitions
 

 Key: SPARK-8707
 URL: https://issues.apache.org/jira/browse/SPARK-8707
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0, 1.4.1
Reporter: Aaron Davidson
  Labels: starter

 Repro:
 {code}
 sc.parallelize(0 until 100).toDebugString
 sc.textFile("/ThisFileDoesNotExist").cache()
 sc.parallelize(0 until 100).toDebugString
 {code}
 Output:
 {code}
 java.io.IOException: Not a file: /ThisFileDoesNotExist
   at 
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:215)
   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
   at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:59)
   at 
 org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
   at 
 org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.SparkContext.getRDDStorageInfo(SparkContext.scala:1455)
   at org.apache.spark.rdd.RDD.debugSelf$1(RDD.scala:1573)
   at org.apache.spark.rdd.RDD.firstDebugString$1(RDD.scala:1607)
   at org.apache.spark.rdd.RDD.toDebugString(RDD.scala:1637
 {code}
 This is because toDebugString gets all the partitions from all RDDs, which 
 fails (via SparkContext#getRDDStorageInfo). This pathway should definitely be 
 resilient to other RDDs being invalid (and getRDDStorageInfo should probably 
 also be).
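
One way to make the storage-info path tolerant of such RDDs (an assumption about 
the shape of a fix, not the actual change) is to swallow the failure when 
computing partitions of unrelated cached RDDs:

{code}
import scala.util.Try
import org.apache.spark.rdd.RDD

// Hypothetical helper: number of partitions, or 0 if they cannot be computed
// (e.g. a cached RDD that points at a path which no longer exists).
def safeNumPartitions(rdd: RDD[_]): Int =
  Try(rdd.partitions.length).getOrElse(0)
{code}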



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8707) RDD#toDebugString fails if any cached RDD has invalid partitions

2015-06-29 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-8707:
--
Description: 
Repro:

{code}
sc.textFile("/ThisFileDoesNotExist").cache()
sc.parallelize(0 until 100).toDebugString
{code}

Output:

{code}
java.io.IOException: Not a file: /ThisFileDoesNotExist
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:215)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:59)
at 
org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
at 
org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.SparkContext.getRDDStorageInfo(SparkContext.scala:1455)
at org.apache.spark.rdd.RDD.debugSelf$1(RDD.scala:1573)
at org.apache.spark.rdd.RDD.firstDebugString$1(RDD.scala:1607)
at org.apache.spark.rdd.RDD.toDebugString(RDD.scala:1637
{code}

This is because toDebugString gets all the partitions from all RDDs, which 
fails (via SparkContext#getRDDStorageInfo). This pathway should definitely be 
resilient to other RDDs being invalid (and getRDDStorageInfo should probably 
also be).

  was:
Repro:

{code}
sc.parallelize(0 until 100).toDebugString
sc.textFile("/ThisFileDoesNotExist").cache()
sc.parallelize(0 until 100).toDebugString
{code}

Output:

{code}
java.io.IOException: Not a file: /ThisFileDoesNotExist
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:215)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:59)
at 
org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
at 
org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.SparkContext.getRDDStorageInfo(SparkContext.scala:1455)
at org.apache.spark.rdd.RDD.debugSelf$1(RDD.scala:1573)
at org.apache.spark.rdd.RDD.firstDebugString$1(RDD.scala:1607)
at org.apache.spark.rdd.RDD.toDebugString(RDD.scala:1637
{code}

This is because toDebugString gets all the partitions from all RDDs, which 
fails (via SparkContext#getRDDStorageInfo). This pathway should definitely be 
resilient to other RDDs being invalid (and getRDDStorageInfo should probably 
also be).


 RDD#toDebugString fails if any cached RDD has invalid partitions
 

[jira] [Created] (SPARK-8707) RDD#toDebugString fails if any cached RDD is invalid

2015-06-29 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-8707:
-

 Summary: RDD#toDebugString fails if any cached RDD is invalid
 Key: SPARK-8707
 URL: https://issues.apache.org/jira/browse/SPARK-8707
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0, 1.4.1
Reporter: Aaron Davidson


Repro:

{code}
sc.parallelize(0 until 100).toDebugString
sc.textFile("/ThisFileDoesNotExist").cache()
sc.parallelize(0 until 100).toDebugString
{code}

Output:

{code}
java.io.IOException: Not a file: /ThisFileDoesNotExist
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:215)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:59)
at 
org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
at 
org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.SparkContext.getRDDStorageInfo(SparkContext.scala:1455)
at org.apache.spark.rdd.RDD.debugSelf$1(RDD.scala:1573)
at org.apache.spark.rdd.RDD.firstDebugString$1(RDD.scala:1607)
at org.apache.spark.rdd.RDD.toDebugString(RDD.scala:1637
{code}

This is because toDebugString gets all the partitions from all RDDs, which 
fails (via SparkContext#getRDDStorageInfo). This pathway should definitely be 
resilient to other RDDs being invalid (and getRDDStorageInfo should probably 
also be).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8708) MatrixFactorizationModel.predictAll() populates single partition only

2015-06-29 Thread Antony Mayi (JIRA)
Antony Mayi created SPARK-8708:
--

 Summary: MatrixFactorizationModel.predictAll() populates single 
partition only
 Key: SPARK-8708
 URL: https://issues.apache.org/jira/browse/SPARK-8708
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Antony Mayi


When using mllib.recommendation.ALS, the RDD returned by .predictAll() has all 
values pushed into a single partition despite using quite high parallelism.

This degrades performance of further processing (I can obviously run 
.partitionBy() to balance it, but that's still too costly, e.g. if running 
.predictAll() in a loop for thousands of products), and it should rather be 
possible to do this somehow on the model (automatically).

Below is an example on a tiny sample (same on a large dataset):

{code:title=pyspark}
>>> r1 = (1, 1, 1.0)
>>> r2 = (1, 2, 2.0)
>>> r3 = (2, 1, 2.0)
>>> r4 = (2, 2, 2.0)
>>> r5 = (3, 1, 1.0)
>>> ratings = sc.parallelize([r1, r2, r3, r4, r5], 5)
>>> ratings.getNumPartitions()
5
>>> users = ratings.map(itemgetter(0)).distinct()
>>> model = ALS.trainImplicit(ratings, 1, seed=10)
>>> predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2)))
>>> predictions_for_2.glom().map(len).collect()
[0, 0, 3, 0, 0]
{code}
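
A caller-side stopgap (assuming the equivalent Scala API; a workaround, not a fix 
in MLlib) is to repartition the returned predictions before further processing:

{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// Workaround sketch: predict, then spread the result back across the cluster.
def balancedPredict(sc: SparkContext,
                    model: MatrixFactorizationModel,
                    usersProducts: RDD[(Int, Int)]): RDD[Rating] =
  model.predict(usersProducts).repartition(sc.defaultParallelism)
{code}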





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8372) History server shows incorrect information for application not started

2015-06-29 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605996#comment-14605996
 ] 

Andrew Or commented on SPARK-8372:
--

OK, per discussion on the #6827 I reverted this.

 History server shows incorrect information for application not started
 --

 Key: SPARK-8372
 URL: https://issues.apache.org/jira/browse/SPARK-8372
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Web UI
Affects Versions: 1.4.0
Reporter: Carson Wang
Assignee: Marcelo Vanzin
Priority: Minor
 Attachments: IncorrectAppInfo.png


 The history server may show an incorrect App ID for an incomplete application 
 like App ID.inprogress. This app info will never disappear even after the 
 app is completed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-8372) History server shows incorrect information for application not started

2015-06-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reopened SPARK-8372:
--
  Assignee: Marcelo Vanzin  (was: Carson Wang)

 History server shows incorrect information for application not started
 --

 Key: SPARK-8372
 URL: https://issues.apache.org/jira/browse/SPARK-8372
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Web UI
Affects Versions: 1.4.0
Reporter: Carson Wang
Assignee: Marcelo Vanzin
Priority: Minor
 Attachments: IncorrectAppInfo.png


 The history server may show an incorrect App ID for an incomplete application 
 like App ID.inprogress. This app info will never disappear even after the 
 app is completed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8705) Javascript error in the web console when `totalExecutionTime` of a task is 0

2015-06-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8705:
-
Affects Version/s: 1.4.0

 Javascript error in the web console when `totalExecutionTime` of a task is 0
 

 Key: SPARK-8705
 URL: https://issues.apache.org/jira/browse/SPARK-8705
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Shixiong Zhu

 Because System.currentTimeMillis() is not accurate for tasks that only need 
 several milliseconds, sometimes totalExecutionTime in makeTimeline will be 0. 
 If totalExecutionTime is 0, there will be the following error in the console.
 !https://cloud.githubusercontent.com/assets/1000778/8406776/5cd38e04-1e92-11e5-89f2-0c5134fe4b6b.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8705) Javascript error in the web console when `totalExecutionTime` of a task is 0

2015-06-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8705:
-
Target Version/s: 1.5.0, 1.4.2

 Javascript error in the web console when `totalExecutionTime` of a task is 0
 

 Key: SPARK-8705
 URL: https://issues.apache.org/jira/browse/SPARK-8705
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Shixiong Zhu

 Because System.currentTimeMillis() is not accurate for tasks that only need 
 several milliseconds, sometimes totalExecutionTime in makeTimeline will be 0. 
 If totalExecutionTime is 0, there will be the following error in the console.
 !https://cloud.githubusercontent.com/assets/1000778/8406776/5cd38e04-1e92-11e5-89f2-0c5134fe4b6b.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8622) Spark 1.3.1 and 1.4.0 doesn't put executor working directory on executor classpath

2015-06-29 Thread Baswaraj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605921#comment-14605921
 ] 

Baswaraj commented on SPARK-8622:
-

That's what I mean. Jars specified by --jars are not put on the classpath, but they 
are in the working directory of the executor. I am expecting either the jars or the 
working directory to be on the classpath.
In 1.3.0, the working directory is on the classpath.
In 1.3.1+, neither the jars nor the working directory is on the classpath.
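
For reference, the spark.executor.extraClassPath workaround mentioned in the quoted 
description below can also be set programmatically. A minimal pyspark sketch, assuming 
the "." value (the executor working directory) from the report:

{code:title=pyspark}
from pyspark import SparkConf, SparkContext

# Equivalent of putting "spark.executor.extraClassPath ." into
# conf/spark-defaults.conf: prepend the executor's working directory
# (where --jars files are copied) to the executor classpath.
conf = SparkConf().set("spark.executor.extraClassPath", ".")
sc = SparkContext(conf=conf)
{code}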

 Spark 1.3.1 and 1.4.0 doesn't put executor working directory on executor 
 classpath
 --

 Key: SPARK-8622
 URL: https://issues.apache.org/jira/browse/SPARK-8622
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.3.1, 1.4.0
Reporter: Baswaraj

 I ran into an issue where the executor is not able to pick up my configs/functions 
 from my custom jar in standalone (client/cluster) deploy mode. I have used the 
 spark-submit --jars option to specify all my jars and configs to be used by 
 executors.
 All these files are placed in the working directory of the executor, but not on 
 the executor classpath.  Also, the executor working directory is not on the 
 executor classpath.
 I am expecting the executor to find all files specified in the spark-submit --jars 
 option.
 In Spark 1.3.0 the executor working directory is on the executor classpath, so the 
 app runs successfully.
 To successfully run my application with Spark 1.3.1+, I have to use the 
 following option (conf/spark-defaults.conf):
 spark.executor.extraClassPath   .
 Please advise.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3692) RBF Kernel implementation to SVM

2015-06-29 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605930#comment-14605930
 ] 

Seth Hendrickson commented on SPARK-3692:
-

It looks like this JIRA will be taken care of by  
[SPARK-4638|https://issues.apache.org/jira/browse/SPARK-4638]. I suspect this 
should be closed as SPARK-4638 contains significant work in progress.

 RBF Kernel implementation to SVM
 

 Key: SPARK-3692
 URL: https://issues.apache.org/jira/browse/SPARK-3692
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Ekrem Aksoy
Priority: Minor

 Radial Basis Function is another type of kernel that can be used instead of 
 linear kernel in SVM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8567) Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars

2015-06-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8567:
-
Fix Version/s: 1.4.1

 Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars
 --

 Key: SPARK-8567
 URL: https://issues.apache.org/jira/browse/SPARK-8567
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 1.4.1
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Critical
  Labels: flaky-test
 Fix For: 1.4.1, 1.5.0


 Seems tests in HiveSparkSubmitSuite fail with timeout pretty frequently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8410) Hive VersionsSuite RuntimeException

2015-06-29 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605977#comment-14605977
 ] 

Burak Yavuz commented on SPARK-8410:


Hi Joe,
Is it possible to delete those files 
(~/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml) from the 
faulty servers? Maybe it would be better to have Spark delete it beforehand. 
That would however mean that the resolution phase will always take a while, 
because the whereabouts of the artifacts are never cached.

 Hive VersionsSuite RuntimeException
 ---

 Key: SPARK-8410
 URL: https://issues.apache.org/jira/browse/SPARK-8410
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1, 1.4.0
 Environment: IBM Power system - P7
 running Ubuntu 14.04LE
Reporter: Josiah Samuel Sathiadass
Assignee: Burak Yavuz
Priority: Minor

 While testing Spark Project Hive, there are RuntimeExceptions as follows,
 VersionsSuite:
 - success sanity check *** FAILED ***
   java.lang.RuntimeException: [download failed: 
 org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed: 
 org.codehaus.groovy#groovy-all;2.1.6!groovy-all.jar, download failed: 
 asm#asm;3.2!asm.jar]
   at 
 org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978)
   at 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62)
   at 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62)
   at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:38)
   at 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$.org$apache$spark$sql$hive$client$IsolatedClientLoader$$downloadVersion(IsolatedClientLoader.scala:61)
   at 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$1.apply(IsolatedClientLoader.scala:44)
   at 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$1.apply(IsolatedClientLoader.scala:44)
   at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189)
   at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
   at 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:44)
   ...
 The tests are executed with the following set of options,
 build/mvn --pl sql/hive --fail-never -Pyarn -Phadoop-2.4 
 -Dhadoop.version=2.6.0 test
 Adding the following dependencies in the spark/sql/hive/pom.xml  file 
 solves this issue,
  <dependency>
    <groupId>org.jboss.netty</groupId>
    <artifactId>netty</artifactId>
    <version>3.2.2.Final</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>org.codehaus.groovy</groupId>
    <artifactId>groovy-all</artifactId>
    <version>2.1.6</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>asm</groupId>
    <artifactId>asm</artifactId>
    <version>3.2</version>
    <scope>test</scope>
  </dependency>
  
 The question is: is this the correct way to fix this RuntimeException?
 If yes, can a pull request fix this issue permanently?
 If not, suggestions please.
 Updates:
 The above-mentioned quick fix does not work with the latest 1.4 because of
 this pull request:
  [SPARK-8095] Resolve dependencies of --packages in local ivy cache #6788 
 https://github.com/apache/spark/pull/6788
 Due to the above commit, the lookup directories during the testing phase
 have changed as follows,
 :: problems summary ::
  WARNINGS
   [NOT FOUND  ] 
 org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle) (2ms)
    local-m2-cache: tried
 
 file:/home/joe/sparkibmsoe/spark/sql/hive/dummy/.m2/repository/org/jboss/netty/netty/3.2.2.Final/netty-3.2.2.Final.jar
   [NOT FOUND  ] 
 org.codehaus.groovy#groovy-all;2.1.6!groovy-all.jar (0ms)
    local-m2-cache: tried
 
 file:/home/joe/sparkibmsoe/spark/sql/hive/dummy/.m2/repository/org/codehaus/groovy/groovy-all/2.1.6/groovy-all-2.1.6.jar
   [NOT FOUND  ] asm#asm;3.2!asm.jar (0ms)
    local-m2-cache: tried
 
 file:/home/joe/sparkibmsoe/spark/sql/hive/dummy/.m2/repository/asm/asm/3.2/asm-3.2.jar
   ::
   ::  FAILED DOWNLOADS::
   :: ^ see resolution messages for details  ^ ::
   ::
   :: org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle)
   :: org.codehaus.groovy#groovy-all;2.1.6!groovy-all.jar
   :: asm#asm;3.2!asm.jar
   ::



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (SPARK-8372) History server shows incorrect information for application not started

2015-06-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8372:
-
Fix Version/s: (was: 1.4.1)
   (was: 1.5.0)

 History server shows incorrect information for application not started
 --

 Key: SPARK-8372
 URL: https://issues.apache.org/jira/browse/SPARK-8372
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Web UI
Affects Versions: 1.4.0
Reporter: Carson Wang
Assignee: Marcelo Vanzin
Priority: Minor
 Attachments: IncorrectAppInfo.png


 The history server may show an incorrect App ID for an incomplete application 
 like App ID.inprogress. This app info will never disappear even after the 
 app is completed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8567) Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars

2015-06-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-8567.

   Resolution: Fixed
Fix Version/s: 1.5.0

 Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars
 --

 Key: SPARK-8567
 URL: https://issues.apache.org/jira/browse/SPARK-8567
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 1.4.1
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Critical
  Labels: flaky-test
 Fix For: 1.5.0


 Seems tests in HiveSparkSubmitSuite fail with timeout pretty frequently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8567) Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars

2015-06-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8567:
-
Target Version/s: 1.4.1, 1.5.0  (was: 1.5.0)

 Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars
 --

 Key: SPARK-8567
 URL: https://issues.apache.org/jira/browse/SPARK-8567
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 1.4.1
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Critical
  Labels: flaky-test
 Fix For: 1.4.1, 1.5.0


 Seems tests in HiveSparkSubmitSuite fail with timeout pretty frequently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8475) SparkSubmit with Ivy jars is very slow to load with no internet access

2015-06-29 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605968#comment-14605968
 ] 

Burak Yavuz commented on SPARK-8475:


ping. I think you can go ahead with a PR for option 1. If you're too busy, I 
can submit one!

 SparkSubmit with Ivy jars is very slow to load with no internet access
 --

 Key: SPARK-8475
 URL: https://issues.apache.org/jira/browse/SPARK-8475
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.4.0
Reporter: Nathan McCarthy
Priority: Minor

 Spark Submit adds maven central and spark bintray to the ChainResolver before 
 it adds any external resolvers. 
 https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L821
 When running on a cluster without internet access, this means the spark shell 
 takes forever to launch as it tries these two remote repos before the ones 
 specified in the --repositories list. In our case we have a proxy which the 
 cluster can access, and we supply it via --repositories.
 This is also a problem for users who maintain a proxy for maven/ivy repos 
 with something like Nexus/Artifactory. Having a repo proxy is popular at many 
 organisations so I'd say this would be a useful change for these users as 
 well. In the current state even if a maven central proxy is supplied, it will 
 still try and hit central. 
 I see two options for a fix;
 * Change the order repos are added to the ChainResolver, making the 
 --repositories supplied repos come before anything else. 
 https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L843
  
 * Have a config option (like spark.jars.ivy.useDefaultRemoteRepos, default 
 true) which, when false, won't add the maven central and bintray resolvers to the 
 ChainResolver. 
 Happy to do a PR for this fix. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-701) Wrong SPARK_MEM setting with different EC2 master and worker machine types

2015-06-29 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606622#comment-14606622
 ] 

Shivaram Venkataraman commented on SPARK-701:
-

Yeah, so SPARK_MEM used to be used for both the master and the executors. Right 
now we have two separate variables, spark.executor.memory and 
spark.driver.memory, that we can set. Let's open a new issue for this.
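
For anyone mapping the old SPARK_MEM behaviour onto these two settings, a minimal 
pyspark sketch (the memory values are just placeholders):

{code:title=pyspark}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.memory", "4g")   # heap per executor on the workers
        .set("spark.driver.memory", "2g"))    # driver-side heap; in client mode this
                                              # one must go in spark-defaults.conf or
                                              # --driver-memory instead, since the
                                              # driver JVM is already running here
sc = SparkContext(conf=conf)
{code}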

 Wrong SPARK_MEM setting with different EC2 master and worker machine types
 --

 Key: SPARK-701
 URL: https://issues.apache.org/jira/browse/SPARK-701
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 0.7.0
Reporter: Josh Rosen
Assignee: Shivaram Venkataraman
 Fix For: 0.7.0


 When launching a spark-ec2 cluster using different worker and master machine 
 types, SPARK_MEM in spark-env.sh is set based on the master's memory instead 
 of the worker's.  This causes jobs to hang if the master has more memory than 
 the workers (because jobs will request too much memory). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8588) Could not use concat with UDF in where clause

2015-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606650#comment-14606650
 ] 

Apache Spark commented on SPARK-8588:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/7103

 Could not use concat with UDF in where clause
 -

 Key: SPARK-8588
 URL: https://issues.apache.org/jira/browse/SPARK-8588
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
 Environment: Centos 7, java 1.7.0_67, scala 2.10.5, run in a spark 
 standalone cluster(or local).
Reporter: StanZhai
Assignee: Wenchen Fan
Priority: Critical

 After upgrading the cluster from spark 1.3.1 to 1.4.0 (rc4), I encountered the 
 following exception when using concat with a UDF in the where clause: 
 {code}
 org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
 dataType on unresolved object, tree: 
 'concat(HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(date#1776),年) 
 at 
 org.apache.spark.sql.catalyst.analysis.UnresolvedFunction.dataType(unresolved.scala:82)
  
 at 
 org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299)
  
 at 
 org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299)
  
 at 
 scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80) 
 at scala.collection.immutable.List.exists(List.scala:84) 
 at 
 org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:299)
  
 at 
 org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:298)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  
 at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
  
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:75)
  
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:85)
  
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
 at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) 
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) 
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) 
 at 
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) 
 at scala.collection.AbstractIterator.to(Iterator.scala:1157) 
 at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) 
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) 
 at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) 
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) 
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:94)
  
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:64)
  
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:136)
  
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:135)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  
 at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)
  
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
 at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) 
 at 
 

[jira] [Commented] (SPARK-8716) Remove executor shared cache feature

2015-06-29 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606649#comment-14606649
 ] 

Marcelo Vanzin commented on SPARK-8716:
---

bq. AFAIK this feature doesn't work under YARN or Mesos.

I haven't checked recently but I believe it works on YARN. YARN behaves 
similarly in that there is a shared app dir (or dirs depending on YARN's 
config). But off the top of my head I don't remember whether Spark points at 
the app dir or the container dir for its own temp files.

 Remove executor shared cache feature
 

 Key: SPARK-8716
 URL: https://issues.apache.org/jira/browse/SPARK-8716
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Josh Rosen
Priority: Minor

 More specifically, this is the feature that is currently flagged by 
 `spark.files.useFetchCache`. There are several reasons why we should remove 
 it.
 (1) It doesn't even work. Recently, each executor gets its own unique temp 
 directory for security reasons.
 (2) There is no way to fix it. The constraints in (1) are fundamentally 
 opposed to sharing resources across executors.
 (3) It is very complex. The method Utils.fetchFile would be greatly 
 simplified without this feature that is not even used.
 (4) There are no tests for it and it is difficult to test.
 Note that we can't just revert the respective patches because they were 
 merged a long time ago.
 Related issues: SPARK-8130, SPARK-6313, SPARK-2713



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8716) Remove executor shared cache feature

2015-06-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8716:
-
Priority: Major  (was: Minor)

 Remove executor shared cache feature
 

 Key: SPARK-8716
 URL: https://issues.apache.org/jira/browse/SPARK-8716
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Josh Rosen

 More specifically, this is the feature that is currently flagged by 
 `spark.files.useFetchCache`. There are several reasons why we should remove 
 it.
 (1) It doesn't even work. Recently, each executor gets its own unique temp 
 directory for security reasons.
 (2) There is no way to fix it. The constraints in (1) are fundamentally 
 opposed to sharing resources across executors.
 (3) It is very complex. The method Utils.fetchFile would be greatly 
 simplified without this feature that is not even used.
 (4) There are no tests for it and it is difficult to test.
 Note that we can't just revert the respective patches because they were 
 merged a long time ago.
 Related issues: SPARK-8130, SPARK-6313, SPARK-2713



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8669) Parquet 1.7 files that store binary enums crash when inferring schema

2015-06-29 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-8669:
--
Target Version/s: 1.5.0

 Parquet 1.7 files that store binary enums crash when inferring schema
 -

 Key: SPARK-8669
 URL: https://issues.apache.org/jira/browse/SPARK-8669
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Steven She
Assignee: Steven She

 Loading a Parquet 1.7 file that contains a binary ENUM field in Spark 
 1.5-SNAPSHOT crashes with the following exception:
 {noformat}
   org.apache.spark.sql.AnalysisException: Illegal Parquet type: BINARY (ENUM);
   at 
 org.apache.spark.sql.parquet.CatalystSchemaConverter.illegalType$1(CatalystSchemaConverter.scala:129)
   at 
 org.apache.spark.sql.parquet.CatalystSchemaConverter.convertPrimitiveField(CatalystSchemaConverter.scala:184)
   at 
 org.apache.spark.sql.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:114)
 ...
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4069) [SPARK-YARN] ApplicationMaster should release all executors' containers before unregistering itself from Yarn RM

2015-06-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4069.

Resolution: Won't Fix

 [SPARK-YARN] ApplicationMaster should release all executors' containers 
 before unregistering itself from Yarn RM
 

 Key: SPARK-4069
 URL: https://issues.apache.org/jira/browse/SPARK-4069
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Min Zhou

 Currently, the ApplicationMaster in yarn mode simply unregisters itself from the yarn 
 master, a.k.a. the resourcemanager. It never releases the executors' containers before 
 that. Yarn's master will decide to kill all the executors' 
 containers when it faces such a scenario, so the resourcemanager log looks like the 
 one below 
 {noformat}
 2014-10-22 23:39:09,903 DEBUG 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 Processing event for appattempt_1414003182949_0004_01 of type UNREGISTERED
 2014-10-22 23:39:09,903 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1414003182949_0004_01 State change from RUNNING to FINAL_SAVING
 2014-10-22 23:39:09,903 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating 
 application application_1414003182949_0004 with final state: FINISHING
 2014-10-22 23:39:09,903 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
 application_1414003182949_0004 State change from RUNNING to FINAL_SAVING
 2014-10-22 23:39:09,903 DEBUG 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 Processing event for appattempt_1414003182949_0004_01 of type 
 ATTEMPT_UPDATE_SAVED
 2014-10-22 23:39:09,903 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing 
 info for app: application_1414003182949_0004
 2014-10-22 23:39:09,903 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1414003182949_0004_01 State change from FINAL_SAVING to 
 FINISHING
 2014-10-22 23:39:09,903 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
 application_1414003182949_0004 State change from FINAL_SAVING to FINISHING
 2014-10-22 23:39:10,485 DEBUG 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 Processing event for appattempt_1414003182949_0004_01 of type 
 CONTAINER_FINISHED
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
 container_1414003182949_0004_01_01 Container Transitioned from RUNNING to 
 COMPLETED
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
 Unregistering app attempt : appattempt_1414003182949_0004_01
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: 
 Completed container: container_1414003182949_0004_01_01 in state: 
 COMPLETED event:FINISHED
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
  Finish information of container container_1414003182949_0004_01_01 is 
 written
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1414003182949_0004_01 State change from FINISHING to FINISHED
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=akim   
 OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS  
 APPID=application_1414003182949_0004
 CONTAINERID=container_1414003182949_0004_01_01
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
 Stored the finish data of container container_1414003182949_0004_01_01
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: 
 Released container container_1414003182949_0004_01_01 of capacity 
 memory:3072, vCores:1 on host host1, which currently has 0 containers, 
 memory:0, vCores:0 used and memory:241901, vCores:32 available, release 
 resources=true
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
 application_1414003182949_0004 State change from FINISHING to FINISHED
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
  Finish information of application attempt 
 appattempt_1414003182949_0004_01 is written
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=akim   
 OPERATION=Application Finished - Succeeded  

[jira] [Commented] (SPARK-4069) [SPARK-YARN] ApplicationMaster should release all executors' containers before unregistering itself from Yarn RM

2015-06-29 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606701#comment-14606701
 ] 

Andrew Or commented on SPARK-4069:
--

Fixed in YARN-3415.

 [SPARK-YARN] ApplicationMaster should release all executors' containers 
 before unregistering itself from Yarn RM
 

 Key: SPARK-4069
 URL: https://issues.apache.org/jira/browse/SPARK-4069
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Min Zhou

 Currently, the ApplicationMaster in yarn mode simply unregisters itself from the yarn 
 master, a.k.a. the resourcemanager. It never releases the executors' containers before 
 that. Yarn's master will decide to kill all the executors' 
 containers when it faces such a scenario, so the resourcemanager log looks like the 
 one below 
 {noformat}
 2014-10-22 23:39:09,903 DEBUG 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 Processing event for appattempt_1414003182949_0004_01 of type UNREGISTERED
 2014-10-22 23:39:09,903 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1414003182949_0004_01 State change from RUNNING to FINAL_SAVING
 2014-10-22 23:39:09,903 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating 
 application application_1414003182949_0004 with final state: FINISHING
 2014-10-22 23:39:09,903 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
 application_1414003182949_0004 State change from RUNNING to FINAL_SAVING
 2014-10-22 23:39:09,903 DEBUG 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 Processing event for appattempt_1414003182949_0004_01 of type 
 ATTEMPT_UPDATE_SAVED
 2014-10-22 23:39:09,903 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing 
 info for app: application_1414003182949_0004
 2014-10-22 23:39:09,903 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1414003182949_0004_01 State change from FINAL_SAVING to 
 FINISHING
 2014-10-22 23:39:09,903 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
 application_1414003182949_0004 State change from FINAL_SAVING to FINISHING
 2014-10-22 23:39:10,485 DEBUG 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 Processing event for appattempt_1414003182949_0004_01 of type 
 CONTAINER_FINISHED
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
 container_1414003182949_0004_01_01 Container Transitioned from RUNNING to 
 COMPLETED
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
 Unregistering app attempt : appattempt_1414003182949_0004_01
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: 
 Completed container: container_1414003182949_0004_01_01 in state: 
 COMPLETED event:FINISHED
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
  Finish information of container container_1414003182949_0004_01_01 is 
 written
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1414003182949_0004_01 State change from FINISHING to FINISHED
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=akim   
 OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS  
 APPID=application_1414003182949_0004
 CONTAINERID=container_1414003182949_0004_01_01
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter: 
 Stored the finish data of container container_1414003182949_0004_01_01
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: 
 Released container container_1414003182949_0004_01_01 of capacity 
 memory:3072, vCores:1 on host host1, which currently has 0 containers, 
 memory:0, vCores:0 used and memory:241901, vCores:32 available, release 
 resources=true
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
 application_1414003182949_0004 State change from FINISHING to FINISHED
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
  Finish information of application attempt 
 appattempt_1414003182949_0004_01 is written
 2014-10-22 23:39:10,485 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=akim   

[jira] [Closed] (SPARK-8634) Fix flaky test StreamingListenerSuite receiver info reporting

2015-06-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-8634.

  Resolution: Fixed
Assignee: Shixiong Zhu
   Fix Version/s: 1.4.2
  1.5.0
Target Version/s: 1.5.0, 1.4.2

 Fix flaky test StreamingListenerSuite receiver info reporting
 ---

 Key: SPARK-8634
 URL: https://issues.apache.org/jira/browse/SPARK-8634
 Project: Spark
  Issue Type: Test
  Components: Streaming, Tests
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
Priority: Critical
  Labels: flaky-test
 Fix For: 1.5.0, 1.4.2


 As per the unit test log in 
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35754/
 {code}
 15/06/24 23:09:10.210 Thread-3495 INFO ReceiverTracker: Starting 1 receivers
 15/06/24 23:09:10.270 Thread-3495 INFO SparkContext: Starting job: apply at 
 Transformer.scala:22
 ...
 15/06/24 23:09:14.259 ForkJoinPool-4-worker-29 INFO 
 StreamingListenerSuiteReceiver: Started receiver and sleeping
 15/06/24 23:09:14.270 ForkJoinPool-4-worker-29 INFO 
 StreamingListenerSuiteReceiver: Reporting error and sleeping
 {code}
 it needs at least 4 seconds to receive all receiver events on this slow 
 machine, but the `timeout` for `eventually` is only 2 seconds.
 We can increase the `timeout` to make this test stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8119) HeartbeatReceiver should not call sc.killExecutor

2015-06-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8119:
-
Summary: HeartbeatReceiver should not call sc.killExecutor  (was: Spark 
will set total executor when some executors fail.)

 HeartbeatReceiver should not call sc.killExecutor
 -

 Key: SPARK-8119
 URL: https://issues.apache.org/jira/browse/SPARK-8119
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.4.0
Reporter: SaintBacchus

 DynamicAllocation will set the total executor count to a small number when it 
 wants to kill some executors.
 But in the non-DynamicAllocation scenario, Spark will also set the total executor count.
 This causes the following problem: sometimes when an executor goes down, no 
 replacement executor is brought up by Spark.
 === EDIT by andrewor14 ===
 The issue is that the AM forgets about the original number of executors it 
 wants after calling sc.killExecutor. Even if dynamic allocation is not 
 enabled, this is still possible because of heartbeat timeouts.
 I think the problem is that sc.killExecutor is used incorrectly in 
 HeartbeatReceiver. The intention of the method is to permanently adjust the 
 number of executors the application will get. In HeartbeatReceiver, however, 
 this is used as a best-effort mechanism to ensure that the timed out executor 
 is dead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8374) Job frequently hangs after YARN preemption

2015-06-29 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605425#comment-14605425
 ] 

Shay Rojansky commented on SPARK-8374:
--

Thanks for your comment and sure, I can help test. I may need a bit of 
hand-holding since I haven't built Spark yet.

 Job frequently hangs after YARN preemption
 --

 Key: SPARK-8374
 URL: https://issues.apache.org/jira/browse/SPARK-8374
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.4.0
 Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04
Reporter: Shay Rojansky
Priority: Critical

 After upgrading to Spark 1.4.0, jobs that get preempted very frequently will 
 not reacquire executors and will therefore hang. To reproduce:
 1. I run Spark job A that acquires all grid resources
 2. I run Spark job B in a higher-priority queue that acquires all grid 
 resources. Job A is fully preempted.
 3. Kill job B, releasing all resources
 4. Job A should at this point reacquire all grid resources, but occasionally 
 doesn't. Repeating the preemption scenario makes the bad behavior occur 
 within a few attempts.
 (see logs at bottom).
 Note issue SPARK-7451 that was supposed to fix some Spark YARN preemption 
 issues, maybe the work there is related to the new issues.
 The 1.4.0 preemption situation is considerably worse than 1.3.1 (we've 
 downgraded to 1.3.1 just because of this issue).
 Logs
 --
 When job B (the preemptor first acquires an application master, the following 
 is logged by job A (the preemptee):
 {noformat}
 ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc 
 client disassociated
 INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0
 WARN ReliableDeliverySupervisor: Association with remote system 
 [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address 
 is now gated for [5000] ms. Reason is: [Disassociated].
 WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, 
 g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost)
 INFO DAGScheduler: Executor lost: 447 (epoch 0)
 INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from 
 BlockManagerMaster.
 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, 
 g023.grid.eaglerd.local, 41406)
 INFO BlockManagerMaster: Removed 447 successfully in removeExecutor
 {noformat}
 (It's strange for errors/warnings to be logged for preemption)
 Later, when job B's AM starts requesting its resources, I get lots of the 
 following in job A:
 {noformat}
 ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc 
 client disassociated
 INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0
 WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, 
 g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost)
 WARN ReliableDeliverySupervisor: Association with remote system 
 [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address 
 is now gated for [5000] ms. Reason is: [Disassociated].
 {noformat}
 Finally, when I kill job B, job A emits lots of the following:
 {noformat}
 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31
 WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist!
 {noformat}
 And finally after some time:
 {noformat}
 WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 
 165964 ms exceeds timeout 12 ms
 ERROR YarnScheduler: Lost an executor 466 (already removed): Executor 
 heartbeat timed out after 165964 ms
 {noformat}
 At this point the job never requests/acquires more resources and hangs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8702) Avoid massive concating strings in Javascript

2015-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8702:
---

Assignee: Apache Spark

 Avoid massive concating strings in Javascript
 -

 Key: SPARK-8702
 URL: https://issues.apache.org/jira/browse/SPARK-8702
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Reporter: Shixiong Zhu
Assignee: Apache Spark

 When there are massive tasks, such as {{sc.parallelize(1 to 10, 
 1).count()}}, the generated JS codes have a lot of string concatenations 
 in the stage page, nearly 40 string concatenations for one task.
 We can generate the whole string for a task instead of executing string 
 concatenations in the browser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8702) Avoid massive concating strings in Javascript

2015-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8702:
---

Assignee: (was: Apache Spark)

 Avoid massive concating strings in Javascript
 -

 Key: SPARK-8702
 URL: https://issues.apache.org/jira/browse/SPARK-8702
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Reporter: Shixiong Zhu

 When there are massive tasks, such as {{sc.parallelize(1 to 10, 
 1).count()}}, the generated JS codes have a lot of string concatenations 
 in the stage page, nearly 40 string concatenations for one task.
 We can generate the whole string for a task instead of executing string 
 concatenations in the browser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8702) Avoid massive concating strings in Javascript

2015-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605452#comment-14605452
 ] 

Apache Spark commented on SPARK-8702:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/7082

 Avoid massive concating strings in Javascript
 -

 Key: SPARK-8702
 URL: https://issues.apache.org/jira/browse/SPARK-8702
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Reporter: Shixiong Zhu

 When there are massive tasks, such as {{sc.parallelize(1 to 10, 
 1).count()}}, the generated JS codes have a lot of string concatenations 
 in the stage page, nearly 40 string concatenations for one task.
 We can generate the whole string for a task instead of executing string 
 concatenations in the browser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8703) Add CountVectorizer as a ml transformer to convert document to words count vector

2015-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8703:
---

Assignee: Apache Spark

 Add CountVectorizer as a ml transformer to convert document to words count 
 vector
 -

 Key: SPARK-8703
 URL: https://issues.apache.org/jira/browse/SPARK-8703
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: yuhao yang
Assignee: Apache Spark
   Original Estimate: 24h
  Remaining Estimate: 24h

 Converts a text document to a sparse vector of token counts.
 I can further add an estimator to extract vocabulary from corpus if that's 
 appropriate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8703) Add CountVectorizer as a ml transformer to convert document to words count vector

2015-06-29 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-8703:
--
Description: 
Converts a text document to a sparse vector of token counts. Similar to 
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

I can further add an estimator to extract vocabulary from corpus if that's 
appropriate.



  was:
Converts a text document to a sparse vector of token counts.

I can further add an estimator to extract vocabulary from corpus if that's 
appropriate.




 Add CountVectorizer as a ml transformer to convert document to words count 
 vector
 -

 Key: SPARK-8703
 URL: https://issues.apache.org/jira/browse/SPARK-8703
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: yuhao yang
   Original Estimate: 24h
  Remaining Estimate: 24h

 Converts a text document to a sparse vector of token counts. Similar to 
 http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
 I can further add an estimator to extract vocabulary from corpus if that's 
 appropriate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8703) Add CountVectorizer as a ml transformer to convert document to words count vector

2015-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8703:
---

Assignee: (was: Apache Spark)

 Add CountVectorizer as a ml transformer to convert document to words count 
 vector
 -

 Key: SPARK-8703
 URL: https://issues.apache.org/jira/browse/SPARK-8703
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: yuhao yang
   Original Estimate: 24h
  Remaining Estimate: 24h

 Converts a text document to a sparse vector of token counts.
 I can further add an estimator to extract vocabulary from corpus if that's 
 appropriate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8235) misc function: sha1 / sha

2015-06-29 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8235.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6963
[https://github.com/apache/spark/pull/6963]
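
A quick sanity-check sketch of the new function through the Python DataFrame API, 
assuming the functions.sha1 wrapper ships alongside the SQL expression in 1.5.0 and a 
shell-style sqlContext; the hashlib call only shows the expected digest from the 
description below:

{code:title=pyspark}
import hashlib
from pyspark.sql import functions as F

df = sqlContext.createDataFrame([("ABC",)], ["s"])
digest = df.select(F.sha1(df.s)).first()[0]

# cross-check against Python's own SHA-1
assert digest == hashlib.sha1(b"ABC").hexdigest()
# '3c01bdbb26f358bab27f267924aa2c9a03fcfdb8'
{code}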

 misc function: sha1 / sha
 -

 Key: SPARK-8235
 URL: https://issues.apache.org/jira/browse/SPARK-8235
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
 Fix For: 1.5.0


 sha1(string/binary): string
 sha(string/binary): string
 Calculates the SHA-1 digest for string or binary and returns the value as a 
 hex string (as of Hive 1.3.0). Example: sha1('ABC') = 
 '3c01bdbb26f358bab27f267924aa2c9a03fcfdb8'.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6830) Memoize frequently queried vals in RDD, such as numPartitions, count etc.

2015-06-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606157#comment-14606157
 ] 

Sean Owen commented on SPARK-6830:
--

Is this valid? For example, consider an RDD from a file that's being written 
to. count() would return larger values each time it is called. Caching it would 
change this behavior. Of course, caching the RDD would also mean the count was 
then fixed, but these are semantically different.
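
To make the trade-off concrete, here is a minimal illustration of the kind of 
memoization being proposed, written in pyspark rather than SparkR and using a wrapper 
class of my own; as noted above, it trades freshness for fewer jobs:

{code:title=pyspark}
class MemoizedRDD(object):
    """Illustrative wrapper only: remembers count() after the first call,
    so repeated queries from an IDE do not trigger repeated Spark jobs."""
    def __init__(self, rdd):
        self._rdd = rdd
        self._count = None

    def count(self):
        if self._count is None:          # first call runs a job
            self._count = self._rdd.count()
        return self._count               # later calls are free, but possibly stale

    def getNumPartitions(self):
        return self._rdd.getNumPartitions()  # already cheap, no job involved
{code}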

 Memoize frequently queried vals in RDD, such as numPartitions, count etc.
 -

 Key: SPARK-6830
 URL: https://issues.apache.org/jira/browse/SPARK-6830
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Minor
  Labels: Starter

 We should memoize frequently queried vals in RDD, such as numPartitions, 
 count etc.
 While using SparkR in RStudio, the `count` function seems to be called 
 frequently by the IDE – I think this is to show some stats about variables in 
 the workspace etc. but this is not great in SparkR as we trigger a job every 
 time count is called.
 Memoization would help in this case, but we should also see if there is some 
 better way to interact with RStudio.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5571) LDA should handle text as well

2015-06-29 Thread Alok Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606161#comment-14606161
 ] 

Alok Singh edited comment on SPARK-5571 at 6/29/15 7:00 PM:


I would like to work on it.


was (Author: aloknsingh):
I would like to work on it if everyone is OK with that.

 LDA should handle text as well
 --

 Key: SPARK-5571
 URL: https://issues.apache.org/jira/browse/SPARK-5571
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently operates only on vectors of word 
 counts.  It should also support training and prediction using text 
 (Strings).
 This plan is sketched in the [original LDA design 
 doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing].
 There should be:
 * runWithText() method which takes an RDD with a collection of Strings (bags 
 of words).  This will also index terms and compute a dictionary.
 * dictionary parameter for when LDA is run with word count vectors
 * prediction/feedback methods returning Strings (such as 
 describeTopicsAsStrings, which is commented out in LDA currently)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-06-29 Thread Alok Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606159#comment-14606159
 ] 

Alok Singh commented on SPARK-5571:
---

Since there is already a Tokenizer class, we can assume the other preprocessing 
classes will be made, so I can assume that the input is already tokenized, 
stemmed, and stop-word removed.
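
Assuming tokenized input as described, the preprocessing a runWithText()-style method 
needs is essentially: index the terms into a dictionary, then turn each document into 
a count vector. A rough pyspark sketch of that step (illustration only, not a proposed 
API):

{code:title=pyspark}
from pyspark.mllib.linalg import Vectors

# docs: an RDD of already-tokenized documents (lists of strings)
docs = sc.parallelize([["spark", "lda", "spark"], ["text", "lda"]])

# 1. index terms and compute the dictionary
vocab = docs.flatMap(lambda tokens: tokens).distinct() \
            .zipWithIndex().collectAsMap()
vocab_size = len(vocab)

# 2. turn each document into a sparse vector of token counts
def to_count_vector(tokens):
    counts = {}
    for t in tokens:
        i = vocab[t]
        counts[i] = counts.get(i, 0) + 1.0
    return Vectors.sparse(vocab_size, counts)

# (docId, countVector) pairs -- the shape the existing word-count LDA consumes
corpus = docs.map(to_count_vector).zipWithIndex() \
             .map(lambda pair: (pair[1], pair[0]))
{code}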

 LDA should handle text as well
 --

 Key: SPARK-5571
 URL: https://issues.apache.org/jira/browse/SPARK-5571
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently operates only on vectors of word 
 counts.  It should also support training and prediction using text 
 (Strings).
 This plan is sketched in the [original LDA design 
 doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing].
 There should be:
 * runWithText() method which takes an RDD with a collection of Strings (bags 
 of words).  This will also index terms and compute a dictionary.
 * dictionary parameter for when LDA is run with word count vectors
 * prediction/feedback methods returning Strings (such as 
 describeTopicsAsStrings, which is commented out in LDA currently)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-06-29 Thread Alok Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606161#comment-14606161
 ] 

Alok Singh commented on SPARK-5571:
---

I would like to work on it if everyone is OK with that.

 LDA should handle text as well
 --

 Key: SPARK-5571
 URL: https://issues.apache.org/jira/browse/SPARK-5571
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently operates only on vectors of word 
 counts.  It should also support training and prediction using text 
 (Strings).
 This plan is sketched in the [original LDA design 
 doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing].
 There should be:
 * runWithText() method which takes an RDD with a collection of Strings (bags 
 of words).  This will also index terms and compute a dictionary.
 * dictionary parameter for when LDA is run with word count vectors
 * prediction/feedback methods returning Strings (such as 
 describeTopicsAsStrings, which is commented out in LDA currently)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)

2015-06-29 Thread Ted Malaska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606166#comment-14606166
 ] 

Ted Malaska commented on SPARK-2447:


Hey Andrew,

https://issues.apache.org/jira/browse/HBASE-13992

Let me know if there is anything else I can do.  I would love this to get into 
HBase.

Let me know if you want to chat offline.

 Add common solution for sending upsert actions to HBase (put, deletes, and 
 increment)
 -

 Key: SPARK-2447
 URL: https://issues.apache.org/jira/browse/SPARK-2447
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Streaming
Reporter: Ted Malaska
Assignee: Ted Malaska

 Going to review the design with Tdas today.  
 But the first thought is to have an extension of VoidFunction that handles the 
 connection to HBase and allows for options such as turning auto flush off for 
 higher throughput (a rough sketch of that pattern follows the goals list below).
 Need to answer the following questions first.
 - Can it be written in Java or should it be written in Scala?
 - What is the best way to add the HBase dependency? (will review how Flume 
 does this as the first option)
 - What is the best way to do testing? (will review how Flume does this as the 
 first option)
 - How to support Python? (Python may be a separate Jira; it is unknown at 
 this time)
 Goals:
 - Simple to use
 - Stable
 - Supports high load
 - Documented (May be in a separate Jira need to ask Tdas)
 - Supports Java, Scala, and hopefully Python
 - Supports Streaming and normal Spark
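
 Illustrative only, and not the API that ended up in HBASE-13992: a minimal Scala sketch of the foreachPartition pattern such a helper would wrap, assuming the HBase 1.0 client and a caller-supplied record-to-Put conversion.
 {code}
 import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
 import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
 import org.apache.spark.rdd.RDD
 import scala.collection.JavaConverters._

 // One HBase connection per partition, batching the Puts instead of
 // connecting once per record.
 def bulkPut[T](rdd: RDD[T], table: String)(toPut: T => Put): Unit = {
   rdd.foreachPartition { records =>
     val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
     val htable = connection.getTable(TableName.valueOf(table))
     try {
       htable.put(records.map(toPut).toList.asJava)
     } finally {
       htable.close()
       connection.close()
     }
   }
 }
 {code}
 The same pattern extends to deletes and increments by swapping the mutation type, and to Streaming by calling it inside foreachRDD.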



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6129) Add a section in user guide for model evaluation

2015-06-29 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606165#comment-14606165
 ] 

Seth Hendrickson commented on SPARK-6129:
-

If no one else has started on this, I'd like to give it a go. 

 Add a section in user guide for model evaluation
 

 Key: SPARK-6129
 URL: https://issues.apache.org/jira/browse/SPARK-6129
 Project: Spark
  Issue Type: New Feature
  Components: Documentation, MLlib
Reporter: Xiangrui Meng

 We now have evaluation metrics for binary, multiclass, ranking, and 
 multilabel in MLlib. It would be nice to have a section in the user guide to 
 summarize them.
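
 For context, a minimal Scala example of the kind of usage such a section would document; scoreAndLabels is an assumed RDD of (score, label) pairs produced by any binary classifier.
 {code}
 import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
 import org.apache.spark.rdd.RDD

 def summarize(scoreAndLabels: RDD[(Double, Double)]): Unit = {
   val metrics = new BinaryClassificationMetrics(scoreAndLabels)
   // Threshold-free summaries; the guide section would also cover the
   // multiclass, multilabel, and ranking metrics classes.
   println(s"Area under ROC = ${metrics.areaUnderROC()}")
   println(s"Area under PR  = ${metrics.areaUnderPR()}")
 }
 {code}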



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8528) Add applicationId to SparkContext object in pyspark

2015-06-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-8528.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6936
[https://github.com/apache/spark/pull/6936]

 Add applicationId to SparkContext object in pyspark
 ---

 Key: SPARK-8528
 URL: https://issues.apache.org/jira/browse/SPARK-8528
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.4.0
Reporter: Vladimir Vladimirov
Priority: Minor
 Fix For: 1.5.0


 It is available in the Scala API.
 Our use case: we want to log the applicationId (YARN in our case) so we can ask 
 DevOps for help with troubleshooting if our app fails.
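
 For reference, a minimal sketch of the Scala side mentioned above (plain println stands in for whatever logging is used):
 {code}
 import org.apache.spark.{SparkConf, SparkContext}

 val sc = new SparkContext(new SparkConf().setAppName("demo"))
 // On YARN this is the YARN application id, handy to include in log lines
 // so a failed run can be looked up later.
 println(s"applicationId = ${sc.applicationId}")
 {code}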



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8528) Add applicationId to SparkContext object in pyspark

2015-06-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-8528:
--
Assignee: Vladimir Vladimirov

 Add applicationId to SparkContext object in pyspark
 ---

 Key: SPARK-8528
 URL: https://issues.apache.org/jira/browse/SPARK-8528
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.4.0
Reporter: Vladimir Vladimirov
Assignee: Vladimir Vladimirov
Priority: Minor
 Fix For: 1.5.0


 It is available in the Scala API.
 Our use case: we want to log the applicationId (YARN in our case) so we can ask 
 DevOps for help with troubleshooting if our app fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6830) Memoize frequently queried vals in RDD, such as numPartitions, count etc.

2015-06-29 Thread Perinkulam I Ganesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606176#comment-14606176
 ] 

Perinkulam I Ganesh commented on SPARK-6830:


This thought crossed our minds earlier as well, so we were debating whether the 
caching should be implemented within the cacheManager, so that the count is 
cached only if the underlying RDD is cached.
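
A minimal Scala sketch of that idea, illustrative only (the real change would live in SparkR's RDD wrapper rather than a helper class like this): memoize count only while the RDD is actually persisted.

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

class CountMemo[T](rdd: RDD[T]) {
  private var memo: Option[Long] = None
  def count(): Long =
    if (rdd.getStorageLevel != StorageLevel.NONE) {
      // Cached data won't change, so the first count can be reused.
      if (memo.isEmpty) memo = Some(rdd.count())
      memo.get
    } else {
      // Not cached: recompute every time, as today.
      rdd.count()
    }
}
{code}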

 Memoize frequently queried vals in RDD, such as numPartitions, count etc.
 -

 Key: SPARK-6830
 URL: https://issues.apache.org/jira/browse/SPARK-6830
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Minor
  Labels: Starter

 We should memoize frequently queried vals in RDD, such as numPartitions, 
 count etc.
 While using SparkR in RStudio, the `count` function seems to be called 
 frequently by the IDE – I think this is to show some stats about variables in 
 the workspace, etc. – but this is not great in SparkR, as we trigger a job every 
 time count is called.
 Memoization would help in this case, but we should also see if there is some 
 better way to interact with RStudio.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8670) Nested columns can't be referenced (but they can be selected)

2015-06-29 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606178#comment-14606178
 ] 

Nicholas Chammas commented on SPARK-8670:
-

FYI: `df.stats.age` works neither on 1.3 nor on 1.4. In both cases it yields 
this:

{code}
AttributeError: 'Column' object has no attribute 'age'
{code}

`df.selectExpr('stats.age')` does work, though.

 Nested columns can't be referenced (but they can be selected)
 -

 Key: SPARK-8670
 URL: https://issues.apache.org/jira/browse/SPARK-8670
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Nicholas Chammas

 This is strange and looks like a regression from 1.3.
 {code}
 import json
 daterz = [
   {
 'name': 'Nick',
 'stats': {
   'age': 28
 }
   },
   {
 'name': 'George',
 'stats': {
   'age': 31
 }
   }
 ]
 df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))
 df.select('stats.age').show()
 df['stats.age']  # 1.4 fails on this line
 {code}
 On 1.3 this works and yields:
 {code}
 age
 28 
 31 
 Out[1]: Column<stats.age AS age#2958L>
 {code}
 On 1.4, however, this gives an error on the last line:
 {code}
 +---+
 |age|
 +---+
 | 28|
 | 31|
 +---+
 ---------------------------------------------------------------------------
 IndexError                                Traceback (most recent call last)
 <ipython-input-1-04bd990e94c6> in <module>()
      19 
      20 df.select('stats.age').show()
 ---> 21 df['stats.age']
 /path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
     678         if isinstance(item, basestring):
     679             if item not in self.columns:
 --> 680                 raise IndexError("no such column: %s" % item)
     681             jc = self._jdf.apply(item)
     682             return Column(jc)
 IndexError: no such column: stats.age
 {code}
 This means, among other things, that you can't join DataFrames on nested 
 columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8621) crosstab exception when one of the value is empty

2015-06-29 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606186#comment-14606186
 ] 

Michael Armbrust commented on SPARK-8621:
-

Will you ever want to access the columns by name?  Having to write 
{{df("name")}} is kind of verbose.  I think I would just special case empty 
string as {{"empty string"}}, but I don't have a strong opinion here.
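
As a hedged illustration of that special-casing, here is a user-side workaround (not the eventual fix), assuming a DataFrame df1 with string columns role and lang:

{code}
import org.apache.spark.sql.functions.{col, lit, when}

// Relabel empty values before crosstab so they never become an empty column name.
val labeled = df1.withColumn("lang_label",
  when(col("lang") === "", lit("(empty)")).otherwise(col("lang")))
labeled.stat.crosstab("role", "lang_label").show()
{code}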

 crosstab exception when one of the value is empty
 -

 Key: SPARK-8621
 URL: https://issues.apache.org/jira/browse/SPARK-8621
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Critical

 I think this happened because some value is empty.
 {code}
 scala> df1.stat.crosstab("role", "lang")
 org.apache.spark.sql.AnalysisException: syntax error in attribute name: ;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.parseAttributeName(LogicalPlan.scala:145)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:135)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:157)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:603)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:394)
   at 
 org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:160)
   at 
 org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:157)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:157)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:147)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:132)
   at 
 org.apache.spark.sql.execution.stat.StatFunctions$.crossTabulate(StatFunctions.scala:132)
   at 
 org.apache.spark.sql.DataFrameStatFunctions.crosstab(DataFrameStatFunctions.scala:91)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8709) Exclude hadoop-client's mockito-all dependency to fix test compilation break for Hadoop 2

2015-06-29 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-8709:
-

 Summary: Exclude hadoop-client's mockito-all dependency to fix 
test compilation break for Hadoop 2
 Key: SPARK-8709
 URL: https://issues.apache.org/jira/browse/SPARK-8709
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Josh Rosen
Assignee: Josh Rosen


{{build/sbt -Phadoop-1 -Dhadoop.version=2.0.0-mr1-cdh4.1.1 -Phive -Pkinesis-asl 
-Phive-thriftserver core/test:compile}} currently fails to compile:

{code}
[error] 
/Users/joshrosen/Documents/spark/core/src/test/java/org/apache/spark/shuffle/unsafe/UnsafeShuffleWriterSuite.java:117:
 error: cannot find symbol
[error] 
when(shuffleMemoryManager.tryToAcquire(anyLong())).then(returnsFirstArg());
[error]   ^
[error]   symbol:   method then(Answer<Object>)
[error]   location: interface OngoingStubbing<Long>
[error] 
/Users/joshrosen/Documents/spark/core/src/test/java/org/apache/spark/shuffle/unsafe/UnsafeShuffleWriterSuite.java:408:
 error: cannot find symbol
[error]   .then(returnsFirstArg()) // Allocate initial sort buffer
[error]   ^
[error]   symbol:   method then(Answer<Object>)
[error]   location: interface OngoingStubbing<Long>
[error] 
/Users/joshrosen/Documents/spark/core/src/test/java/org/apache/spark/shuffle/unsafe/UnsafeShuffleWriterSuite.java:435:
 error: cannot find symbol
[error]   .then(returnsFirstArg()) // Allocate initial sort buffer
[error]   ^
[error]   symbol:   method then(Answer<Object>)
[error]   location: interface OngoingStubbing<Long>
[error] 3 errors
[error] (core/test:compile) javac returned nonzero exit code
[error] Total time: 60 s, completed Jun 29, 2015 11:03:13 AM
{code}

This is because {{hadoop-client}} pulls in a dependency on {{mockito-all}}, but 
I recently changed Spark to depend on {{mockito-core}} instead, which caused 
Hadoop's earlier Mockito version to take precedence over our newer version. 
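
Illustrative only (the actual fix is an exclusion in Spark's Maven poms; names and versions here are taken from the build command above or assumed): the sbt-style equivalent of excluding hadoop-client's mockito-all looks roughly like this.

{code}
// Keep hadoop-client from dragging in the older mockito-all artifact, so the
// newer mockito-core API (e.g. OngoingStubbing.then) stays on the classpath.
val hadoopVersion = "2.0.0-mr1-cdh4.1.1"  // version from the failing build above
libraryDependencies += ("org.apache.hadoop" % "hadoop-client" % hadoopVersion)
  .exclude("org.mockito", "mockito-all")
{code}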



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8668) expr function to convert SQL expression into a Column

2015-06-29 Thread Tarek Auel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606067#comment-14606067
 ] 

Tarek Auel commented on SPARK-8668:
---

Hi,

just to get it right:

selectExpr in the DataFrame API currently takes varargs. Should it be enhanced 
to parse ONE string argument that contains multiple expressions, or am I 
misreading the proposal?

 expr function to convert SQL expression into a Column
 -

 Key: SPARK-8668
 URL: https://issues.apache.org/jira/browse/SPARK-8668
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 selectExpr uses the expression parser to parse string expressions. It would be 
 great to create an expr function in functions.scala/functions.py that 
 converts a string into an expression (or a list of expressions separated by 
 commas).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8668) expr function to convert SQL expression into a Column

2015-06-29 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606073#comment-14606073
 ] 

Reynold Xin commented on SPARK-8668:


This is not about selectExpr, but about adding a new expr function that takes in 
a single string and returns an expression.

Once we do that, we can have expr and selectExpr support taking in one 
string and returning multiple expressions (wrapped in a wrapper expression 
that the analyzer can expand).
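
To make the intent concrete, a hedged usage sketch of the proposed function (df, price, and name are assumed example names):

{code}
import org.apache.spark.sql.functions._   // where the proposed expr(...) would live

// Build Columns from SQL expression strings without going through selectExpr.
val result = df.select(expr("price * 0.9 AS discounted"), col("name"))
result.show()
{code}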


 expr function to convert SQL expression into a Column
 -

 Key: SPARK-8668
 URL: https://issues.apache.org/jira/browse/SPARK-8668
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 selectExpr uses the expression parser to parse string expressions. It would be 
 great to create an expr function in functions.scala/functions.py that 
 converts a string into an expression (or a list of expressions separated by 
 commas).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8410) Hive VersionsSuite RuntimeException

2015-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8410:
---

Assignee: Burak Yavuz  (was: Apache Spark)

 Hive VersionsSuite RuntimeException
 ---

 Key: SPARK-8410
 URL: https://issues.apache.org/jira/browse/SPARK-8410
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1, 1.4.0
 Environment: IBM Power system - P7
 running Ubuntu 14.04LE
Reporter: Josiah Samuel Sathiadass
Assignee: Burak Yavuz
Priority: Minor

 While testing Spark Project Hive, there are RuntimeExceptions as follows,
 VersionsSuite:
 - success sanity check *** FAILED ***
   java.lang.RuntimeException: [download failed: 
 org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed: 
 org.codehaus.groovy#groovy-all;2.1.6!groovy-all.jar, download failed: 
 asm#asm;3.2!asm.jar]
   at 
 org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978)
   at 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62)
   at 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62)
   at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:38)
   at 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$.org$apache$spark$sql$hive$client$IsolatedClientLoader$$downloadVersion(IsolatedClientLoader.scala:61)
   at 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$1.apply(IsolatedClientLoader.scala:44)
   at 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$1.apply(IsolatedClientLoader.scala:44)
   at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189)
   at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
   at 
 org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:44)
   ...
 The tests are executed with the following set of options,
 build/mvn --pl sql/hive --fail-never -Pyarn -Phadoop-2.4 
 -Dhadoop.version=2.6.0 test
 Adding the following dependencies in the spark/sql/hive/pom.xml file 
 solves this issue:
  <dependency>
    <groupId>org.jboss.netty</groupId>
    <artifactId>netty</artifactId>
    <version>3.2.2.Final</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>org.codehaus.groovy</groupId>
    <artifactId>groovy-all</artifactId>
    <version>2.1.6</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>asm</groupId>
    <artifactId>asm</artifactId>
    <version>3.2</version>
    <scope>test</scope>
  </dependency>
 The question is: is this the correct way to fix this RuntimeException?
 If yes, can a pull request fix this issue permanently?
 If not, suggestions please.
 Updates:
 The above-mentioned quick fix does not work with the latest 1.4 because of
 this pull request:
  [SPARK-8095] Resolve dependencies of --packages in local ivy cache #6788 
 https://github.com/apache/spark/pull/6788
 Due to the above commit, the lookup directories during the testing phase
 have changed as follows:
 :: problems summary ::
  WARNINGS
   [NOT FOUND  ] 
 org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle) (2ms)
    local-m2-cache: tried
 
 file:/home/joe/sparkibmsoe/spark/sql/hive/dummy/.m2/repository/org/jboss/netty/netty/3.2.2.Final/netty-3.2.2.Final.jar
   [NOT FOUND  ] 
 org.codehaus.groovy#groovy-all;2.1.6!groovy-all.jar (0ms)
    local-m2-cache: tried
 
 file:/home/joe/sparkibmsoe/spark/sql/hive/dummy/.m2/repository/org/codehaus/groovy/groovy-all/2.1.6/groovy-all-2.1.6.jar
   [NOT FOUND  ] asm#asm;3.2!asm.jar (0ms)
    local-m2-cache: tried
 
 file:/home/joe/sparkibmsoe/spark/sql/hive/dummy/.m2/repository/asm/asm/3.2/asm-3.2.jar
   ::
   ::  FAILED DOWNLOADS::
   :: ^ see resolution messages for details  ^ ::
   ::
   :: org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle)
   :: org.codehaus.groovy#groovy-all;2.1.6!groovy-all.jar
   :: asm#asm;3.2!asm.jar
   ::



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8475) SparkSubmit with Ivy jars is very slow to load with no internet access

2015-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8475:
---

Assignee: (was: Apache Spark)

 SparkSubmit with Ivy jars is very slow to load with no internet access
 --

 Key: SPARK-8475
 URL: https://issues.apache.org/jira/browse/SPARK-8475
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.4.0
Reporter: Nathan McCarthy
Priority: Minor

 Spark Submit adds maven central & spark bintray to the ChainResolver before 
 it adds any external resolvers. 
 https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L821
 When running on a cluster without internet access, this means the spark shell 
 takes forever to launch as it tries these two remote repos before the ones 
 specified in the --repositories list. In our case we have a proxy which the 
 cluster can access, and we supply it via --repositories.
 This is also a problem for users who maintain a proxy for maven/ivy repos 
 with something like Nexus/Artifactory. Having a repo proxy is popular at many 
 organisations, so I'd say this would be a useful change for these users as 
 well. In the current state, even if a maven central proxy is supplied, it will 
 still try and hit central. 
 I see two options for a fix;
 * Change the order repos are added to the ChainResolver, making the 
 --repositories supplied repos come before anything else. 
 https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L843
  
 * Have a config option (like spark.jars.ivy.useDefaultRemoteRepos, default 
 true) which, when false, won't add the maven central & bintray resolvers to the 
 ChainResolver. 
 Happy to do a PR for this fix (a rough sketch of the first option is below). 
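
 Illustrative only, with assumed resolver names: the first option amounts to changing the order in which resolvers are appended to Ivy's ChainResolver in SparkSubmit.
 {code}
 import org.apache.ivy.plugins.resolver.ChainResolver

 // Assumed: userResolvers built from --repositories, defaultResolvers holding
 // the built-in maven central / spark bintray entries.
 val cr = new ChainResolver
 cr.setName("list")
 userResolvers.foreach(r => cr.add(r))     // user-specified repos tried first
 defaultResolvers.foreach(r => cr.add(r))  // built-in defaults only as a fallback
 {code}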



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


