[jira] [Commented] (SPARK-1902) Spark shell prints error when :4040 port already in use
[ https://issues.apache.org/jira/browse/SPARK-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647623#comment-14647623 ] Eugene Morozov commented on SPARK-1902: --- It looks like the package name has changed since then, and log4j.properties now needs a different logger name to turn this off: {noformat} log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR {noformat} I'm not sure what I should do: 1. Reopen this issue 2. Create a new one 3. Or decide it's not important enough to make this change. Please suggest. Spark shell prints error when :4040 port already in use --- Key: SPARK-1902 URL: https://issues.apache.org/jira/browse/SPARK-1902 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash Assignee: Andrew Ash Fix For: 1.1.0 When running two shells on the same machine, I get the below error. The issue is that the first shell takes port 4040, then the next tries 4040 and fails so falls back to 4041, then a third would try 4040 and 4041 before landing on 4042, etc. We should catch the error and instead log it as "Unable to use port 4041; already in use. Attempting port 4042..."
{noformat}
14/05/22 11:31:54 WARN component.AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4041: java.net.BindException: Address already in use
java.net.BindException: Address already in use
 at sun.nio.ch.Net.bind0(Native Method)
 at sun.nio.ch.Net.bind(Net.java:444)
 at sun.nio.ch.Net.bind(Net.java:436)
 at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
 at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
 at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
 at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
 at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
 at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
 at org.eclipse.jetty.server.Server.doStart(Server.java:293)
 at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
 at org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192)
 at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
 at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
 at scala.util.Try$.apply(Try.scala:161)
 at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191)
 at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205)
 at org.apache.spark.ui.WebUI.bind(WebUI.scala:99)
 at org.apache.spark.SparkContext.<init>(SparkContext.scala:217)
 at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957)
 at $line3.$read$$iwC$$iwC.<init>(<console>:8)
 at $line3.$read$$iwC.<init>(<console>:14)
 at $line3.$read.<init>(<console>:16)
 at $line3.$read$.<init>(<console>:20)
 at $line3.$read$.<clinit>(<console>)
 at $line3.$eval$.<init>(<console>:7)
 at $line3.$eval$.<clinit>(<console>)
 at $line3.$eval.$print(<console>)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
 at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
 at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
 at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
 at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
 at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
 at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:121)
 at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:120)
 at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:263)
 at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:120)
 at
{noformat}
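A minimal Scala sketch of the retry behavior the description asks for (not Spark's actual JettyUtils code; the function name and the println logging are illustrative only): try successive ports and emit a one-line message instead of a stack trace when a port is taken.
{code}
import java.net.BindException

// Try basePort, basePort + 1, ... until bind succeeds or maxRetries is
// exhausted (at which point the BindException propagates). Returns the
// port that was successfully bound.
def startOnFreePort(basePort: Int, maxRetries: Int)(bind: Int => Unit): Int = {
  var port = basePort
  var bound = false
  while (!bound) {
    try {
      bind(port)
      bound = true
    } catch {
      case _: BindException if port < basePort + maxRetries =>
        println(s"Unable to use port $port; already in use. Attempting port ${port + 1}...")
        port += 1
    }
  }
  port
}
{code}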
[jira] [Created] (SPARK-9475) Consistent hadoop config for external/*
Cody Koeninger created SPARK-9475: - Summary: Consistent hadoop config for external/* Key: SPARK-9475 URL: https://issues.apache.org/jira/browse/SPARK-9475 Project: Spark Issue Type: Sub-task Reporter: Cody Koeninger Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9473) Consistent hadoop config for SQL
Cody Koeninger created SPARK-9473: - Summary: Consistent hadoop config for SQL Key: SPARK-9473 URL: https://issues.apache.org/jira/browse/SPARK-9473 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cody Koeninger -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9380) Pregel example fix in graphx-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9380: - Assignee: Alexander Ulanov Pregel example fix in graphx-programming-guide -- Key: SPARK-9380 URL: https://issues.apache.org/jira/browse/SPARK-9380 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Alexander Ulanov Assignee: Alexander Ulanov Fix For: 1.4.0 The Pregel operator example expressing single-source shortest paths does not work due to an incorrect graph type: Graph[Int, Double] should be Graph[Long, Double] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
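For reference, a condensed sketch of the guide's single-source shortest path example with the corrected type (assumes an active SparkContext named sc):
{code}
import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators

// Edge attributes hold distances; vertex IDs are Longs, so the graph is
// Graph[Long, Double], not Graph[Int, Double] as the guide previously said.
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
val sourceId: VertexId = 42
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),  // vertex program
  triplet =>                                       // send message
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    },
  (a, b) => math.min(a, b))                        // merge messages
{code}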
[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files
[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647630#comment-14647630 ] Liang-Chi Hsieh commented on SPARK-9347: It will merge different schemas if the parquet schema merging configuration is enabled. spark load of existing parquet files extremely slow if large number of files Key: SPARK-9347 URL: https://issues.apache.org/jira/browse/SPARK-9347 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Samphel Norden When spark sql shell is launched and we point it to a folder containing a large number of parquet files, the sqlContext.parquetFile() command takes a very long time to load the tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9472) Consistent hadoop config for streaming
[ https://issues.apache.org/jira/browse/SPARK-9472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9472: --- Assignee: (was: Apache Spark) Consistent hadoop config for streaming -- Key: SPARK-9472 URL: https://issues.apache.org/jira/browse/SPARK-9472 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Cody Koeninger Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9472) Consistent hadoop config for streaming
[ https://issues.apache.org/jira/browse/SPARK-9472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9472: --- Assignee: Apache Spark Consistent hadoop config for streaming -- Key: SPARK-9472 URL: https://issues.apache.org/jira/browse/SPARK-9472 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Cody Koeninger Assignee: Apache Spark Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9447) Update python API to include RandomForest as classifier changes.
[ https://issues.apache.org/jira/browse/SPARK-9447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9447: - Component/s: PySpark MLlib Update python API to include RandomForest as classifier changes. Key: SPARK-9447 URL: https://issues.apache.org/jira/browse/SPARK-9447 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: holdenk The API should still work after SPARK-9016-make-random-forest-classifiers-implement-classification-trait gets merged in, but we might want to extend it to provide predictRaw and similar methods in the Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9377) Shuffle tuning should discuss task size optimisation
[ https://issues.apache.org/jira/browse/SPARK-9377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647599#comment-14647599 ] Jem Tucker commented on SPARK-9377: --- Yes I will do Shuffle tuning should discuss task size optimisation Key: SPARK-9377 URL: https://issues.apache.org/jira/browse/SPARK-9377 Project: Spark Issue Type: Documentation Components: Documentation, Shuffle Reporter: Jem Tucker Priority: Minor Recent issue SPARK-9310 highlighted the negative effects of overly high parallelism caused by task overhead. Although large task counts are unavoidable with high volumes of data, more detail in the documentation would be very beneficial to newcomers optimising the performance of their applications. Areas to discuss could be: - What are the overheads of a Spark task? -- Does this overhead change with task size etc? - How to dynamically calculate a suitable parallelism for a Spark job - Examples of designing code to minimise shuffles -- How to minimise the data volumes when shuffles are required - Differences between sort-based and hash-based shuffles -- Benefits and weaknesses of each -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files
[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647606#comment-14647606 ] Samphel Norden commented on SPARK-9347: --- One additional question: assuming the schema does evolve, and we have folder 1 and folder 2 each with a different _common_metadata file that represents the schema evolution, will spark merge the 2 different _common_metadata files, or would this not work? spark load of existing parquet files extremely slow if large number of files Key: SPARK-9347 URL: https://issues.apache.org/jira/browse/SPARK-9347 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Samphel Norden When spark sql shell is launched and we point it to a folder containing a large number of parquet files, the sqlContext.parquetFile() command takes a very long time to load the tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9472) Consistent hadoop config for streaming
Cody Koeninger created SPARK-9472: - Summary: Consistent hadoop config for streaming Key: SPARK-9472 URL: https://issues.apache.org/jira/browse/SPARK-9472 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Cody Koeninger Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9276) ThriftServer process can't stop if using command yarn application -kill appid
[ https://issues.apache.org/jira/browse/SPARK-9276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9276. -- Resolution: Not A Problem Reopen if someone can explain in more detail what the problem is ThriftServer process can't stop if using command yarn application -kill appid --- Key: SPARK-9276 URL: https://issues.apache.org/jira/browse/SPARK-9276 Project: Spark Issue Type: Bug Components: SQL Reporter: meiyoula Reproduction Steps: 1. start the thriftserver 2. use beeline to connect to the thriftserver 3. use the command "yarn application -kill appid" or the YARN web UI to kill the thriftserver's application 4. the ApplicationMaster has stopped, but the driver process will always be there Reproduction Condition: There must be a client connected to the thriftserver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9248) Closing curly-braces should always be on their own line
[ https://issues.apache.org/jira/browse/SPARK-9248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9248: --- Assignee: (was: Apache Spark) Closing curly-braces should always be on their own line --- Key: SPARK-9248 URL: https://issues.apache.org/jira/browse/SPARK-9248 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Priority: Minor Closing curly-braces should always be on their own line For example, {noformat} inst/tests/test_sparkSQL.R:606:3: style: Closing curly-braces should always be on their own line, unless it's followed by an else. }, error = function(err) { ^ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9248) Closing curly-braces should always be on their own line
[ https://issues.apache.org/jira/browse/SPARK-9248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9248: --- Assignee: Apache Spark Closing curly-braces should always be on their own line --- Key: SPARK-9248 URL: https://issues.apache.org/jira/browse/SPARK-9248 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Assignee: Apache Spark Priority: Minor Closing curly-braces should always be on their own line For example, {noformat} inst/tests/test_sparkSQL.R:606:3: style: Closing curly-braces should always be on their own line, unless it's followed by an else. }, error = function(err) { ^ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9248) Closing curly-braces should always be on their own line
[ https://issues.apache.org/jira/browse/SPARK-9248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647512#comment-14647512 ] Apache Spark commented on SPARK-9248: - User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/7795 Closing curly-braces should always be on their own line --- Key: SPARK-9248 URL: https://issues.apache.org/jira/browse/SPARK-9248 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Priority: Minor Closing curly-braces should always be on their own line For example, {noformat} inst/tests/test_sparkSQL.R:606:3: style: Closing curly-braces should always be on their own line, unless it's followed by an else. }, error = function(err) { ^ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8978) Implement the DirectKafkaController
[ https://issues.apache.org/jira/browse/SPARK-8978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8978: --- Assignee: Apache Spark Implement the DirectKafkaController --- Key: SPARK-8978 URL: https://issues.apache.org/jira/browse/SPARK-8978 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Iulian Dragos Assignee: Apache Spark Fix For: 1.5.0 Based on this [design doc|https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing]. The DirectKafkaInputDStream should use the rate estimate to control how many records/partition to put in the next batch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files
[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647634#comment-14647634 ] Samphel Norden commented on SPARK-9347: --- I am trying to get spark to only look at _common_metadata files for 2 different schemas. But if the new option (respect.summarymetadata?) is turned on, would it merge based on the different _common_metadata files, or would it have to be disabled so that we use regular part-file merging? spark load of existing parquet files extremely slow if large number of files Key: SPARK-9347 URL: https://issues.apache.org/jira/browse/SPARK-9347 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Samphel Norden When spark sql shell is launched and we point it to a folder containing a large number of parquet files, the sqlContext.parquetFile() command takes a very long time to load the tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8978) Implement the DirectKafkaController
[ https://issues.apache.org/jira/browse/SPARK-8978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647631#comment-14647631 ] Apache Spark commented on SPARK-8978: - User 'dragos' has created a pull request for this issue: https://github.com/apache/spark/pull/7796 Implement the DirectKafkaController --- Key: SPARK-8978 URL: https://issues.apache.org/jira/browse/SPARK-8978 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Iulian Dragos Fix For: 1.5.0 Based on this [design doc|https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing]. The DirectKafkaInputDStream should use the rate estimate to control how many records/partition to put in the next batch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8978) Implement the DirectKafkaController
[ https://issues.apache.org/jira/browse/SPARK-8978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8978: --- Assignee: (was: Apache Spark) Implement the DirectKafkaController --- Key: SPARK-8978 URL: https://issues.apache.org/jira/browse/SPARK-8978 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Iulian Dragos Fix For: 1.5.0 Based on this [design doc|https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing]. The DirectKafkaInputDStream should use the rate estimate to control how many records/partition to put in the next batch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9472) Consistent hadoop config for streaming
[ https://issues.apache.org/jira/browse/SPARK-9472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647665#comment-14647665 ] Apache Spark commented on SPARK-9472: - User 'koeninger' has created a pull request for this issue: https://github.com/apache/spark/pull/7772 Consistent hadoop config for streaming -- Key: SPARK-9472 URL: https://issues.apache.org/jira/browse/SPARK-9472 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Cody Koeninger Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4666) executor.memoryOverhead config should take a memory string
[ https://issues.apache.org/jira/browse/SPARK-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4666. -- Resolution: Won't Fix I think this timed out and/or got subsumed in another JIRA executor.memoryOverhead config should take a memory string -- Key: SPARK-4666 URL: https://issues.apache.org/jira/browse/SPARK-4666 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ryan Williams This config value currently takes an integer number of megabytes, but it should also be able to parse strings like 1g, the way several other config params do. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-9377) Shuffle tuning should discuss task size optimisation
[ https://issues.apache.org/jira/browse/SPARK-9377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jem Tucker updated SPARK-9377: -- Comment: was deleted (was: Yes I will do) Shuffle tuning should discuss task size optimisation Key: SPARK-9377 URL: https://issues.apache.org/jira/browse/SPARK-9377 Project: Spark Issue Type: Documentation Components: Documentation, Shuffle Reporter: Jem Tucker Priority: Minor Recent issue SPARK-9310 highlighted the negative effects of overly high parallelism caused by task overhead. Although large task counts are unavoidable with high volumes of data, more detail in the documentation would be very beneficial to newcomers optimising the performance of their applications. Areas to discuss could be: - What are the overheads of a Spark task? -- Does this overhead change with task size etc? - How to dynamically calculate a suitable parallelism for a Spark job - Examples of designing code to minimise shuffles -- How to minimise the data volumes when shuffles are required - Differences between sort-based and hash-based shuffles -- Benefits and weaknesses of each -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9474) Consistent hadoop config for core
Cody Koeninger created SPARK-9474: - Summary: Consistent hadoop config for core Key: SPARK-9474 URL: https://issues.apache.org/jira/browse/SPARK-9474 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Cody Koeninger -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9377) Shuffle tuning should discuss task size optimisation
[ https://issues.apache.org/jira/browse/SPARK-9377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647502#comment-14647502 ] Sean Owen commented on SPARK-9377: -- [~jem.tucker] do you want to open a PR that implements these? Shuffle tuning should discuss task size optimisation Key: SPARK-9377 URL: https://issues.apache.org/jira/browse/SPARK-9377 Project: Spark Issue Type: Documentation Components: Documentation, Shuffle Reporter: Jem Tucker Priority: Minor Recent issue SPARK-9310 highlighted the negative effects of overly high parallelism caused by task overhead. Although large task counts are unavoidable with high volumes of data, more detail in the documentation would be very beneficial to newcomers optimising the performance of their applications. Areas to discuss could be: - What are the overheads of a Spark task? -- Does this overhead change with task size etc? - How to dynamically calculate a suitable parallelism for a Spark job - Examples of designing code to minimise shuffles -- How to minimise the data volumes when shuffles are required - Differences between sort-based and hash-based shuffles -- Benefits and weaknesses of each -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9476) Kafka stream loses leader after 2h of operation
Ruben Ramalho created SPARK-9476: Summary: Kafka stream loses leader after 2h of operation Key: SPARK-9476 URL: https://issues.apache.org/jira/browse/SPARK-9476 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.1 Environment: Docker, Centos, Spark standalone, core i7, 8Gb Reporter: Ruben Ramalho This seems to happen every 2h, it happens both with the direct stream and the regular stream, and I'm doing window operations over a 1h period (if that can help). Here's part of the error message:
{noformat}
2015-07-30 13:27:23 WARN ClientUtils$:89 - Fetching topic metadata with correlation id 10 for topics [Set(updates)] from broker [id:0,host:192.168.3.23,port:3000] failed
java.nio.channels.ClosedChannelException
 at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
 at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:73)
 at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
 at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
 at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
 at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93)
 at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
 at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
2015-07-30 13:27:23 INFO SyncProducer:68 - Disconnecting from 192.168.3.23:3000
2015-07-30 13:27:23 WARN ConsumerFetcherManager$LeaderFinderThread:89 - [spark-group_81563e123e9f-1438259236988-fc3d82bf-leader-finder-thread], Failed to find leader for Set([updates,0])
kafka.common.KafkaException: fetching topic metadata for topics [Set(oversight-updates)] from broker [ArrayBuffer(id:0,host:192.168.3.23,port:3000)] failed
 at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:72)
 at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93)
 at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
 at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
Caused by: java.nio.channels.ClosedChannelException
 at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
 at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:73)
 at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
 at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
 at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
{noformat}
After the crash I tried to communicate with kafka using a simple scala consumer and producer and had no problems at all. Spark though needs a kafka container restart to resume normal operation. There are no errors in the kafka log, apart from an improperly closed connection. I have been trying to solve this problem for days; I suspect this has something to do with spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
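For context, a hedged sketch of the kind of setup being described (broker address and topic are taken from the report; the batch interval, slide interval, and checkpoint path are assumptions, not from the issue):
{code}
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Direct Kafka stream with a 1h window, roughly matching the report.
val conf = new SparkConf().setAppName("kafka-window")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/kafka-window-checkpoint") // required by windowed counts
val kafkaParams = Map("metadata.broker.list" -> "192.168.3.23:3000")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("updates"))
val counts = stream.map(_._2).countByWindow(Minutes(60), Minutes(5))
counts.print()
ssc.start()
ssc.awaitTermination()
{code}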
[jira] [Created] (SPARK-9478) Add class weights to Random Forest
Patrick Crenshaw created SPARK-9478: --- Summary: Add class weights to Random Forest Key: SPARK-9478 URL: https://issues.apache.org/jira/browse/SPARK-9478 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 1.4.1 Reporter: Patrick Crenshaw Currently, this implementation of random forest does not support class weights. Class weights are important when there is imbalanced training data or the evaluation metric of a classifier is imbalanced (e.g. true positive rate at some false positive threshold). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9477) Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos).
[ https://issues.apache.org/jira/browse/SPARK-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647767#comment-14647767 ] Stacy Pedersen commented on SPARK-9477: --- Hi Sean, can we not just list it as a Cluster Manager type? For example in http://spark.apache.org/docs/latest/cluster-overview.html - and point to the IBM Knowledge Center? You guys don't have to document it, just list our product as a type since you list Mesos and YARN. Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos). Key: SPARK-9477 URL: https://issues.apache.org/jira/browse/SPARK-9477 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Stacy Pedersen Priority: Minor Fix For: 1.4.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9477) Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos).
[ https://issues.apache.org/jira/browse/SPARK-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647767#comment-14647767 ] Stacy Pedersen edited comment on SPARK-9477 at 7/30/15 3:16 PM: Hi Sean, can we not just list it as a Cluster Manager type? For example in http://spark.apache.org/docs/latest/cluster-overview.html - and point to the IBM Knowledge Center? It doesn't have to be documented again, maybe just have our product listed as a type since you list Mesos and YARN. was (Author: stacyp): Hi Sean, can we not just list it as a Cluster Manager type? For example in http://spark.apache.org/docs/latest/cluster-overview.html - and point to the IBM Knowledge Center? You guys don't have to document it, just list our product as a type since you list Mesos and YARN. Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos). Key: SPARK-9477 URL: https://issues.apache.org/jira/browse/SPARK-9477 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Stacy Pedersen Priority: Minor Fix For: 1.4.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9479) ReceiverTrackerSuite fails for maven build
[ https://issues.apache.org/jira/browse/SPARK-9479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647836#comment-14647836 ] Apache Spark commented on SPARK-9479: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/7797 ReceiverTrackerSuite fails for maven build -- Key: SPARK-9479 URL: https://issues.apache.org/jira/browse/SPARK-9479 Project: Spark Issue Type: Bug Components: Streaming, Tests Reporter: Shixiong Zhu The test failure is here: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3109/ I saw the following exception in the log: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.NullPointerException org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:80) org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) org.apache.spark.SparkContext.broadcast(SparkContext.scala:1297) org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:834) {code} This exception is because SparkEnv.get returns null. I found the maven build is different from the sbt build. The maven build will create all Suite classes at the beginning. `ReceiverTrackerSuite` creates StreamingContext (SparkContext) in the constructor. That means SparkContext is created very early. And the global SparkEnv will be set to null in the previous test. Therefore we saw the above exception when running `Receiver tracker - propagates rate limit` in `ReceiverTrackerSuite`. This test was added recently. Note: the previous tests in `ReceiverTrackerSuite` didn't use SparkContext actually, that's why we didn't see such failure before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9477) Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos).
[ https://issues.apache.org/jira/browse/SPARK-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647703#comment-14647703 ] Stacy Pedersen commented on SPARK-9477: --- Here is a link to the IBM Knowledge Center with info on Platform Application Service Controller - http://www-01.ibm.com/support/knowledgecenter/SS3MQL/product_welcome_asc.html Here is how we currently integrate with Spark - http://www-01.ibm.com/support/knowledgecenter/SS3MQL_1.1.0/manage_resources/spark_overview.dita Here is the link to a free trial version of Platform Application Service Controller: https://www-01.ibm.com/marketing/iwm/iwm/web/preLogin.do?source=eipasc Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos). Key: SPARK-9477 URL: https://issues.apache.org/jira/browse/SPARK-9477 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Stacy Pedersen Priority: Minor Fix For: 1.4.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9477) Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos).
[ https://issues.apache.org/jira/browse/SPARK-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647748#comment-14647748 ] Sean Owen commented on SPARK-9477: -- The usual question is: does this need to live in the Spark docs, if it doesn't live in Spark? This sounds like something that's perfectly well documented already. Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos). Key: SPARK-9477 URL: https://issues.apache.org/jira/browse/SPARK-9477 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Stacy Pedersen Priority: Minor Fix For: 1.4.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9478) Add class weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647772#comment-14647772 ] Patrick Crenshaw commented on SPARK-9478: - Similar to this ticket for Logistic Regression https://issues.apache.org/jira/browse/SPARK-7685 and this one for SVMWithSGD https://issues.apache.org/jira/browse/SPARK-3246 Add class weights to Random Forest -- Key: SPARK-9478 URL: https://issues.apache.org/jira/browse/SPARK-9478 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 1.4.1 Reporter: Patrick Crenshaw Currently, this implementation of random forest does not support class weights. Class weights are important when there is imbalanced training data or the evaluation metric of a classifier is imbalanced (e.g. true positive rate at some false positive threshold). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8998) Collect enough frequent prefixes before projection in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-8998. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7783 [https://github.com/apache/spark/pull/7783] Collect enough frequent prefixes before projection in PrefixSpan Key: SPARK-8998 URL: https://issues.apache.org/jira/browse/SPARK-8998 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Zhang JiaJin Fix For: 1.5.0 Original Estimate: 48h Remaining Estimate: 48h The implementation in SPARK-6487 might have scalability issues when the number of frequent items is very small. In this case, we can generate candidate sets of higher orders using Apriori-like algorithms and count them, until we collect enough prefixes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9479) ReceiverTrackerSuite fails for maven build
[ https://issues.apache.org/jira/browse/SPARK-9479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9479: --- Assignee: Apache Spark ReceiverTrackerSuite fails for maven build -- Key: SPARK-9479 URL: https://issues.apache.org/jira/browse/SPARK-9479 Project: Spark Issue Type: Bug Components: Streaming, Tests Reporter: Shixiong Zhu Assignee: Apache Spark The test failure is here: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3109/ I saw the following exception in the log: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.NullPointerException org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:80) org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) org.apache.spark.SparkContext.broadcast(SparkContext.scala:1297) org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:834) {code} This exception is because SparkEnv.get returns null. I found the maven build is different from the sbt build. The maven build will create all Suite classes at the beginning. `ReceiverTrackerSuite` creates StreamingContext (SparkContext) in the constructor. That means SparkContext is created very early. And the global SparkEnv will be set to null in the previous test. Therefore we saw the above exception when running `Receiver tracker - propagates rate limit` in `ReceiverTrackerSuite`. This test was added recently. Note: the previous tests in `ReceiverTrackerSuite` didn't use SparkContext actually, that's why we didn't see such failure before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6885) Decision trees: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647741#comment-14647741 ] Yanbo Liang commented on SPARK-6885: [~josephkb] I created a new version of InformationGainStats called ImpurityStats. It stores information gain, impurity, and prediction-related data all in one data structure, which makes LearningNode simpler. Meanwhile it simplifies and optimizes the binsToBestSplit function. I will fix some trivial issues after your review. It looks like a code refactor in a way. Decision trees: predict class probabilities --- Key: SPARK-6885 URL: https://issues.apache.org/jira/browse/SPARK-6885 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Yanbo Liang Under spark.ml, have DecisionTreeClassifier (currently being added) extend ProbabilisticClassifier. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9477) Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos).
[ https://issues.apache.org/jira/browse/SPARK-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647767#comment-14647767 ] Stacy Pedersen edited comment on SPARK-9477 at 7/30/15 3:24 PM: Hi Sean, can we not just list it as a Cluster Manager type? For example in http://spark.apache.org/docs/latest/cluster-overview.html - and point to the IBM Knowledge Center? It doesn't have to be documented again, maybe just have our product listed as a type since it lists Mesos and YARN. Just a thought :) was (Author: stacyp): Hi Sean, can we not just list it as a Cluster Manager type? For example in http://spark.apache.org/docs/latest/cluster-overview.html - and point to the IBM Knowledge Center? It doesn't have to be documented again, maybe just have our product listed as a type since you list Mesos and YARN. Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos). Key: SPARK-9477 URL: https://issues.apache.org/jira/browse/SPARK-9477 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Stacy Pedersen Priority: Minor Fix For: 1.4.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9479) ReceiverTrackerSuite fails for maven build
Shixiong Zhu created SPARK-9479: --- Summary: ReceiverTrackerSuite fails for maven build Key: SPARK-9479 URL: https://issues.apache.org/jira/browse/SPARK-9479 Project: Spark Issue Type: Bug Components: Streaming, Tests Reporter: Shixiong Zhu The test failure is here: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3109/ I saw the following exception in the log: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.NullPointerException org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:80) org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) org.apache.spark.SparkContext.broadcast(SparkContext.scala:1297) org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:834) {code} This exception is because SparkEnv.get returns null. I found the maven build is different from the sbt build. The maven build will create all Suite classes at the beginning. `ReceiverTrackerSuite` creates StreamingContext (SparkContext) in the constructor. That means SparkContext is created very early. And the global SparkEnv will be set to null in the previous test. Therefore we saw the above exception when running `Receiver tracker - propagates rate limit` in `ReceiverTrackerSuite`. This test was added recently. Note: the previous tests in `ReceiverTrackerSuite` didn't use SparkContext actually, that's why we didn't see such failure before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
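A hedged sketch of the fix direction implied by the analysis above: create the StreamingContext per test rather than in the suite constructor, so no SparkContext (and hence no stale SparkEnv) exists before a test actually runs. The suite and test names are illustrative, not Spark's actual test code.
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.scalatest.{BeforeAndAfter, FunSuite}

class ExampleTrackerSuite extends FunSuite with BeforeAndAfter {
  private var ssc: StreamingContext = _

  // Constructing the StreamingContext here instead of in the class body
  // means nothing touches SparkEnv until the test itself executes.
  before {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ExampleTrackerSuite")
    ssc = new StreamingContext(conf, Seconds(1))
  }

  after {
    if (ssc != null) {
      ssc.stop()
      ssc = null
    }
  }

  test("receiver tracker test body goes here") {
    assert(ssc != null)
  }
}
{code}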
[jira] [Commented] (SPARK-9347) spark load of existing parquet files extremely slow if large number of files
[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647673#comment-14647673 ] Liang-Chi Hsieh commented on SPARK-9347: Actually the newly introduced configuration works only if the parquet schema merging configuration is enabled. So you need to turn both on. spark load of existing parquet files extremely slow if large number of files Key: SPARK-9347 URL: https://issues.apache.org/jira/browse/SPARK-9347 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Samphel Norden When spark sql shell is launched and we point it to a folder containing a large number of parquet files, the sqlContext.parquetFile() command takes a very long time to load the tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
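For concreteness, a hedged sketch of turning both settings on (the option names are assumed from the Spark 1.5-era SQLConf; verify them against your version, and note sc is an existing SparkContext):
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Enable schema merging across files/folders.
sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")
// Trust summary files (_metadata / _common_metadata) instead of reading
// every part-file footer, which is what makes large folders slow.
sqlContext.setConf("spark.sql.parquet.respectSummaryFiles", "true")
val df = sqlContext.read.parquet("/path/to/parquet/folder")
{code}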
[jira] [Resolved] (SPARK-5561) Generalize PeriodicGraphCheckpointer for RDDs
[ https://issues.apache.org/jira/browse/SPARK-5561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5561. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7728 [https://github.com/apache/spark/pull/7728] Generalize PeriodicGraphCheckpointer for RDDs - Key: SPARK-5561 URL: https://issues.apache.org/jira/browse/SPARK-5561 Project: Spark Issue Type: Improvement Components: GraphX, MLlib, Spark Core Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Fix For: 1.5.0 PeriodicGraphCheckpointer was introduced for Latent Dirichlet Allocation (LDA), but it could be generalized to work with both Graphs and RDDs. It should be generalized and moved out of MLlib. (For those who are not familiar with it, it tries to automatically handle persisting/unpersisting and checkpointing/removing checkpoint files in a lineage of Graphs.) A generalized version might be immediately useful for: * RandomForest * Streaming * GLMs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9477) Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos).
Stacy Pedersen created SPARK-9477: - Summary: Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos). Key: SPARK-9477 URL: https://issues.apache.org/jira/browse/SPARK-9477 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Stacy Pedersen Priority: Minor Fix For: 1.4.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7368) add QR decomposition for RowMatrix
[ https://issues.apache.org/jira/browse/SPARK-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7368. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 5909 [https://github.com/apache/spark/pull/5909] add QR decomposition for RowMatrix -- Key: SPARK-7368 URL: https://issues.apache.org/jira/browse/SPARK-7368 Project: Spark Issue Type: New Feature Components: MLlib Reporter: yuhao yang Assignee: yuhao yang Fix For: 1.5.0 Original Estimate: 48h Remaining Estimate: 48h Add QR decomposition for RowMatrix. There's a great distributed algorithm for QR decomposition, which I'm currently referring to. Austin R. Benson, David F. Gleich, James Demmel. Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures, 2013 IEEE International Conference on Big Data -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
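A hedged usage sketch of the new API (assuming it is exposed as RowMatrix.tallSkinnyQR(computeQ), returning a QRDecomposition with factors Q and R, and that an active SparkContext named sc is available):
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// A small tall-and-skinny matrix: many rows, few columns.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0),
  Vectors.dense(5.0, 6.0)))
val mat = new RowMatrix(rows)
val qr = mat.tallSkinnyQR(computeQ = true)
println(qr.R) // local upper-triangular factor; qr.Q is a distributed RowMatrix
{code}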
[jira] [Assigned] (SPARK-9479) ReceiverTrackerSuite fails for maven build
[ https://issues.apache.org/jira/browse/SPARK-9479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9479: --- Assignee: (was: Apache Spark) ReceiverTrackerSuite fails for maven build -- Key: SPARK-9479 URL: https://issues.apache.org/jira/browse/SPARK-9479 Project: Spark Issue Type: Bug Components: Streaming, Tests Reporter: Shixiong Zhu The test failure is here: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3109/ I saw the following exception in the log: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.NullPointerException org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:80) org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) org.apache.spark.SparkContext.broadcast(SparkContext.scala:1297) org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:834) {code} This exception is because SparkEnv.get returns null. I found the maven build is different from the sbt build. The maven build will create all Suite classes at the beginning. `ReceiverTrackerSuite` creates StreamingContext (SparkContext) in the constructor. That means SparkContext is created very early. And the global SparkEnv will be set to null in the previous test. Therefore we saw the above exception when running `Receiver tracker - propagates rate limit` in `ReceiverTrackerSuite`. This test was added recently. Note: the previous tests in `ReceiverTrackerSuite` didn't use SparkContext actually, that's why we didn't see such failure before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9469) TungstenSort should not do safe -> unsafe conversion itself
[ https://issues.apache.org/jira/browse/SPARK-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648309#comment-14648309 ] Apache Spark commented on SPARK-9469: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7803 TungstenSort should not do safe -> unsafe conversion itself --- Key: SPARK-9469 URL: https://issues.apache.org/jira/browse/SPARK-9469 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical TungstenSort itself assumes input rows are safe rows, and uses a projection to turn the safe rows into UnsafeRows. We should take that part of the logic out of TungstenSort, and let the planner take care of the conversion. In that case, if the input is UnsafeRow already, no conversion is needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9454) LDASuite should use vector comparisons
[ https://issues.apache.org/jira/browse/SPARK-9454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9454: - Assignee: Feynman Liang LDASuite should use vector comparisons -- Key: SPARK-9454 URL: https://issues.apache.org/jira/browse/SPARK-9454 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Feynman Liang Assignee: Feynman Liang Priority: Minor Fix For: 1.5.0 {{LDASuite}}'s "OnlineLDAOptimizer one iteration" test currently compares correctness using hacky string comparisons. We should compare the vectors instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-9487: - Target Version/s: 1.5.0 Use the same num. worker threads in Scala/Python unit tests --- Key: SPARK-9487 URL: https://issues.apache.org/jira/browse/SPARK-9487 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core, SQL, Tests Affects Versions: 1.5.0 Reporter: Xiangrui Meng In Python we use `local[4]` for unit tests, while in Scala/Java we use `local[2]` and `local` for some unit tests in SQL, MLlib, and other components. If an operation depends on partition IDs, e.g., a random number generator, this will lead to different results in Python and Scala/Java. It would be nice to use the same number in all unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
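A minimal sketch of the inconsistency described above (assumed test code, not from the issue): logic seeded by partition ID returns different results under local[2] and local[4], because the same elements land in different partitions.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// With setMaster("local[2]") instead, the partition IDs (and thus the
// per-partition random seeds) assigned to each element would differ,
// producing a different collected result for identical input.
val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("unit-test"))
val seededByPartition = sc.parallelize(1 to 8).mapPartitionsWithIndex { (pid, iter) =>
  val rng = new scala.util.Random(pid) // seed depends on partition ID
  iter.map(x => x + rng.nextInt(10))
}.collect()
{code}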
[jira] [Created] (SPARK-9484) Word2Vec import/export for original binary format
Joseph K. Bradley created SPARK-9484: Summary: Word2Vec import/export for original binary format Key: SPARK-9484 URL: https://issues.apache.org/jira/browse/SPARK-9484 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Priority: Minor It would be nice to add model import/export for Word2Vec which handles the original binary format used by [https://code.google.com/p/word2vec/] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
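For illustration, a hedged sketch of an exporter for that binary format, assuming the layout used by the original tool (an ASCII header "vocabSize vectorSize", then each token followed by a space and vectorSize little-endian 4-byte floats); saveWord2VecBinary is a hypothetical helper, not an existing MLlib API:
{code}
import java.io.{BufferedOutputStream, FileOutputStream}
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets

def saveWord2VecBinary(path: String, vectors: Map[String, Array[Float]]): Unit = {
  val out = new BufferedOutputStream(new FileOutputStream(path))
  try {
    val vectorSize = vectors.headOption.map(_._2.length).getOrElse(0)
    // Header: vocabulary size and vector dimension.
    out.write(s"${vectors.size} $vectorSize\n".getBytes(StandardCharsets.UTF_8))
    vectors.foreach { case (word, vec) =>
      out.write(s"$word ".getBytes(StandardCharsets.UTF_8))
      // The original format stores floats little-endian.
      val buf = ByteBuffer.allocate(4 * vec.length).order(ByteOrder.LITTLE_ENDIAN)
      vec.foreach(buf.putFloat)
      out.write(buf.array())
      out.write('\n')
    }
  } finally {
    out.close()
  }
}
{code}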
[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648180#comment-14648180 ] Joseph K. Bradley commented on SPARK-5692: -- This was not done, but thanks for the reminder; it'd be nice to add. I'll make and link a JIRA for it. Model import/export for Word2Vec Key: SPARK-5692 URL: https://issues.apache.org/jira/browse/SPARK-5692 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Manoj Kumar Fix For: 1.4.0 Support save and load for Word2VecModel. We may want to discuss whether we want to be compatible with the original Word2Vec model storage format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7583) User guide update for RegexTokenizer
[ https://issues.apache.org/jira/browse/SPARK-7583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648182#comment-14648182 ] Joseph K. Bradley commented on SPARK-7583: -- Yes, please! This can go in after the feature freeze. User guide update for RegexTokenizer Key: SPARK-7583 URL: https://issues.apache.org/jira/browse/SPARK-7583 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Joseph K. Bradley Copied from [SPARK-7443]: {quote} Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. * The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. * We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. {quote} Note: I created a new subsection for links to spark.ml-specific guides in this JIRA's PR: [SPARK-7557]. This transformer can go within the new subsection. I'll try to get that PR merged ASAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9488) pyspark.sql.types.Row very slow when used named arguments
Alexis Benoist created SPARK-9488: - Summary: pyspark.sql.types.Row very slow when used named arguments Key: SPARK-9488 URL: https://issues.apache.org/jira/browse/SPARK-9488 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Environment: Reporter: Alexis Benoist We can see that the implementation of the Row is accessing items in O(n). https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1217 We could use an OrderedDict instead of a tuple to make the access time in O(1). Can the keys be of an unhashable type? I'm ok to do the edit. Cheers, Alexis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9488) pyspark.sql.types.Row very slow when using named arguments
[ https://issues.apache.org/jira/browse/SPARK-9488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexis Benoist updated SPARK-9488: -- Summary: pyspark.sql.types.Row very slow when using named arguments (was: pyspark.sql.types.Row very slow when used named arguments) pyspark.sql.types.Row very slow when using named arguments -- Key: SPARK-9488 URL: https://issues.apache.org/jira/browse/SPARK-9488 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Environment: Reporter: Alexis Benoist Labels: performance We can see that the implementation of the Row is accessing items in O(n). https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1217 We could use an OrderedDict instead of a tuple to make the access time in O(1). Can the keys be of an unhashable type? I'm ok to do the edit. Cheers, Alexis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8197) date/time function: trunc
[ https://issues.apache.org/jira/browse/SPARK-8197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648351#comment-14648351 ] Apache Spark commented on SPARK-8197: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/7805 date/time function: trunc - Key: SPARK-8197 URL: https://issues.apache.org/jira/browse/SPARK-8197 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin trunc(string date[, string format]): string trunc(date date[, string format]): date Returns date truncated to the unit specified by the format (as of Hive 1.2.0). Supported formats: MONTH/MON/MM, YEAR/YYYY/YY. If format is omitted the date will be truncated to the nearest day. Example: trunc('2015-03-17', 'MM') = 2015-03-01. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
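For illustration, a minimal sketch of the described semantics using java.time (Java 8+); this mirrors the Hive behavior quoted above rather than the actual Spark implementation:

{code:java}
import java.time.LocalDate

object DateTruncDemo {
  def trunc(date: LocalDate, format: String): LocalDate = format.toUpperCase match {
    case "MONTH" | "MON" | "MM" => date.withDayOfMonth(1) // first day of the month
    case "YEAR" | "YYYY" | "YY" => date.withDayOfYear(1)  // first day of the year
    case other => throw new IllegalArgumentException(s"Unsupported format: $other")
  }

  def main(args: Array[String]): Unit = {
    println(trunc(LocalDate.parse("2015-03-17"), "MM")) // 2015-03-01, as in the example
    println(trunc(LocalDate.parse("2015-03-17"), "YY")) // 2015-01-01
  }
}
{code}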
[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648192#comment-14648192 ] Joseph K. Bradley commented on SPARK-6227: -- That's great you're interested. Please read this for lots of helpful info: [https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark] I would download the original source code from the Apache Spark website and install it natively, without using the VM. There are instructions for that in the Spark docs and READMEs. To get started, I recommend finding some small JIRAs which have been resolved already and looking at the PRs which solved them. Those will give you an idea of the code structure. Good luck! PCA and SVD for PySpark --- Key: SPARK-6227 URL: https://issues.apache.org/jira/browse/SPARK-6227 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.2.1 Reporter: Julien Amelot The Dimensionality Reduction techniques are not available via Python (Scala + Java only). * Principal component analysis (PCA) * Singular value decomposition (SVD) Doc: http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6684) Add checkpointing to GradientBoostedTrees
[ https://issues.apache.org/jira/browse/SPARK-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6684: - Shepherd: Xiangrui Meng Add checkpointing to GradientBoostedTrees - Key: SPARK-6684 URL: https://issues.apache.org/jira/browse/SPARK-6684 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley We should add checkpointing to GradientBoostedTrees since it maintains RDDs with long lineages. keywords: gradient boosting, gbt, gradient boosted trees -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8176) date/time function: to_date
[ https://issues.apache.org/jira/browse/SPARK-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648350#comment-14648350 ] Apache Spark commented on SPARK-8176: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/7805 date/time function: to_date --- Key: SPARK-8176 URL: https://issues.apache.org/jira/browse/SPARK-8176 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang parse a timestamp string and return the date portion {code} to_date(string timestamp): date {code} Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01" (in some date format) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9408) Refactor mllib/linalg.py to mllib/linalg
[ https://issues.apache.org/jira/browse/SPARK-9408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-9408: - Shepherd: Davies Liu (was: Xiangrui Meng) Refactor mllib/linalg.py to mllib/linalg Key: SPARK-9408 URL: https://issues.apache.org/jira/browse/SPARK-9408 Project: Spark Issue Type: Task Components: MLlib, PySpark Reporter: Manoj Kumar Assignee: Manoj Kumar We need to refactor mllib/linalg.py to mllib/linalg so that the project structure is similar to that of Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye updated SPARK-9485: Description: Spark-submit throws an exception when connecting to yarn but it works when used in standalone mode. I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but got the same exception below. spark-submit --master yarn-client Here is a stack trace of the exception: 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) java.lang.NullPointerException at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:193) at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033) at $iwC$$iwC.<init>(<console>:9) at $iwC.<init>(<console>:18) at <init>(<console>:20) at .<init>(<console>:24) at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) at
[jira] [Commented] (SPARK-967) start-slaves.sh uses local path from master on remote slave nodes
[ https://issues.apache.org/jira/browse/SPARK-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648234#comment-14648234 ] David Chin commented on SPARK-967: -- I won't create a pull request unless asked to, but I have a solution for this. I am running Spark in standalone mode within a Univa Grid Engine cluster. As such, configs and logs, etc should be specific to each UGE job, identified by an integer job ID. Currently, any environment variables on the master are not passed along by the sbin/start-slaves.sh invocation of ssh. I put in a fix on my local version, which works. However, this is still less than ideal in that UGE's job accounting cannot keep track of resource usage by jobs not under its process tree. Not sure, yet, what the correct solution is. I thought I saw a feature request to allow other remote shell programs besides ssh, but I can't find it now. Please see my version of sbin/start-slaves.sh here: https://github.com/prehensilecode/spark/blob/master/sbin/start-slaves.sh start-slaves.sh uses local path from master on remote slave nodes - Key: SPARK-967 URL: https://issues.apache.org/jira/browse/SPARK-967 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 0.8.0, 0.8.1, 0.9.0 Reporter: Evgeniy Tsvigun Priority: Trivial Labels: script, starter If a slave node has home path other than master, start-slave.sh fails to start a worker instance, for other nodes behaves as expected, in my case: $ ./bin/start-slaves.sh node05.dev.vega.ru: bash: line 0: cd: /usr/home/etsvigun/spark/bin/..: No such file or directory node04.dev.vega.ru: org.apache.spark.deploy.worker.Worker running as process 4796. Stop it first. node03.dev.vega.ru: org.apache.spark.deploy.worker.Worker running as process 61348. Stop it first. I don't mention /usr/home anywhere, the only environment variable I set is $SPARK_HOME, relative to $HOME on every node, which makes me think some script takes `pwd` on master and tries to use it on slaves. Spark version: fb6875dd5c9334802580155464cef9ac4d4cc1f0 OS: FreeBSD 8.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648233#comment-14648233 ] Philip Adetiloye commented on SPARK-9485: - [~srowen] Thanks for the quick reply. It actually consistent (everytime) and here is the details of my configuration. conf/spark-env.sh basically has this settings: #!/usr/bin/env bash HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop SPARK_YARN_QUEUE=dev and my conf/slaves 10.0.0.204 10.0.0.205 ~/.profile contains my settings here: export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed s:bin/java::) export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_YARN_HOME=$HADOOP_INSTALL export HADOOP_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop export HADOOP_COMMON_HOME=$HADOOP_INSTALL export YARN_CONF_DIR=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS=-Djava.library.path=$HADOOP_HOME/lib export HADOOP_OPTS=$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native export PATH=$PATH:/usr/local/spark/sbin export PATH=$PATH:/usr/local/spark/bin export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/ export SCALA_HOME=/usr/local/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH Hope this helps. Thanks, - Phil Failed to connect to yarn / spark-submit --master yarn-client - Key: SPARK-9485 URL: https://issues.apache.org/jira/browse/SPARK-9485 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit, YARN Affects Versions: 1.4.1 Environment: DEV Reporter: Philip Adetiloye Priority: Minor Spark-submit throws an exception when connecting to yarn but it works when used in standalone mode. I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but got the same exception below. 
spark-submit --master yarn-client Here is a stack trace of the exception: 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
[jira] [Updated] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9485: - Shepherd: (was: MEN CHAMROEUN) Target Version/s: (was: 1.4.1) Environment: (was: DEV) Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark -- this JIRA had some fields set that should not be. I don't think that helps since it's just a list of your local configs, specific to your environment. Obviously, in general yarn-client mode does not yield a failure on startup, so this isn't quite helpful in understanding the failure. It seems specific to your env. Failed to connect to yarn / spark-submit --master yarn-client - Key: SPARK-9485 URL: https://issues.apache.org/jira/browse/SPARK-9485 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit, YARN Affects Versions: 1.4.1 Reporter: Philip Adetiloye Priority: Minor Spark-submit throws an exception when connecting to yarn but it works when used in standalone mode. I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but got the same exception below. spark-submit --master yarn-client Here is a stack trace of the exception: 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) java.lang.NullPointerException at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:193) at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033) at $iwC$$iwC.<init>(<console>:9) at $iwC.<init>(<console>:18) at <init>(<console>:20) at .<init>(<console>:24)
[jira] [Comment Edited] (SPARK-967) start-slaves.sh uses local path from master on remote slave nodes
[ https://issues.apache.org/jira/browse/SPARK-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648234#comment-14648234 ] David Chin edited comment on SPARK-967 at 7/30/15 8:34 PM: --- I won't create a pull request unless asked to, but I have a solution for this. I am running Spark in standalone mode within a Univa Grid Engine cluster. As such, configs and logs, etc should be specific to each UGE job, identified by an integer job ID. Currently, any environment variables on the master are not passed along by the sbin/start-slaves.sh invocation of ssh. I put in a fix on my local version, which works. However, this is still less than ideal in that UGE's job accounting cannot keep track of resource usage by jobs not under its process tree. Not sure, yet, what the correct solution is. I thought I saw a feature request to allow other remote shell programs besides ssh, but I can't find it now. Please see my version of sbin/start-slaves.sh here, forked from current master: https://github.com/prehensilecode/spark/blob/master/sbin/start-slaves.sh was (Author: prehensilecode): I won't create a pull request unless asked to, but I have a solution for this. I am running Spark in standalone mode within a Univa Grid Engine cluster. As such, configs and logs, etc should be specific to each UGE job, identified by an integer job ID. Currently, any environment variables on the master are not passed along by the sbin/start-slaves.sh invocation of ssh. I put in a fix on my local version, which works. However, this is still less than ideal in that UGE's job accounting cannot keep track of resource usage by jobs not under its process tree. Not sure, yet, what the correct solution is. I thought I saw a feature request to allow other remote shell programs besides ssh, but I can't find it now. Please see my version of sbin/start-slaves.sh here: https://github.com/prehensilecode/spark/blob/master/sbin/start-slaves.sh start-slaves.sh uses local path from master on remote slave nodes - Key: SPARK-967 URL: https://issues.apache.org/jira/browse/SPARK-967 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 0.8.0, 0.8.1, 0.9.0 Reporter: Evgeniy Tsvigun Priority: Trivial Labels: script, starter If a slave node has home path other than master, start-slave.sh fails to start a worker instance, for other nodes behaves as expected, in my case: $ ./bin/start-slaves.sh node05.dev.vega.ru: bash: line 0: cd: /usr/home/etsvigun/spark/bin/..: No such file or directory node04.dev.vega.ru: org.apache.spark.deploy.worker.Worker running as process 4796. Stop it first. node03.dev.vega.ru: org.apache.spark.deploy.worker.Worker running as process 61348. Stop it first. I don't mention /usr/home anywhere, the only environment variable I set is $SPARK_HOME, relative to $HOME on every node, which makes me think some script takes `pwd` on master and tries to use it on slaves. Spark version: fb6875dd5c9334802580155464cef9ac4d4cc1f0 OS: FreeBSD 8.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9486) Add aliasing to data sources to allow external packages to register themselves with Spark
[ https://issues.apache.org/jira/browse/SPARK-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9486: --- Assignee: Apache Spark Add aliasing to data sources to allow external packages to register themselves with Spark - Key: SPARK-9486 URL: https://issues.apache.org/jira/browse/SPARK-9486 Project: Spark Issue Type: Improvement Components: SQL Reporter: Joseph Batchik Assignee: Apache Spark Priority: Minor Currently Spark allows users to use external data sources like spark-avro, spark-csv, etc. by having them specify their full class name: {code:java} sqlContext.read.format("com.databricks.spark.avro").load(path) {code} Typing in a full class name is not the best idea, so it would be nice to allow external packages to register themselves with Spark, letting users do something like: {code:java} sqlContext.read.format("avro").load(path) {code} This would make it so that the external data source packages follow the same convention as the built-in data sources (parquet, json, jdbc, etc.) do. This could be accomplished by using a ServiceLoader. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9486) Add aliasing to data sources to allow external packages to register themselves with Spark
[ https://issues.apache.org/jira/browse/SPARK-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9486: --- Assignee: (was: Apache Spark) Add aliasing to data sources to allow external packages to register themselves with Spark - Key: SPARK-9486 URL: https://issues.apache.org/jira/browse/SPARK-9486 Project: Spark Issue Type: Improvement Components: SQL Reporter: Joseph Batchik Priority: Minor Currently Spark allows users to use external data sources like spark-avro, spark-csv, etc. by having them specify their full class name: {code:java} sqlContext.read.format("com.databricks.spark.avro").load(path) {code} Typing in a full class name is not the best idea, so it would be nice to allow external packages to register themselves with Spark, letting users do something like: {code:java} sqlContext.read.format("avro").load(path) {code} This would make it so that the external data source packages follow the same convention as the built-in data sources (parquet, json, jdbc, etc.) do. This could be accomplished by using a ServiceLoader. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9485) Failed to connect to yarn
Philip Adetiloye created SPARK-9485: --- Summary: Failed to connect to yarn Key: SPARK-9485 URL: https://issues.apache.org/jira/browse/SPARK-9485 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit, YARN Affects Versions: 1.4.1 Environment: DEV Reporter: Philip Adetiloye Priority: Minor Spark-submit throws an exception when connecting to yarn but it works when used in standalone mode. I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but got the same exception below. Here is a stack trace of the exception: 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) java.lang.NullPointerException at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:193) at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033) at $iwC$$iwC.<init>(<console>:9) at $iwC.<init>(<console>:18) at <init>(<console>:20) at .<init>(<console>:24) at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at
[jira] [Comment Edited] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648233#comment-14648233 ] Philip Adetiloye edited comment on SPARK-9485 at 7/30/15 8:16 PM: -- [~srowen] Thanks for the quick reply. It actually consistent (everytime) and here is the details of my configuration. conf/spark-env.sh basically has this settings: #!/usr/bin/env bash HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop SPARK_YARN_QUEUE=dev and my conf/slaves 10.0.0.204 10.0.0.205 ~/.profile contains my settings here: ` export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed s:bin/java::) export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_YARN_HOME=$HADOOP_INSTALL export HADOOP_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop export HADOOP_COMMON_HOME=$HADOOP_INSTALL export YARN_CONF_DIR=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS=-Djava.library.path=$HADOOP_HOME/lib export HADOOP_OPTS=$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native export PATH=$PATH:/usr/local/spark/sbin export PATH=$PATH:/usr/local/spark/bin export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/ export SCALA_HOME=/usr/local/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH ` Hope this helps. Thanks, - Phil was (Author: pkadetiloye): [~srowen] Thanks for the quick reply. It actually consistent (everytime) and here is the details of my configuration. conf/spark-env.sh basically has this settings: #!/usr/bin/env bash HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop SPARK_YARN_QUEUE=dev and my conf/slaves 10.0.0.204 10.0.0.205 ~/.profile contains my settings here: export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed s:bin/java::) export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_YARN_HOME=$HADOOP_INSTALL export HADOOP_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop export HADOOP_COMMON_HOME=$HADOOP_INSTALL export YARN_CONF_DIR=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS=-Djava.library.path=$HADOOP_HOME/lib export HADOOP_OPTS=$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native export PATH=$PATH:/usr/local/spark/sbin export PATH=$PATH:/usr/local/spark/bin export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/ export SCALA_HOME=/usr/local/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH Hope this helps. Thanks, - Phil Failed to connect to yarn / spark-submit --master yarn-client - Key: SPARK-9485 URL: https://issues.apache.org/jira/browse/SPARK-9485 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit, YARN Affects Versions: 1.4.1 Environment: DEV Reporter: Philip Adetiloye Priority: Minor Spark-submit throws an exception when connecting to yarn but it works when used in standalone mode. I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but got the same exception below. 
spark-submit --master yarn-client Here is a stack trace of the exception: 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at
[jira] [Comment Edited] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648233#comment-14648233 ] Philip Adetiloye edited comment on SPARK-9485 at 7/30/15 8:16 PM: -- [~srowen] Thanks for the quick reply. It actually consistent (everytime) and here is the details of my configuration. conf/spark-env.sh basically has this settings: #!/usr/bin/env bash HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop SPARK_YARN_QUEUE=dev and my conf/slaves 10.0.0.204 10.0.0.205 ~/.profile contains my settings here: export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed s:bin/java::) export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_YARN_HOME=$HADOOP_INSTALL export HADOOP_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop export HADOOP_COMMON_HOME=$HADOOP_INSTALL export YARN_CONF_DIR=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS=-Djava.library.path=$HADOOP_HOME/lib export HADOOP_OPTS=$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native export PATH=$PATH:/usr/local/spark/sbin export PATH=$PATH:/usr/local/spark/bin export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/ export SCALA_HOME=/usr/local/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH Hope this helps. Thanks, - Phil was (Author: pkadetiloye): [~srowen] Thanks for the quick reply. It actually consistent (everytime) and here is the details of my configuration. conf/spark-env.sh basically has this settings: #!/usr/bin/env bash HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop SPARK_YARN_QUEUE=dev and my conf/slaves 10.0.0.204 10.0.0.205 ~/.profile contains my settings here: ` export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed s:bin/java::) export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_YARN_HOME=$HADOOP_INSTALL export HADOOP_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop export HADOOP_COMMON_HOME=$HADOOP_INSTALL export YARN_CONF_DIR=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS=-Djava.library.path=$HADOOP_HOME/lib export HADOOP_OPTS=$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native export PATH=$PATH:/usr/local/spark/sbin export PATH=$PATH:/usr/local/spark/bin export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/ export SCALA_HOME=/usr/local/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH ` Hope this helps. Thanks, - Phil Failed to connect to yarn / spark-submit --master yarn-client - Key: SPARK-9485 URL: https://issues.apache.org/jira/browse/SPARK-9485 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit, YARN Affects Versions: 1.4.1 Environment: DEV Reporter: Philip Adetiloye Priority: Minor Spark-submit throws an exception when connecting to yarn but it works when used in standalone mode. I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but got the same exception below. 
spark-submit --master yarn-client Here is a stack trace of the exception: 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at
[jira] [Commented] (SPARK-9486) Add aliasing to data sources to allow external packages to register themselves with Spark
[ https://issues.apache.org/jira/browse/SPARK-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648287#comment-14648287 ] Apache Spark commented on SPARK-9486: - User 'JDrit' has created a pull request for this issue: https://github.com/apache/spark/pull/7802 Add aliasing to data sources to allow external packages to register themselves with Spark - Key: SPARK-9486 URL: https://issues.apache.org/jira/browse/SPARK-9486 Project: Spark Issue Type: Improvement Components: SQL Reporter: Joseph Batchik Priority: Minor Currently Spark allows users to use external data sources like spark-avro, spark-csv, etc. by having them specify their full class name: {code:java} sqlContext.read.format("com.databricks.spark.avro").load(path) {code} Typing in a full class name is not the best idea, so it would be nice to allow external packages to register themselves with Spark, letting users do something like: {code:java} sqlContext.read.format("avro").load(path) {code} This would make it so that the external data source packages follow the same convention as the built-in data sources (parquet, json, jdbc, etc.) do. This could be accomplished by using a ServiceLoader. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
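For illustration, a minimal sketch of the ServiceLoader approach; the trait and class names below are hypothetical, not the interface the pull request actually adds:

{code:java}
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Each data source package exposes a short alias for itself.
trait DataSourceRegister {
  def shortName(): String
}

// An external package would ship a class like this plus a
// META-INF/services file naming it, so the JVM can discover it.
class AvroRelationProvider extends DataSourceRegister {
  override def shortName(): String = "avro"
}

object DataSourceResolver {
  // format("avro") then becomes a registry lookup instead of Class.forName.
  def lookup(name: String): Option[DataSourceRegister] =
    ServiceLoader.load(classOf[DataSourceRegister]).asScala
      .find(_.shortName().equalsIgnoreCase(name))
}
{code}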
[jira] [Commented] (SPARK-6684) Add checkpointing to GradientBoostedTrees
[ https://issues.apache.org/jira/browse/SPARK-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648310#comment-14648310 ] Apache Spark commented on SPARK-6684: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/7804 Add checkpointing to GradientBoostedTrees - Key: SPARK-6684 URL: https://issues.apache.org/jira/browse/SPARK-6684 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley We should add checkpointing to GradientBoostedTrees since it maintains RDDs with long lineages. keywords: gradient boosting, gbt, gradient boosted trees -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
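For illustration, a minimal sketch of periodic checkpointing to truncate a long lineage; the checkpoint directory, the interval, and the map standing in for one boosting iteration are illustrative assumptions:

{code:java}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object CheckpointingLoop {
  def run(sc: SparkContext, data: RDD[Double], numIterations: Int): RDD[Double] = {
    sc.setCheckpointDir("/tmp/gbt-checkpoints") // assumed path
    var current = data
    for (i <- 1 to numIterations) {
      current = current.map(_ * 1.01) // stand-in for one boosting iteration
      if (i % 10 == 0) {              // checkpoint every 10 iterations
        current.checkpoint()
        current.count()               // force materialization so the lineage is cut
      }
    }
    current
  }
}
{code}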
[jira] [Assigned] (SPARK-6684) Add checkpointing to GradientBoostedTrees
[ https://issues.apache.org/jira/browse/SPARK-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6684: --- Assignee: Apache Spark (was: Joseph K. Bradley) Add checkpointing to GradientBoostedTrees - Key: SPARK-6684 URL: https://issues.apache.org/jira/browse/SPARK-6684 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Apache Spark We should add checkpointing to GradientBoostedTrees since it maintains RDDs with long lineages. keywords: gradient boosting, gbt, gradient boosted trees -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6684) Add checkpointing to GradientBoostedTrees
[ https://issues.apache.org/jira/browse/SPARK-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6684: --- Assignee: Joseph K. Bradley (was: Apache Spark) Add checkpointing to GradientBoostedTrees - Key: SPARK-6684 URL: https://issues.apache.org/jira/browse/SPARK-6684 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley We should add checkpointing to GradientBoostedTrees since it maintains RDDs with long lineages. keywords: gradient boosting, gbt, gradient boosted trees -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5567) Add prediction methods to LDA
[ https://issues.apache.org/jira/browse/SPARK-5567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5567: - Assignee: Feynman Liang Add prediction methods to LDA - Key: SPARK-5567 URL: https://issues.apache.org/jira/browse/SPARK-5567 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Feynman Liang Original Estimate: 168h Remaining Estimate: 168h LDA currently supports prediction on the training set. E.g., you can call logLikelihood and topicDistributions to get that info for the training data. However, it should support the same functionality for new (test) documents. This will require inference but should be able to use the same code, with a few modifications to keep the inferred topics fixed. Note: The API for these methods is already in the code but is commented out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
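For context, a minimal sketch of the training-set calls that already exist in the spark.mllib API; the proposal is to offer analogous calls for unseen documents with the inferred topics held fixed:

{code:java}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object LdaTrainingSetPrediction {
  // corpus: (docId, termCountVector) pairs
  def run(corpus: RDD[(Long, Vector)]): Unit = {
    val model = new LDA().setK(10).run(corpus).asInstanceOf[DistributedLDAModel]
    println(model.logLikelihood)              // likelihood of the training set
    println(model.topicDistributions.first()) // per-document topic mixture
  }
}
{code}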
[jira] [Comment Edited] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648233#comment-14648233 ] Philip Adetiloye edited comment on SPARK-9485 at 7/30/15 8:17 PM: -- [~srowen] Thanks for the quick reply. It actually consistent (everytime) and here is the details of my configuration. conf/spark-env.sh basically has this settings: #!/usr/bin/env bash HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop SPARK_YARN_QUEUE=dev and my conf/slaves 10.0.0.204 10.0.0.205 ~/.profile contains my settings here: export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed s:bin/java::) export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_YARN_HOME=$HADOOP_INSTALL export HADOOP_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop export HADOOP_COMMON_HOME=$HADOOP_INSTALL export YARN_CONF_DIR=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS=-Djava.library.path=$HADOOP_HOME/lib export HADOOP_OPTS=$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native export PATH=$PATH:/usr/local/spark/sbin export PATH=$PATH:/usr/local/spark/bin export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/ export SCALA_HOME=/usr/local/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH Hope this helps. Thanks, Phil was (Author: pkadetiloye): [~srowen] Thanks for the quick reply. It actually consistent (everytime) and here is the details of my configuration. conf/spark-env.sh basically has this settings: #!/usr/bin/env bash HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop SPARK_YARN_QUEUE=dev and my conf/slaves 10.0.0.204 10.0.0.205 ~/.profile contains my settings here: export JAVA_HOME=$(readlink -f /usr/share/jdk1.8.0_45/bin/java | sed s:bin/java::) export HADOOP_INSTALL=/usr/local/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME=$HADOOP_INSTALL export HADOOP_COMMON_HOME=$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL export HADOOP_YARN_HOME=$HADOOP_INSTALL export HADOOP_HOME=$HADOOP_INSTALL export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop export HADOOP_COMMON_HOME=$HADOOP_INSTALL export YARN_CONF_DIR=$HADOOP_INSTALL export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS=-Djava.library.path=$HADOOP_HOME/lib export HADOOP_OPTS=$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/native export PATH=$PATH:/usr/local/spark/sbin export PATH=$PATH:/usr/local/spark/bin export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/ export SCALA_HOME=/usr/local/scala-2.10.4 export PATH=$SCALA_HOME/bin:$PATH Hope this helps. Thanks, - Phil Failed to connect to yarn / spark-submit --master yarn-client - Key: SPARK-9485 URL: https://issues.apache.org/jira/browse/SPARK-9485 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit, YARN Affects Versions: 1.4.1 Environment: DEV Reporter: Philip Adetiloye Priority: Minor Spark-submit throws an exception when connecting to yarn but it works when used in standalone mode. I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but got the same exception below. 
spark-submit --master yarn-client Here is a stack trace of the exception: 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at
[jira] [Resolved] (SPARK-5567) Add prediction methods to LDA
[ https://issues.apache.org/jira/browse/SPARK-5567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-5567. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7760 [https://github.com/apache/spark/pull/7760] Add prediction methods to LDA - Key: SPARK-5567 URL: https://issues.apache.org/jira/browse/SPARK-5567 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Feynman Liang Fix For: 1.5.0 Original Estimate: 168h Remaining Estimate: 168h LDA currently supports prediction on the training set. E.g., you can call logLikelihood and topicDistributions to get that info for the training data. However, it should support the same functionality for new (test) documents. This will require inference but should be able to use the same code, with a few modifications to keep the inferred topics fixed. Note: The API for these methods is already in the code but is commented out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9133) Add and Subtract should support date/timestamp and interval type
[ https://issues.apache.org/jira/browse/SPARK-9133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9133. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7754 [https://github.com/apache/spark/pull/7754] Add and Subtract should support date/timestamp and interval type Key: SPARK-9133 URL: https://issues.apache.org/jira/browse/SPARK-9133 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Davies Liu Fix For: 1.5.0 Should support: date + interval, interval + date, timestamp + interval, interval + timestamp. The best way to support this is probably to resolve this to a date add/subtract expression, rather than making add/subtract support these types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8194) date/time function: add_months
[ https://issues.apache.org/jira/browse/SPARK-8194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8194. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7754 [https://github.com/apache/spark/pull/7754] date/time function: add_months -- Key: SPARK-8194 URL: https://issues.apache.org/jira/browse/SPARK-8194 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Fix For: 1.5.0
{noformat}
add_months(string start_date, int num_months): string
add_months(date start_date, int num_months): date
{noformat}
Returns the date that is num_months after start_date. The time part of start_date is ignored. If start_date is the last day of the month or if the resulting month has fewer days than the day component of start_date, then the result is the last day of the resulting month. Otherwise, the result has the same day component as start_date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
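Since the edge cases are easy to misread, here is a small Scala sketch of the rule exactly as specified above, built on JDK 8's {{java.time}} (an illustration of the semantics, not Spark's implementation):
{code}
import java.time.LocalDate

// Per the spec: the last day of a month maps to the last day of the result month;
// otherwise keep the day component, clamping when the target month is shorter.
def addMonthsSpec(start: LocalDate, numMonths: Int): LocalDate = {
  val shifted = start.plusMonths(numMonths)
  if (start.getDayOfMonth == start.lengthOfMonth) shifted.withDayOfMonth(shifted.lengthOfMonth)
  else shifted // plusMonths already clamps the day when needed
}

addMonthsSpec(LocalDate.of(2015, 1, 31), 1) // 2015-02-28 (February is shorter)
addMonthsSpec(LocalDate.of(2015, 2, 28), 1) // 2015-03-31 (last day -> last day)
{code}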
[jira] [Resolved] (SPARK-8186) date/time function: date_add
[ https://issues.apache.org/jira/browse/SPARK-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8186. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7754 [https://github.com/apache/spark/pull/7754] date/time function: date_add Key: SPARK-8186 URL: https://issues.apache.org/jira/browse/SPARK-8186 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang Fix For: 1.5.0
{noformat}
date_add(timestamp startdate, int days): timestamp
date_add(timestamp startdate, interval i): timestamp
date_add(date date, int days): date
date_add(date date, interval i): date
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
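For reference, a hedged sketch of the int overload as it might be called from the shell (assuming the shell's {{sqlContext}}; the expected value is plain day arithmetic, not verified output):
{code}
// Adding 7 days to 2015-07-30 should yield 2015-08-06; date_sub mirrors this.
sqlContext.sql("SELECT date_add(CAST('2015-07-30' AS DATE), 7)").show()
{code}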
[jira] [Resolved] (SPARK-8187) date/time function: date_sub
[ https://issues.apache.org/jira/browse/SPARK-8187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8187. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7754 [https://github.com/apache/spark/pull/7754] date/time function: date_sub Key: SPARK-8187 URL: https://issues.apache.org/jira/browse/SPARK-8187 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang Fix For: 1.5.0
{noformat}
date_sub(timestamp startdate, int days): timestamp
date_sub(timestamp startdate, interval i): timestamp
date_sub(date date, int days): date
date_sub(date date, interval i): date
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9290) DateExpressionsSuite is slow to run
[ https://issues.apache.org/jira/browse/SPARK-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9290. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7754 [https://github.com/apache/spark/pull/7754] DateExpressionsSuite is slow to run --- Key: SPARK-9290 URL: https://issues.apache.org/jira/browse/SPARK-9290 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Fix For: 1.5.0 We are running way too many test cases in here. {code} [info] - DayOfYear (16 seconds, 998 milliseconds) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8198) date/time function: months_between
[ https://issues.apache.org/jira/browse/SPARK-8198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8198. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7754 [https://github.com/apache/spark/pull/7754] date/time function: months_between -- Key: SPARK-8198 URL: https://issues.apache.org/jira/browse/SPARK-8198 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Fix For: 1.5.0 months_between(date1, date2): double Returns number of months between dates date1 and date2 (as of Hive 1.2.0). If date1 is later than date2, then the result is positive. If date1 is earlier than date2, then the result is negative. If date1 and date2 are either the same days of the month or both last days of months, then the result is always an integer. Otherwise the UDF calculates the fractional portion of the result based on a 31-day month and considers the difference in the time components of date1 and date2. date1 and date2 can be of type date, timestamp or string in the format 'yyyy-MM-dd' or 'yyyy-MM-dd HH:mm:ss'. The result is rounded to 8 decimal places. Example: months_between('1997-02-28 10:30:00', '1996-10-30') = 3.94959677 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
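The example works out as follows (a quick arithmetic check of the stated formula, not Spark code):
{code}
// 1996-10-30 -> 1997-02-28 spans 4 whole months; the day-of-month difference is 28 - 30;
// date1 carries a 10.5-hour time component; the fraction is taken over a 31-day month.
val months = 4 + (28 - 30 + 10.5 / 24) / 31
// months == 3.9495967741935485, i.e. 3.94959677 after rounding to 8 decimal places
{code}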
[jira] [Commented] (SPARK-9478) Add class weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648184#comment-14648184 ] Joseph K. Bradley commented on SPARK-9478: -- This sounds valuable. Handling it by reweighting examples (as is being done for logreg) seems like the simplest solution for now. I'll keep an eye on the ticket! Add class weights to Random Forest -- Key: SPARK-9478 URL: https://issues.apache.org/jira/browse/SPARK-9478 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 1.4.1 Reporter: Patrick Crenshaw Currently, this implementation of random forest does not support class weights. Class weights are important when there is imbalanced training data or the evaluation metric of a classifier is imbalanced (e.g. true positive rate at some false positive threshold). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
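For anyone needing a stopgap today, a hedged sketch of the reweighting idea approximated by resampling ({{oversampleMinority}} is a hypothetical helper, not an MLlib API):
{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Approximate an integer class weight by unioning extra copies of the minority class
// before training; factor = 3 roughly mimics a class weight of 3 for that label.
def oversampleMinority(data: RDD[LabeledPoint], minorityLabel: Double, factor: Int): RDD[LabeledPoint] = {
  val minority = data.filter(_.label == minorityLabel)
  (1 until factor).foldLeft(data)((acc, _) => acc.union(minority))
}
{code}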
[jira] [Commented] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648200#comment-14648200 ] Sean Owen commented on SPARK-9485: -- I don't think this is sufficient to be a JIRA bug report; there's no detail for reproducing it. It also just appears to be some kind of (other) error at startup causing initialization to fail. Can you start on user@ please? And if there isn't guidance there, provide a consistent reproduction? Failed to connect to yarn / spark-submit --master yarn-client - Key: SPARK-9485 URL: https://issues.apache.org/jira/browse/SPARK-9485 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit, YARN Affects Versions: 1.4.1 Environment: DEV Reporter: Philip Adetiloye Priority: Minor Spark-submit throws an exception when connecting to yarn but it works when used in standalone mode. I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but got the same exception below. spark-submit --master yarn-client Here is a stack trace of the exception: 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at
java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) java.lang.NullPointerException at org.apache.spark.sql.SQLContext.init(SQLContext.scala:193) at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033) at $iwC$$iwC.init(console:9) at $iwC.init(console:18) at init(console:20) at .init(console:24) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at
[jira] [Updated] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client
[ https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Adetiloye updated SPARK-9485: Shepherd: MEN CHAMROEUN Failed to connect to yarn / spark-submit --master yarn-client - Key: SPARK-9485 URL: https://issues.apache.org/jira/browse/SPARK-9485 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit, YARN Affects Versions: 1.4.1 Environment: DEV Reporter: Philip Adetiloye Priority: Minor Spark-submit throws an exception when connecting to yarn but it works when used in standalone mode. I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but got the same exception below. spark-submit --master yarn-client Here is a stack trace of the exception: 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411) at org.apache.spark.SparkContext.stop(SparkContext.scala:1644) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139) Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193) at
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) java.lang.NullPointerException at org.apache.spark.sql.SQLContext.init(SQLContext.scala:193) at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033) at $iwC$$iwC.init(console:9) at $iwC.init(console:18) at init(console:20) at .init(console:24) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at
[jira] [Updated] (SPARK-9481) LocalLDAModel logLikelihood
[ https://issues.apache.org/jira/browse/SPARK-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9481: - Shepherd: Joseph K. Bradley Assignee: Feynman Liang LocalLDAModel logLikelihood --- Key: SPARK-9481 URL: https://issues.apache.org/jira/browse/SPARK-9481 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Feynman Liang Assignee: Feynman Liang Priority: Trivial We already have a variational {{bound}} method, so we should provide a public {{logLikelihood}} that uses the model's parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9454) LDASuite should use vector comparisons
[ https://issues.apache.org/jira/browse/SPARK-9454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9454: - Shepherd: Joseph K. Bradley LDASuite should use vector comparisons -- Key: SPARK-9454 URL: https://issues.apache.org/jira/browse/SPARK-9454 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Feynman Liang Assignee: Feynman Liang Priority: Minor Fix For: 1.5.0 {{LDASuite}}'s "OnlineLDAOptimizer one iteration" test currently checks correctness using hacky string comparisons. We should compare the vectors instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9458) Avoid object allocation in prefix generation
[ https://issues.apache.org/jira/browse/SPARK-9458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648308#comment-14648308 ] Apache Spark commented on SPARK-9458: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7803 Avoid object allocation in prefix generation Key: SPARK-9458 URL: https://issues.apache.org/jira/browse/SPARK-9458 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 In our existing sort prefix generation code, we use the expression's eval method to generate the prefix, which results in an object allocation for every prefix. We can use the specialized getters available on InternalRow directly to avoid the object allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
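A simplified sketch of the two code paths (catalyst class names as of 1.5; an illustration of the idea, not the actual patch):
{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.BoundReference
import org.apache.spark.sql.types.LongType

val ref = BoundReference(0, LongType, nullable = false)

// eval() returns Any, so every call boxes the value into a java.lang.Long.
def boxedPrefix(row: InternalRow): Long = ref.eval(row).asInstanceOf[Long]

// The specialized getter reads the primitive directly: no per-row allocation.
def primitivePrefix(row: InternalRow): Long = row.getLong(0)
{code}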
[jira] [Resolved] (SPARK-9454) LDASuite should use vector comparisons
[ https://issues.apache.org/jira/browse/SPARK-9454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-9454. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7775 [https://github.com/apache/spark/pull/7775] LDASuite should use vector comparisons -- Key: SPARK-9454 URL: https://issues.apache.org/jira/browse/SPARK-9454 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Feynman Liang Assignee: Feynman Liang Priority: Minor Fix For: 1.5.0 {{LDASuite}}'s "OnlineLDAOptimizer one iteration" test currently checks correctness using hacky string comparisons. We should compare the vectors instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9486) Add aliasing to data sources to allow external packages to register themselves with Spark
Joseph Batchik created SPARK-9486: - Summary: Add aliasing to data sources to allow external packages to register themselves with Spark Key: SPARK-9486 URL: https://issues.apache.org/jira/browse/SPARK-9486 Project: Spark Issue Type: Improvement Components: SQL Reporter: Joseph Batchik Priority: Minor Currently Spark allows users to use external data sources like spark-avro, spark-csv, etc. by having them specify the full class name: {code:java} sqlContext.read.format("com.databricks.spark.avro").load(path) {code} Typing in a full class name is not ideal, so it would be nice to allow external packages to register themselves with Spark so that users can do something like: {code:java} sqlContext.read.format("avro").load(path) {code} This would make the external data source packages follow the same convention as the built-in data sources (parquet, json, jdbc, etc.). This could be accomplished by using a ServiceLoader. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
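A minimal sketch of how the ServiceLoader approach could look (the {{DataSourceRegister}} trait and {{shortName}} method here are hypothetical, sketched from the description):
{code}
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Hypothetical provider interface: an external package implements it and lists the
// implementing class in META-INF/services/<fully.qualified.interface.name>.
trait DataSourceRegister {
  def shortName: String
}

// Resolve a short alias like "avro" by scanning all registered providers.
def lookupByAlias(alias: String): Option[DataSourceRegister] =
  ServiceLoader.load(classOf[DataSourceRegister]).asScala.find(_.shortName == alias)
{code}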
[jira] [Created] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
Xiangrui Meng created SPARK-9487: Summary: Use the same num. worker threads in Scala/Python unit tests Key: SPARK-9487 URL: https://issues.apache.org/jira/browse/SPARK-9487 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core, SQL, Tests Affects Versions: 1.5.0 Reporter: Xiangrui Meng In Python we use `local[4]` for unit tests, while in Scala/Java we use `local[2]` and `local` for some unit tests in SQL, MLlib, and other components. If the operation depends on partition IDs, e.g., a random number generator, this will lead to different results in Python and Scala/Java. It would be nice to use the same number in all unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
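To see why the thread count leaks into results, a small sketch (standard Spark APIs; the per-partition seeding is just an example):
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("test"))

// parallelize defaults to one partition per worker thread, so the partition IDs,
// and anything seeded from them, differ between local[2] and local[4].
val perPartition = sc.parallelize(1 to 8).mapPartitionsWithIndex { (pid, iter) =>
  Iterator(new java.util.Random(pid).nextInt())
}.collect()
{code}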
[jira] [Updated] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-4823: Attachment: SparkMeetup2015-Experiments2.pdf SparkMeetup2015-Experiments1.pdf rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Attachments: MovieLensSimilarity Comparisons.pdf, SparkMeetup2015-Experiments1.pdf, SparkMeetup2015-Experiments2.pdf RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, beyond brute force. Note that when there are many rows (> 10^6), it is unlikely that brute force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648340#comment-14648340 ] Debasish Das commented on SPARK-4823: - We ran a more detailed experiment for the July 2015 Spark Meetup to understand the shuffle effects on runtime. I attached the data for the experiments to the JIRA. I will update the PR as discussed with Reza. I am targeting one PR for Spark 1.5. rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Attachments: MovieLensSimilarity Comparisons.pdf RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, beyond brute force. Note that when there are many rows (> 10^6), it is unlikely that brute force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9320) Add `summary` as a synonym for `describe`
[ https://issues.apache.org/jira/browse/SPARK-9320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9320: --- Assignee: (was: Apache Spark) Add `summary` as a synonym for `describe` - Key: SPARK-9320 URL: https://issues.apache.org/jira/browse/SPARK-9320 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman `summary` is used to provide similar functionality in R data frames. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9318: --- Assignee: Apache Spark Add `merge` as synonym for join --- Key: SPARK-9318 URL: https://issues.apache.org/jira/browse/SPARK-9318 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9320) Add `summary` as a synonym for `describe`
[ https://issues.apache.org/jira/browse/SPARK-9320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648388#comment-14648388 ] Apache Spark commented on SPARK-9320: - User 'falaki' has created a pull request for this issue: https://github.com/apache/spark/pull/7806 Add `summary` as a synonym for `describe` - Key: SPARK-9320 URL: https://issues.apache.org/jira/browse/SPARK-9320 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman `summary` is used to provide similar functionality in R data frames. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648387#comment-14648387 ] Apache Spark commented on SPARK-9318: - User 'falaki' has created a pull request for this issue: https://github.com/apache/spark/pull/7806 Add `merge` as synonym for join --- Key: SPARK-9318 URL: https://issues.apache.org/jira/browse/SPARK-9318 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9320) Add `summary` as a synonym for `describe`
[ https://issues.apache.org/jira/browse/SPARK-9320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9320: --- Assignee: Apache Spark Add `summary` as a synonym for `describe` - Key: SPARK-9320 URL: https://issues.apache.org/jira/browse/SPARK-9320 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Assignee: Apache Spark `summary` is used to provide similar functionality in R data frames. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9318) Add `merge` as synonym for join
[ https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9318: --- Assignee: (was: Apache Spark) Add `merge` as synonym for join --- Key: SPARK-9318 URL: https://issues.apache.org/jira/browse/SPARK-9318 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9489) Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange
[ https://issues.apache.org/jira/browse/SPARK-9489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9489: --- Assignee: Apache Spark (was: Josh Rosen) Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange --- Key: SPARK-9489 URL: https://issues.apache.org/jira/browse/SPARK-9489 Project: Spark Issue Type: Improvement Components: SQL Reporter: Josh Rosen Assignee: Apache Spark While reviewing [~yhuai]'s patch for SPARK-2205, I noticed that Exchange's {{compatible}} check may be incorrectly returning {{false}} in many cases. As far as I know, this is not actually a problem because the {{compatible}}, {{meetsRequirements}}, and {{needsAnySort}} checks are serving only as short-circuit performance optimizations that are not necessary for correctness. In order to reduce code complexity, I think that we should remove these checks and unconditionally rewrite the operator's children. This should be safe because we rewrite the tree in a single bottom-up pass. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9489) Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange
[ https://issues.apache.org/jira/browse/SPARK-9489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648405#comment-14648405 ] Apache Spark commented on SPARK-9489: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/7807 Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange --- Key: SPARK-9489 URL: https://issues.apache.org/jira/browse/SPARK-9489 Project: Spark Issue Type: Improvement Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen While reviewing [~yhuai]'s patch for SPARK-2205, I noticed that Exchange's {{compatible}} check may be incorrectly returning {{false}} in many cases. As far as I know, this is not actually a problem because the {{compatible}}, {{meetsRequirements}}, and {{needsAnySort}} checks are serving only as short-circuit performance optimizations that are not necessary for correctness. In order to reduce code complexity, I think that we should remove these checks and unconditionally rewrite the operator's children. This should be safe because we rewrite the tree in a single bottom-up pass. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9489) Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange
[ https://issues.apache.org/jira/browse/SPARK-9489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9489: --- Assignee: Josh Rosen (was: Apache Spark) Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange --- Key: SPARK-9489 URL: https://issues.apache.org/jira/browse/SPARK-9489 Project: Spark Issue Type: Improvement Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen While reviewing [~yhuai]'s patch for SPARK-2205, I noticed that Exchange's {{compatible}} check may be incorrectly returning {{false}} in many cases. As far as I know, this is not actually a problem because the {{compatible}}, {{meetsRequirements}}, and {{needsAnySort}} checks are serving only as short-circuit performance optimizations that are not necessary for correctness. In order to reduce code complexity, I think that we should remove these checks and unconditionally rewrite the operator's children. This should be safe because we rewrite the tree in a single bottom-up pass. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9479) ReceiverTrackerSuite fails for maven build
[ https://issues.apache.org/jira/browse/SPARK-9479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-9479. -- Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 1.5.0 ReceiverTrackerSuite fails for maven build -- Key: SPARK-9479 URL: https://issues.apache.org/jira/browse/SPARK-9479 Project: Spark Issue Type: Bug Components: Streaming, Tests Reporter: Shixiong Zhu Assignee: Shixiong Zhu Fix For: 1.5.0 The test failure is here: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3109/ I saw the following exception in the log: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.NullPointerException org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:80) org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) org.apache.spark.SparkContext.broadcast(SparkContext.scala:1297) org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:834) {code} This exception occurs because SparkEnv.get returns null. The maven build differs from the sbt build: the maven build creates all Suite classes at the beginning, and `ReceiverTrackerSuite` creates its StreamingContext (and hence SparkContext) in the constructor. That means the SparkContext is created very early, and the global SparkEnv will have been set to null by the previous test. Therefore we saw the above exception when running `Receiver tracker - propagates rate limit` in `ReceiverTrackerSuite`. This test was added recently. Note: the previous tests in `ReceiverTrackerSuite` didn't actually use SparkContext, which is why we didn't see such a failure before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
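A hedged sketch of the shape of fix this points to (plain ScalaTest; names and details assumed, not the actual patch): create the context inside the test instead of in the suite constructor, so a SparkEnv torn down by an earlier suite cannot leak in.
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.scalatest.FunSuite

class LazyContextSuite extends FunSuite {
  test("Receiver tracker - propagates rate limit") {
    // Created per test, not eagerly when maven instantiates every Suite class.
    val conf = new SparkConf().setMaster("local[2]").setAppName("ReceiverTrackerSuite")
    val ssc = new StreamingContext(conf, Seconds(1))
    try {
      // ... exercise the receiver tracker here ...
    } finally {
      ssc.stop()
    }
  }
}
{code}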