[jira] [Commented] (SPARK-9489) Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648405#comment-14648405
 ] 

Apache Spark commented on SPARK-9489:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7807

> Remove compatibleWith, meetsRequirements, and needsAnySort checks from 
> Exchange
> ---
>
> Key: SPARK-9489
> URL: https://issues.apache.org/jira/browse/SPARK-9489
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> While reviewing [~yhuai]'s patch for SPARK-2205, I noticed that Exchange's 
> {{compatible}} check may be incorrectly returning {{false}} in many cases.  
> As far as I know, this is not actually a problem because the {{compatible}}, 
> {{meetsRequirements}}, and {{needsAnySort}} checks are serving only as 
> short-circuit performance optimizations that are not necessary for 
> correctness.
> In order to reduce code complexity, I think that we should remove these 
> checks and unconditionally rewrite the operator's children.  This should be 
> safe because we rewrite the tree in a single bottom-up pass.






[jira] [Assigned] (SPARK-9489) Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9489:
---

Assignee: Apache Spark  (was: Josh Rosen)

> Remove compatibleWith, meetsRequirements, and needsAnySort checks from 
> Exchange
> ---
>
> Key: SPARK-9489
> URL: https://issues.apache.org/jira/browse/SPARK-9489
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> While reviewing [~yhuai]'s patch for SPARK-2205, I noticed that Exchange's 
> {{compatible}} check may be incorrectly returning {{false}} in many cases.  
> As far as I know, this is not actually a problem because the {{compatible}}, 
> {{meetsRequirements}}, and {{needsAnySort}} checks are serving only as 
> short-circuit performance optimizations that are not necessary for 
> correctness.
> In order to reduce code complexity, I think that we should remove these 
> checks and unconditionally rewrite the operator's children.  This should be 
> safe because we rewrite the tree in a single bottom-up pass.






[jira] [Created] (SPARK-9489) Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange

2015-07-30 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-9489:
-

 Summary: Remove compatibleWith, meetsRequirements, and 
needsAnySort checks from Exchange
 Key: SPARK-9489
 URL: https://issues.apache.org/jira/browse/SPARK-9489
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen


While reviewing [~yhuai]'s patch for SPARK-2205, I noticed that Exchange's 
{{compatible}} check may be incorrectly returning {{false}} in many cases.  As 
far as I know, this is not actually a problem because the {{compatible}}, 
{{meetsRequirements}}, and {{needsAnySort}} checks are serving only as 
short-circuit performance optimizations that are not necessary for correctness.

In order to reduce code complexity, I think that we should remove these checks 
and unconditionally rewrite the operator's children.  This should be safe 
because we rewrite the tree in a single bottom-up pass.
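For intuition only, here is a minimal plain-Python sketch of what "unconditionally rewrite the operator's children in a single bottom-up pass" means; the node and wrapper names are hypothetical, and this is not Spark's Catalyst code:

{code:python}
class Node(object):
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def add_exchange(child):
    # Hypothetical stand-in for wrapping a child in an Exchange operator.
    return Node("Exchange", [child])

def ensure_requirements(node):
    # Single bottom-up pass: children are rewritten first, then wrapped
    # unconditionally -- no compatibleWith / meetsRequirements / needsAnySort
    # short-circuit checks.
    node.children = [add_exchange(ensure_requirements(c)) for c in node.children]
    return node
{code}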






[jira] [Commented] (SPARK-9320) Add `summary` as a synonym for `describe`

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648388#comment-14648388
 ] 

Apache Spark commented on SPARK-9320:
-

User 'falaki' has created a pull request for this issue:
https://github.com/apache/spark/pull/7806

> Add `summary` as a synonym for `describe`
> -
>
> Key: SPARK-9320
> URL: https://issues.apache.org/jira/browse/SPARK-9320
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> In R, `summary` provides similar functionality for data frames.






[jira] [Assigned] (SPARK-9320) Add `summary` as a synonym for `describe`

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9320:
---

Assignee: (was: Apache Spark)

> Add `summary` as a synonym for `describe`
> -
>
> Key: SPARK-9320
> URL: https://issues.apache.org/jira/browse/SPARK-9320
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> In R, `summary` provides similar functionality for data frames.






[jira] [Assigned] (SPARK-9318) Add `merge` as synonym for join

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9318:
---

Assignee: Apache Spark

> Add `merge` as synonym for join
> ---
>
> Key: SPARK-9318
> URL: https://issues.apache.org/jira/browse/SPARK-9318
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-9318) Add `merge` as synonym for join

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648387#comment-14648387
 ] 

Apache Spark commented on SPARK-9318:
-

User 'falaki' has created a pull request for this issue:
https://github.com/apache/spark/pull/7806

> Add `merge` as synonym for join
> ---
>
> Key: SPARK-9318
> URL: https://issues.apache.org/jira/browse/SPARK-9318
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>







[jira] [Assigned] (SPARK-9320) Add `summary` as a synonym for `describe`

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9320:
---

Assignee: Apache Spark

> Add `summary` as a synonym for `describe`
> -
>
> Key: SPARK-9320
> URL: https://issues.apache.org/jira/browse/SPARK-9320
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Apache Spark
>
> In R, `summary` provides similar functionality for data frames.






[jira] [Assigned] (SPARK-9318) Add `merge` as synonym for join

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9318:
---

Assignee: (was: Apache Spark)

> Add `merge` as synonym for join
> ---
>
> Key: SPARK-9318
> URL: https://issues.apache.org/jira/browse/SPARK-9318
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>







[jira] [Updated] (SPARK-6486) Add BlockMatrix in PySpark

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6486:
-
Shepherd: Xiangrui Meng

> Add BlockMatrix in PySpark
> --
>
> Key: SPARK-6486
> URL: https://issues.apache.org/jira/browse/SPARK-6486
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Mike Dusenberry
>
> We should add BlockMatrix to PySpark. Internally, we can use DataFrames and 
> MatrixUDT for serialization. This JIRA should cover conversions from 
> IndexedRowMatrix/CoordinateMatrix to block matrices, but it does NOT cover 
> linear algebra operations of block matrices.
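For readers unfamiliar with the data structure, here is a small local numpy sketch of the ((blockRow, blockCol) -> sub-matrix) layout that a BlockMatrix is keyed by; it only illustrates the blocking, not the distributed implementation or the serialization mentioned above:

{code:python}
import numpy as np

def to_blocks(mat, rows_per_block, cols_per_block):
    # Split a local matrix into ((block_row, block_col), sub_matrix) pairs,
    # the same keying a distributed BlockMatrix uses for its blocks.
    blocks = {}
    for i in range(0, mat.shape[0], rows_per_block):
        for j in range(0, mat.shape[1], cols_per_block):
            blocks[(i // rows_per_block, j // cols_per_block)] = \
                mat[i:i + rows_per_block, j:j + cols_per_block]
    return blocks

print(sorted(to_blocks(np.arange(16).reshape(4, 4), 2, 2)))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
{code}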






[jira] [Updated] (SPARK-6486) Add BlockMatrix in PySpark

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6486:
-
Assignee: Mike Dusenberry

> Add BlockMatrix in PySpark
> --
>
> Key: SPARK-6486
> URL: https://issues.apache.org/jira/browse/SPARK-6486
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Mike Dusenberry
>
> We should add BlockMatrix to PySpark. Internally, we can use DataFrames and 
> MatrixUDT for serialization. This JIRA should cover conversions from 
> IndexedRowMatrix/CoordinateMatrix to block matrices, but it does NOT cover 
> linear algebra operations of block matrices.






[jira] [Created] (SPARK-9488) pyspark.sql.types.Row very slow when used named arguments

2015-07-30 Thread Alexis Benoist (JIRA)
Alexis Benoist created SPARK-9488:
-

 Summary: pyspark.sql.types.Row very slow when used named arguments
 Key: SPARK-9488
 URL: https://issues.apache.org/jira/browse/SPARK-9488
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
 Environment: 

Reporter: Alexis Benoist


We can see that the Row implementation accesses items in O(n):
https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1217
We could use an OrderedDict instead of a tuple to make access O(1). Can the keys 
be of an unhashable type?

I'm OK to do the edit.

Cheers,
Alexis.
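For context, a self-contained sketch of the two lookup strategies under discussion (plain Python, independent of the actual pyspark.sql.types.Row code):

{code:python}
class TupleRow(object):
    # O(n): field access by scanning a tuple of names, as described above.
    def __init__(self, **kwargs):
        self._fields = tuple(kwargs)           # field names
        self._values = tuple(kwargs.values())  # field values

    def __getattr__(self, name):
        return self._values[self._fields.index(name)]  # linear scan

class DictRow(object):
    # O(1): the same access backed by a dict (what an OrderedDict-based Row
    # would do); this is why the keys need to be hashable.
    def __init__(self, **kwargs):
        self._data = dict(kwargs)

    def __getattr__(self, name):
        return self._data[name]

assert TupleRow(a=1, b=2).b == DictRow(a=1, b=2).b == 2
{code}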






[jira] [Updated] (SPARK-9488) pyspark.sql.types.Row very slow when using named arguments

2015-07-30 Thread Alexis Benoist (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis Benoist updated SPARK-9488:
--
Summary: pyspark.sql.types.Row very slow when using named arguments  (was: 
pyspark.sql.types.Row very slow when used named arguments)

> pyspark.sql.types.Row very slow when using named arguments
> --
>
> Key: SPARK-9488
> URL: https://issues.apache.org/jira/browse/SPARK-9488
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.0
> Environment: 
>Reporter: Alexis Benoist
>  Labels: performance
>
> We can see that the Row implementation accesses items in O(n):
> https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1217
> We could use an OrderedDict instead of a tuple to make access O(1). Can the 
> keys be of an unhashable type?
> I'm OK to do the edit.
> Cheers,
> Alexis.






[jira] [Commented] (SPARK-8197) date/time function: trunc

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648351#comment-14648351
 ] 

Apache Spark commented on SPARK-8197:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/7805

> date/time function: trunc
> -
>
> Key: SPARK-8197
> URL: https://issues.apache.org/jira/browse/SPARK-8197
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> trunc(string date[, string format]): string
> trunc(date date[, string format]): date
> Returns the date truncated to the unit specified by the format (as of Hive 
> 1.2.0). Supported formats: MONTH/MON/MM, YEAR/YYYY/YY. If the format is omitted, 
> the date is truncated to the nearest day. Example: trunc('2015-03-17', 
> 'MM') = 2015-03-01.
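To make the truncation semantics concrete, a minimal plain-Python sketch of the behaviour described above (an illustration only, not the Spark implementation):

{code:python}
from datetime import date

def trunc(d, fmt):
    # Truncate a date to the first day of its month or year.
    if fmt.upper() in ("MONTH", "MON", "MM"):
        return d.replace(day=1)
    if fmt.upper() in ("YEAR", "YYYY", "YY"):
        return d.replace(month=1, day=1)
    return d  # no/unsupported format: leave the date at day granularity

assert trunc(date(2015, 3, 17), "MM") == date(2015, 3, 1)
assert trunc(date(2015, 3, 17), "YYYY") == date(2015, 1, 1)
{code}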






[jira] [Commented] (SPARK-8176) date/time function: to_date

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648350#comment-14648350
 ] 

Apache Spark commented on SPARK-8176:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/7805

> date/time function: to_date
> ---
>
> Key: SPARK-8176
> URL: https://issues.apache.org/jira/browse/SPARK-8176
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Adrian Wang
>
> parse a timestamp string and return the date portion
> {code}
> to_date(string timestamp): date
> {code}
> Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = 
> "1970-01-01" (in some date format)






[jira] [Updated] (SPARK-6684) Add checkpointing to GradientBoostedTrees

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6684:
-
Shepherd: Xiangrui Meng

> Add checkpointing to GradientBoostedTrees
> -
>
> Key: SPARK-6684
> URL: https://issues.apache.org/jira/browse/SPARK-6684
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> We should add checkpointing to GradientBoostedTrees since it maintains RDDs 
> with long lineages.
> keywords: gradient boosting, gbt, gradient boosted trees
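For background, the general pattern being proposed -- periodically checkpointing an iteratively updated RDD so its lineage does not grow without bound -- looks roughly like the PySpark sketch below. This is a generic illustration, not the GradientBoostedTrees code; the checkpoint directory and the interval of 10 are arbitrary assumptions.

{code:python}
from pyspark import SparkContext

sc = SparkContext("local[2]", "checkpoint-demo")
sc.setCheckpointDir("/tmp/spark-checkpoints")  # assumed writable path

rdd = sc.parallelize(range(1000))
for i in range(1, 51):                  # stand-in for boosting iterations
    rdd = rdd.map(lambda x: x + 1)      # each iteration lengthens the lineage
    if i % 10 == 0:                     # periodically truncate the lineage
        rdd.cache()
        rdd.checkpoint()
        rdd.count()                     # action forces the checkpoint to happen
sc.stop()
{code}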






[jira] [Updated] (SPARK-9408) Refactor mllib/linalg.py to mllib/linalg

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9408:
-
Shepherd: Davies Liu  (was: Xiangrui Meng)

> Refactor mllib/linalg.py to mllib/linalg
> 
>
> Key: SPARK-9408
> URL: https://issues.apache.org/jira/browse/SPARK-9408
> Project: Spark
>  Issue Type: Task
>  Components: MLlib, PySpark
>Reporter: Manoj Kumar
>Assignee: Manoj Kumar
>
> We need to refactor mllib/linalg.py to mllib/linalg so that the project 
> structure is similar to that of Scala.






[jira] [Updated] (SPARK-4823) rowSimilarities

2015-07-30 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-4823:

Attachment: SparkMeetup2015-Experiments2.pdf
SparkMeetup2015-Experiments1.pdf

> rowSimilarities
> ---
>
> Key: SPARK-4823
> URL: https://issues.apache.org/jira/browse/SPARK-4823
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Reza Zadeh
> Attachments: MovieLensSimilarity Comparisons.pdf, 
> SparkMeetup2015-Experiments1.pdf, SparkMeetup2015-Experiments2.pdf
>
>
> RowMatrix has a columnSimilarities method to find cosine similarities between 
> columns.
> A rowSimilarities method would be useful to find similarities between rows.
> This JIRA is to investigate which algorithms are suitable for such a 
> method, better than brute-forcing it. Note that when there are many rows (> 
> 10^6), it is unlikely that brute-force will be feasible, since the output 
> will be of order 10^12.
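To see the scale problem concretely, here is a local numpy sketch of brute-force all-pairs cosine similarity between rows; the output has one entry per pair of rows, i.e. O(n^2) entries (~10^12 when n is around 10^6), which is exactly what a smarter algorithm needs to avoid. Illustrative only, not the proposed method:

{code:python}
import numpy as np

n, d = 5, 3
A = np.random.rand(n, d)                       # n rows, d columns
unit = A / np.linalg.norm(A, axis=1)[:, None]  # normalize each row
sims = unit.dot(unit.T)                        # n x n: one entry per row pair
print(sims.shape)                              # (5, 5) -- grows quadratically in n
{code}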






[jira] [Commented] (SPARK-4823) rowSimilarities

2015-07-30 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648340#comment-14648340
 ] 

Debasish Das commented on SPARK-4823:
-

We did more detailed experiment for July 2015 Spark Meetup to understand the 
shuffle effects on runtime. I attached the data for experiments in the JIRA. I 
will update the PR as discussed with Reza. I am targeting 1 PR for Spark 1.5.


> rowSimilarities
> ---
>
> Key: SPARK-4823
> URL: https://issues.apache.org/jira/browse/SPARK-4823
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Reza Zadeh
> Attachments: MovieLensSimilarity Comparisons.pdf
>
>
> RowMatrix has a columnSimilarities method to find cosine similarities between 
> columns.
> A rowSimilarities method would be useful to find similarities between rows.
> This JIRA is to investigate which algorithms are suitable for such a 
> method, better than brute-forcing it. Note that when there are many rows (> 
> 10^6), it is unlikely that brute-force will be feasible, since the output 
> will be of order 10^12.






[jira] [Updated] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9487:
-
Target Version/s: 1.5.0

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLlib, and other 
> components. If an operation depends on partition IDs, e.g., a random number 
> generator, this will lead to different results in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.






[jira] [Created] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2015-07-30 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-9487:


 Summary: Use the same num. worker threads in Scala/Python unit 
tests
 Key: SPARK-9487
 URL: https://issues.apache.org/jira/browse/SPARK-9487
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core, SQL, Tests
Affects Versions: 1.5.0
Reporter: Xiangrui Meng


In Python we use `local[4]` for unit tests, while in Scala/Java we use 
`local[2]` and `local` for some unit tests in SQL, MLlib, and other components. 
If an operation depends on partition IDs, e.g., a random number generator, this 
will lead to different results in Python and Scala/Java. It would be nice to use 
the same number in all unit tests.
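A small PySpark sketch of the effect being described: the master string fixes the number of worker threads, which sets the default number of partitions and therefore the partition IDs that partition-dependent operations see (the app name is arbitrary):

{code:python}
from pyspark import SparkContext

sc = SparkContext("local[4]", "partition-id-demo")   # PySpark tests use local[4]
rdd = sc.parallelize(range(8))
print(rdd.getNumPartitions())                        # 4 here; 2 under local[2]
# Anything keyed off the partition index (e.g. per-partition random seeds)
# changes when the partition count changes:
print(rdd.mapPartitionsWithIndex(lambda i, it: [(i, sum(it))]).collect())
sc.stop()
{code}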






[jira] [Assigned] (SPARK-6684) Add checkpointing to GradientBoostedTrees

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6684:
---

Assignee: Joseph K. Bradley  (was: Apache Spark)

> Add checkpointing to GradientBoostedTrees
> -
>
> Key: SPARK-6684
> URL: https://issues.apache.org/jira/browse/SPARK-6684
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> We should add checkpointing to GradientBoostedTrees since it maintains RDDs 
> with long lineages.
> keywords: gradient boosting, gbt, gradient boosted trees






[jira] [Commented] (SPARK-6684) Add checkpointing to GradientBoostedTrees

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648310#comment-14648310
 ] 

Apache Spark commented on SPARK-6684:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/7804

> Add checkpointing to GradientBoostedTrees
> -
>
> Key: SPARK-6684
> URL: https://issues.apache.org/jira/browse/SPARK-6684
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> We should add checkpointing to GradientBoostedTrees since it maintains RDDs 
> with long lineages.
> keywords: gradient boosting, gbt, gradient boosted trees






[jira] [Assigned] (SPARK-6684) Add checkpointing to GradientBoostedTrees

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6684:
---

Assignee: Apache Spark  (was: Joseph K. Bradley)

> Add checkpointing to GradientBoostedTrees
> -
>
> Key: SPARK-6684
> URL: https://issues.apache.org/jira/browse/SPARK-6684
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> We should add checkpointing to GradientBoostedTrees since it maintains RDDs 
> with long lineages.
> keywords: gradient boosting, gbt, gradient boosted trees






[jira] [Commented] (SPARK-9469) TungstenSort should not do safe -> unsafe conversion itself

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648309#comment-14648309
 ] 

Apache Spark commented on SPARK-9469:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7803

> TungstenSort should not do safe -> unsafe conversion itself
> ---
>
> Key: SPARK-9469
> URL: https://issues.apache.org/jira/browse/SPARK-9469
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>
> TungstenSort itself assumes input rows are safe rows, and uses a projection 
> to turn the safe rows into UnsafeRows. We should take that part of the logic 
> out of TungstenSort, and let the planner take care of the conversion. In that 
> case, if the input is UnsafeRow already, no conversion is needed.






[jira] [Updated] (SPARK-9454) LDASuite should use vector comparisons

2015-07-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9454:
-
Assignee: Feynman Liang

> LDASuite should use vector comparisons
> --
>
> Key: SPARK-9454
> URL: https://issues.apache.org/jira/browse/SPARK-9454
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Feynman Liang
>Assignee: Feynman Liang
>Priority: Minor
> Fix For: 1.5.0
>
>
> {{LDASuite}}'s "OnlineLDAOptimizer one iteration" currently compares 
> correctness using hacky string comparisons. We should compare the vectors 
> instead.






[jira] [Updated] (SPARK-9454) LDASuite should use vector comparisons

2015-07-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9454:
-
Shepherd: Joseph K. Bradley

> LDASuite should use vector comparisons
> --
>
> Key: SPARK-9454
> URL: https://issues.apache.org/jira/browse/SPARK-9454
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Feynman Liang
>Assignee: Feynman Liang
>Priority: Minor
> Fix For: 1.5.0
>
>
> {{LDASuite}}'s "OnlineLDAOptimizer one iteration" currently compares 
> correctness using hacky string comparisons. We should compare the vectors 
> instead.






[jira] [Resolved] (SPARK-9454) LDASuite should use vector comparisons

2015-07-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-9454.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7775
[https://github.com/apache/spark/pull/7775]

> LDASuite should use vector comparisons
> --
>
> Key: SPARK-9454
> URL: https://issues.apache.org/jira/browse/SPARK-9454
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Feynman Liang
>Assignee: Feynman Liang
>Priority: Minor
> Fix For: 1.5.0
>
>
> {{LDASuite}}'s "OnlineLDAOptimizer one iteration" currently compares 
> correctness using hacky string comparisons. We should compare the vectors 
> instead.






[jira] [Commented] (SPARK-9458) Avoid object allocation in prefix generation

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648308#comment-14648308
 ] 

Apache Spark commented on SPARK-9458:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7803

> Avoid object allocation in prefix generation
> 
>
> Key: SPARK-9458
> URL: https://issues.apache.org/jira/browse/SPARK-9458
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.5.0
>
>
> In our existing sort prefix generation code, we use expression's eval method 
> to generate the prefix, which results in object allocation for every prefix.
> We can use the specialized getters available on InternalRow directly to avoid 
> the object allocation.






[jira] [Assigned] (SPARK-9486) Add aliasing to data sources to allow external packages to register themselves with Spark

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9486:
---

Assignee: Apache Spark

> Add aliasing to data sources to allow external packages to register 
> themselves with Spark
> -
>
> Key: SPARK-9486
> URL: https://issues.apache.org/jira/browse/SPARK-9486
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Joseph Batchik
>Assignee: Apache Spark
>Priority: Minor
>
> Currently Spark allows users to use external data sources like spark-avro, 
> spark-csv, etc. by having them specify the full class name:
> {code:java}
> sqlContext.read.format("com.databricks.spark.avro").load(path)
> {code}
> Typing in a full class name is not ideal, so it would be nice to let external 
> packages register themselves with Spark, allowing users to do something like:
> {code:java}
> sqlContext.read.format("avro").load(path)
> {code}
> This would make the external data source packages follow the same convention 
> as the built-in data sources (parquet, json, jdbc, etc.).
> This could be accomplished by using a ServiceLoader.
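The registration step itself would use Java's ServiceLoader, as noted above. Purely to illustrate the alias-resolution idea (a hypothetical sketch, not the Spark mechanism), the lookup amounts to a short-name table with a fall-through to the fully qualified class name:

{code:python}
# Hypothetical alias table, populated by whatever external packages register.
DATA_SOURCE_ALIASES = {
    "avro": "com.databricks.spark.avro",
    "csv": "com.databricks.spark.csv",
}

def resolve_format(name):
    # Unknown names fall through unchanged, so spelling out the full class
    # name keeps working exactly as it does today.
    return DATA_SOURCE_ALIASES.get(name, name)

assert resolve_format("avro") == "com.databricks.spark.avro"
assert resolve_format("org.example.MySource") == "org.example.MySource"
{code}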






[jira] [Assigned] (SPARK-9486) Add aliasing to data sources to allow external packages to register themselves with Spark

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9486:
---

Assignee: (was: Apache Spark)

> Add aliasing to data sources to allow external packages to register 
> themselves with Spark
> -
>
> Key: SPARK-9486
> URL: https://issues.apache.org/jira/browse/SPARK-9486
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Joseph Batchik
>Priority: Minor
>
> Currently Spark allows users to use external data sources like spark-avro, 
> spark-csv, etc. by having them specify the full class name:
> {code:java}
> sqlContext.read.format("com.databricks.spark.avro").load(path)
> {code}
> Typing in a full class name is not ideal, so it would be nice to let external 
> packages register themselves with Spark, allowing users to do something like:
> {code:java}
> sqlContext.read.format("avro").load(path)
> {code}
> This would make the external data source packages follow the same convention 
> as the built-in data sources (parquet, json, jdbc, etc.).
> This could be accomplished by using a ServiceLoader.






[jira] [Commented] (SPARK-9486) Add aliasing to data sources to allow external packages to register themselves with Spark

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648287#comment-14648287
 ] 

Apache Spark commented on SPARK-9486:
-

User 'JDrit' has created a pull request for this issue:
https://github.com/apache/spark/pull/7802

> Add aliasing to data sources to allow external packages to register 
> themselves with Spark
> -
>
> Key: SPARK-9486
> URL: https://issues.apache.org/jira/browse/SPARK-9486
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Joseph Batchik
>Priority: Minor
>
> Currently Spark allows users to use external data sources like spark-avro, 
> spark-csv, etc. by having them specify the full class name:
> {code:java}
> sqlContext.read.format("com.databricks.spark.avro").load(path)
> {code}
> Typing in a full class name is not ideal, so it would be nice to let external 
> packages register themselves with Spark, allowing users to do something like:
> {code:java}
> sqlContext.read.format("avro").load(path)
> {code}
> This would make the external data source packages follow the same convention 
> as the built-in data sources (parquet, json, jdbc, etc.).
> This could be accomplished by using a ServiceLoader.






[jira] [Comment Edited] (SPARK-967) start-slaves.sh uses local path from master on remote slave nodes

2015-07-30 Thread David Chin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648234#comment-14648234
 ] 

David Chin edited comment on SPARK-967 at 7/30/15 8:34 PM:
---

I won't create a pull request unless asked to, but I have a solution for this. 
I am running Spark in standalone mode within a Univa Grid Engine cluster. As 
such, configs and logs, etc should be specific to each UGE job, identified by 
an integer job ID. 

Currently, any environment variables on the master are not passed along by the 
sbin/start-slaves.sh invocation of ssh. I put in a fix on my local version, 
which works.  However, this is still less than ideal in that UGE's job 
accounting cannot keep track of resource usage by jobs not under its process 
tree. Not sure, yet, what the correct solution is. I thought I saw a feature 
request to allow other remote shell programs besides ssh, but I can't find it 
now.

Please see my version of sbin/start-slaves.sh here, forked from current master: 
https://github.com/prehensilecode/spark/blob/master/sbin/start-slaves.sh


was (Author: prehensilecode):
I won't create a pull request unless asked to, but I have a solution for this. 
I am running Spark in standalone mode within a Univa Grid Engine cluster. As 
such, configs and logs, etc should be specific to each UGE job, identified by 
an integer job ID. 

Currently, any environment variables on the master are not passed along by the 
sbin/start-slaves.sh invocation of ssh. I put in a fix on my local version, 
which works.  However, this is still less than ideal in that UGE's job 
accounting cannot keep track of resource usage by jobs not under its process 
tree. Not sure, yet, what the correct solution is. I thought I saw a feature 
request to allow other remote shell programs besides ssh, but I can't find it 
now.

Please see my version of sbin/start-slaves.sh here: 
https://github.com/prehensilecode/spark/blob/master/sbin/start-slaves.sh

> start-slaves.sh uses local path from master on remote slave nodes
> -
>
> Key: SPARK-967
> URL: https://issues.apache.org/jira/browse/SPARK-967
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 0.8.0, 0.8.1, 0.9.0
>Reporter: Evgeniy Tsvigun
>Priority: Trivial
>  Labels: script, starter
>
> If a slave node has a home path different from the master's, start-slave.sh 
> fails to start a worker instance; for the other nodes it behaves as expected. 
> In my case: 
> $ ./bin/start-slaves.sh 
> node05.dev.vega.ru: bash: line 0: cd: /usr/home/etsvigun/spark/bin/..: No 
> such file or directory
> node04.dev.vega.ru: org.apache.spark.deploy.worker.Worker running as 
> process 4796. Stop it first.
> node03.dev.vega.ru: org.apache.spark.deploy.worker.Worker running as 
> process 61348. Stop it first.
> I don't mention /usr/home anywhere; the only environment variable I set is 
> $SPARK_HOME, relative to $HOME on every node, which makes me think some 
> script takes `pwd` on the master and tries to use it on the slaves. 
> Spark version: fb6875dd5c9334802580155464cef9ac4d4cc1f0
> OS:  FreeBSD 8.4






[jira] [Updated] (SPARK-9481) LocalLDAModel logLikelihood

2015-07-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9481:
-
Shepherd: Joseph K. Bradley
Assignee: Feynman Liang

> LocalLDAModel logLikelihood
> ---
>
> Key: SPARK-9481
> URL: https://issues.apache.org/jira/browse/SPARK-9481
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Feynman Liang
>Assignee: Feynman Liang
>Priority: Trivial
>
> We already have a variational {{bound}} method so we should provide a public 
> {{logLikelihood}} that uses the model's parameters






[jira] [Updated] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client

2015-07-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9485:
-
Shepherd:   (was: MEN CHAMROEUN)
Target Version/s:   (was: 1.4.1)
 Environment: (was: DEV)

Please review 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark -- 
this JIRA had some fields set that should not have been.

I don't think that helps, since it's just a list of your local configs, specific 
to your environment. Obviously, in general yarn-client mode does not yield a 
failure on startup, so this isn't quite helpful in understanding the failure. It 
seems specific to your environment.

> Failed to connect to yarn / spark-submit --master yarn-client
> -
>
> Key: SPARK-9485
> URL: https://issues.apache.org/jira/browse/SPARK-9485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, YARN
>Affects Versions: 1.4.1
>Reporter: Philip Adetiloye
>Priority: Minor
>
> spark-submit throws an exception when connecting to YARN, but it works when 
> used in standalone mode.
> I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source, but 
> got the same exception below.
> spark-submit --master yarn-client
> Here is a stack trace of the exception:
> 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler
> 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all 
> executors
> Exception in thread "Yarn application state monitor" 
> org.apache.spark.SparkException: Error asking standalone schedule
> r to shut down executors
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBacken
> d.scala:261)
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:2
> 66)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
> at 
> org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:
> 139)
> Caused by: java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java
> :1326)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal
> a:945)
> at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> java.lang.NullPointerException
> at org.apache.spark.sql.SQLContext.(SQLContext.scala:193)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033)
> at $iwC$$iwC.(:9)
> at $iwC.(:

[jira] [Resolved] (SPARK-8186) date/time function: date_add

2015-07-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8186.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7754
[https://github.com/apache/spark/pull/7754]

> date/time function: date_add
> 
>
> Key: SPARK-8186
> URL: https://issues.apache.org/jira/browse/SPARK-8186
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Adrian Wang
> Fix For: 1.5.0
>
>
> date_add(timestamp startdate, int days): timestamp
> date_add(timestamp startdate, interval i): timestamp
> date_add(date date, int days): date
> date_add(date date, interval i): date






[jira] [Resolved] (SPARK-9133) Add and Subtract should support date/timestamp and interval type

2015-07-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-9133.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7754
[https://github.com/apache/spark/pull/7754]

> Add and Subtract should support date/timestamp and interval type
> 
>
> Key: SPARK-9133
> URL: https://issues.apache.org/jira/browse/SPARK-9133
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Davies Liu
> Fix For: 1.5.0
>
>
> Should support
> date + interval
> interval + date
> timestamp + interval
> interval + timestamp
> The best way to support this is probably to resolve this to a date 
> add/subtract expression, rather than making add/subtract support these types.






[jira] [Resolved] (SPARK-8194) date/time function: add_months

2015-07-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8194.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7754
[https://github.com/apache/spark/pull/7754]

> date/time function: add_months
> --
>
> Key: SPARK-8194
> URL: https://issues.apache.org/jira/browse/SPARK-8194
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
> Fix For: 1.5.0
>
>
> add_months(string start_date, int num_months): string
> add_months(date start_date, int num_months): date
> Returns the date that is num_months after start_date. The time part of 
> start_date is ignored. If start_date is the last day of the month or if the 
> resulting month has fewer days than the day component of start_date, then the 
> result is the last day of the resulting month. Otherwise, the result has the 
> same day component as start_date.
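A plain-Python sketch of the end-of-month rule described above (illustrative only, not the Spark implementation):

{code:python}
import calendar
from datetime import date

def add_months(start, num_months):
    month_index = start.month - 1 + num_months
    year, month = start.year + month_index // 12, month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    start_last = calendar.monthrange(start.year, start.month)[1]
    # Clamp to the last day of the target month when start is already the last
    # day of its month or when the day component would overflow.
    day = last_day if start.day == start_last else min(start.day, last_day)
    return date(year, month, day)

assert add_months(date(2015, 1, 31), 1) == date(2015, 2, 28)
assert add_months(date(2015, 2, 28), 1) == date(2015, 3, 31)
{code}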






[jira] [Resolved] (SPARK-9290) DateExpressionsSuite is slow to run

2015-07-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-9290.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7754
[https://github.com/apache/spark/pull/7754]

> DateExpressionsSuite is slow to run
> ---
>
> Key: SPARK-9290
> URL: https://issues.apache.org/jira/browse/SPARK-9290
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
> Fix For: 1.5.0
>
>
> We are running way too many test cases in here.
> {code}
> [info] - DayOfYear (16 seconds, 998 milliseconds)
> {code}






[jira] [Resolved] (SPARK-8187) date/time function: date_sub

2015-07-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8187.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7754
[https://github.com/apache/spark/pull/7754]

> date/time function: date_sub
> 
>
> Key: SPARK-8187
> URL: https://issues.apache.org/jira/browse/SPARK-8187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Adrian Wang
> Fix For: 1.5.0
>
>
> date_sub(timestamp startdate, int days): timestamp
> date_sub(timestamp startdate, interval i): timestamp
> date_sub(date date, int days): date
> date_sub(date date, interval i): date






[jira] [Resolved] (SPARK-8198) date/time function: months_between

2015-07-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8198.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7754
[https://github.com/apache/spark/pull/7754]

> date/time function: months_between
> --
>
> Key: SPARK-8198
> URL: https://issues.apache.org/jira/browse/SPARK-8198
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
> Fix For: 1.5.0
>
>
> months_between(date1, date2): double
> Returns the number of months between dates date1 and date2 (as of Hive 1.2.0). 
> If date1 is later than date2, then the result is positive. If date1 is earlier 
> than date2, then the result is negative. If date1 and date2 are either the 
> same day of the month or both the last days of their months, then the result 
> is always an integer. Otherwise the UDF calculates the fractional portion of 
> the result based on a 31-day month and considers the difference in the time 
> components of date1 and date2. date1 and date2 can be of type date, timestamp, 
> or string in the format 'yyyy-MM-dd' or 'yyyy-MM-dd HH:mm:ss'. The result is 
> rounded to 8 
> decimal places. Example: months_between('1997-02-28 10:30:00', '1996-10-30') 
> = 3.94959677
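The fractional rule can be checked with a short plain-Python sketch (illustrative only, not the Spark implementation); it reproduces the example value above:

{code:python}
import calendar
from datetime import datetime

def months_between(d1, d2):
    whole = (d1.year - d2.year) * 12 + (d1.month - d2.month)
    last1 = d1.day == calendar.monthrange(d1.year, d1.month)[1]
    last2 = d2.day == calendar.monthrange(d2.year, d2.month)[1]
    if d1.day == d2.day or (last1 and last2):
        return float(whole)  # same day of month, or both month-ends
    # Fractional part: day-of-month plus time of day, over a 31-day month.
    frac1 = d1.day + (d1.hour * 3600 + d1.minute * 60 + d1.second) / 86400.0
    frac2 = d2.day + (d2.hour * 3600 + d2.minute * 60 + d2.second) / 86400.0
    return round(whole + (frac1 - frac2) / 31.0, 8)

assert abs(months_between(datetime(1997, 2, 28, 10, 30), datetime(1996, 10, 30))
           - 3.94959677) < 1e-8
{code}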






[jira] [Updated] (SPARK-5567) Add prediction methods to LDA

2015-07-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5567:
-
Assignee: Feynman Liang

> Add prediction methods to LDA
> -
>
> Key: SPARK-5567
> URL: https://issues.apache.org/jira/browse/SPARK-5567
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Feynman Liang
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> LDA currently supports prediction on the training set.  E.g., you can call 
> logLikelihood and topicDistributions to get that info for the training data.  
> However, it should support the same functionality for new (test) documents.
> This will require inference but should be able to use the same code, with a 
> few modifications to keep the inferred topics fixed.
> Note: The API for these methods is already in the code but is commented out.






[jira] [Comment Edited] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client

2015-07-30 Thread Philip Adetiloye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648233#comment-14648233
 ] 

Philip Adetiloye edited comment on SPARK-9485 at 7/30/15 8:17 PM:
--

[~srowen] Thanks for the quick reply. It's actually consistent (every time), and 
here are the details of my configuration.

conf/spark-env.sh basically has these settings:

#!/usr/bin/env bash
HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"
SPARK_YARN_QUEUE="dev"

and my conf/slaves
10.0.0.204
10.0.0.205

~/.profile contains my settings here:


export JAVA_HOME=$(readlink -f  /usr/share/jdk1.8.0_45/bin/java | sed 
"s:bin/java::")
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_CONF_DIR=$HADOOP_INSTALL

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_OPTS="$HADOOP_OPTS 
-Djava.library.path=/usr/local/hadoop/lib/native"

export PATH=$PATH:/usr/local/spark/sbin
export PATH=$PATH:/usr/local/spark/bin
export 
LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/

export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH


Hope this helps.

Thanks,
 Phil


was (Author: pkadetiloye):
[~srowen] Thanks for the quick reply. It actually consistent (everytime) and 
here is the details of my configuration.

conf/spark-env.sh basically has this settings:

#!/usr/bin/env bash
HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"
SPARK_YARN_QUEUE="dev"

and my conf/slaves
10.0.0.204
10.0.0.205

~/.profile contains my settings here:


export JAVA_HOME=$(readlink -f  /usr/share/jdk1.8.0_45/bin/java | sed 
"s:bin/java::")
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_CONF_DIR=$HADOOP_INSTALL

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_OPTS="$HADOOP_OPTS 
-Djava.library.path=/usr/local/hadoop/lib/native"

export PATH=$PATH:/usr/local/spark/sbin
export PATH=$PATH:/usr/local/spark/bin
export 
LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/

export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH


Hope this helps.

Thanks,
- Phil

> Failed to connect to yarn / spark-submit --master yarn-client
> -
>
> Key: SPARK-9485
> URL: https://issues.apache.org/jira/browse/SPARK-9485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, YARN
>Affects Versions: 1.4.1
> Environment: DEV
>Reporter: Philip Adetiloye
>Priority: Minor
>
> spark-submit throws an exception when connecting to YARN, but it works when 
> used in standalone mode.
> I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source, but 
> got the same exception below.
> spark-submit --master yarn-client
> Here is a stack trace of the exception:
> 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler
> 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all 
> executors
> Exception in thread "Yarn application state monitor" 
> org.apache.spark.SparkException: Error asking standalone schedule
> r to shut down executors
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBacken
> d.scala:261)
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:2
> 66)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
> at 
> org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:
> 139)
> Caused by: java.lang

[jira] [Resolved] (SPARK-5567) Add prediction methods to LDA

2015-07-30 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-5567.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7760
[https://github.com/apache/spark/pull/7760]

> Add prediction methods to LDA
> -
>
> Key: SPARK-5567
> URL: https://issues.apache.org/jira/browse/SPARK-5567
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Feynman Liang
> Fix For: 1.5.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> LDA currently supports prediction on the training set.  E.g., you can call 
> logLikelihood and topicDistributions to get that info for the training data.  
> However, it should support the same functionality for new (test) documents.
> This will require inference but should be able to use the same code, with a 
> few modification to keep the inferred topics fixed.
> Note: The API for these methods is already in the code but is commented out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client

2015-07-30 Thread Philip Adetiloye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648233#comment-14648233
 ] 

Philip Adetiloye edited comment on SPARK-9485 at 7/30/15 8:16 PM:
--

[~srowen] Thanks for the quick reply. It's actually consistent (every time), 
and here are the details of my configuration.

conf/spark-env.sh basically has these settings:

#!/usr/bin/env bash
HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"
SPARK_YARN_QUEUE="dev"

and my conf/slaves
10.0.0.204
10.0.0.205

~/.profile contains my settings here:


export JAVA_HOME=$(readlink -f  /usr/share/jdk1.8.0_45/bin/java | sed 
"s:bin/java::")
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_CONF_DIR=$HADOOP_INSTALL

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_OPTS="$HADOOP_OPTS 
-Djava.library.path=/usr/local/hadoop/lib/native"

export PATH=$PATH:/usr/local/spark/sbin
export PATH=$PATH:/usr/local/spark/bin
export 
LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/

export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH


Hope this helps.

Thanks,
- Phil


was (Author: pkadetiloye):
[~srowen] Thanks for the quick reply. It's actually consistent (every time), 
and here are the details of my configuration.

conf/spark-env.sh basically has these settings:

#!/usr/bin/env bash
HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"
SPARK_YARN_QUEUE="dev"

and my conf/slaves
10.0.0.204
10.0.0.205

~/.profile contains my settings here:

`
export JAVA_HOME=$(readlink -f  /usr/share/jdk1.8.0_45/bin/java | sed 
"s:bin/java::")
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_CONF_DIR=$HADOOP_INSTALL

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_OPTS="$HADOOP_OPTS 
-Djava.library.path=/usr/local/hadoop/lib/native"

export PATH=$PATH:/usr/local/spark/sbin
export PATH=$PATH:/usr/local/spark/bin
export 
LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/

export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH

`
Hope this helps.

Thanks,
- Phil

> Failed to connect to yarn / spark-submit --master yarn-client
> -
>
> Key: SPARK-9485
> URL: https://issues.apache.org/jira/browse/SPARK-9485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, YARN
>Affects Versions: 1.4.1
> Environment: DEV
>Reporter: Philip Adetiloye
>Priority: Minor
>
> Spark-submit throws an exception when connecting to yarn but it works when  
> used in standalone mode.
> I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but 
> got the same exception below.
> spark-submit --master yarn-client
> Here is a stack trace of the exception:
> 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler
> 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all 
> executors
> Exception in thread "Yarn application state monitor" 
> org.apache.spark.SparkException: Error asking standalone schedule
> r to shut down executors
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBacken
> d.scala:261)
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:2
> 66)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
> at 
> org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:
> 139)
> Caused by: java.lang.Inte

[jira] [Comment Edited] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client

2015-07-30 Thread Philip Adetiloye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648233#comment-14648233
 ] 

Philip Adetiloye edited comment on SPARK-9485 at 7/30/15 8:16 PM:
--

[~srowen] Thanks for the quick reply. It's actually consistent (every time), 
and here are the details of my configuration.

conf/spark-env.sh basically has these settings:

#!/usr/bin/env bash
HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"
SPARK_YARN_QUEUE="dev"

and my conf/slaves
10.0.0.204
10.0.0.205

~/.profile contains my settings here:

`
export JAVA_HOME=$(readlink -f  /usr/share/jdk1.8.0_45/bin/java | sed 
"s:bin/java::")
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_CONF_DIR=$HADOOP_INSTALL

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_OPTS="$HADOOP_OPTS 
-Djava.library.path=/usr/local/hadoop/lib/native"

export PATH=$PATH:/usr/local/spark/sbin
export PATH=$PATH:/usr/local/spark/bin
export 
LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/

export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH

`
Hope this helps.

Thanks,
- Phil


was (Author: pkadetiloye):
[~srowen] Thanks for the quick reply. It's actually consistent (every time), 
and here are the details of my configuration.

conf/spark-env.sh basically has these settings:

#!/usr/bin/env bash
HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"
SPARK_YARN_QUEUE="dev"

and my conf/slaves
10.0.0.204
10.0.0.205

~/.profile contains my settings here:

export JAVA_HOME=$(readlink -f  /usr/share/jdk1.8.0_45/bin/java | sed 
"s:bin/java::")
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_CONF_DIR=$HADOOP_INSTALL

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_OPTS="$HADOOP_OPTS 
-Djava.library.path=/usr/local/hadoop/lib/native"

export PATH=$PATH:/usr/local/spark/sbin
export PATH=$PATH:/usr/local/spark/bin
export 
LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/

export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH


Hope this helps.

Thanks,
- Phil

> Failed to connect to yarn / spark-submit --master yarn-client
> -
>
> Key: SPARK-9485
> URL: https://issues.apache.org/jira/browse/SPARK-9485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, YARN
>Affects Versions: 1.4.1
> Environment: DEV
>Reporter: Philip Adetiloye
>Priority: Minor
>
> Spark-submit throws an exception when connecting to yarn but it works when  
> used in standalone mode.
> I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but 
> got the same exception below.
> spark-submit --master yarn-client
> Here is a stack trace of the exception:
> 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler
> 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all 
> executors
> Exception in thread "Yarn application state monitor" 
> org.apache.spark.SparkException: Error asking standalone schedule
> r to shut down executors
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBacken
> d.scala:261)
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:2
> 66)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
> at 
> org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:
> 139)
> Caused by: java.lang.Inter

[jira] [Commented] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client

2015-07-30 Thread Philip Adetiloye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648233#comment-14648233
 ] 

Philip Adetiloye commented on SPARK-9485:
-

[~srowen] Thanks for the quick reply. It's actually consistent (every time), 
and here are the details of my configuration.

conf/spark-env.sh basically has these settings:

#!/usr/bin/env bash
HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"
SPARK_YARN_QUEUE="dev"

and my conf/slaves
10.0.0.204
10.0.0.205

~/.profile contains my settings here:

export JAVA_HOME=$(readlink -f  /usr/share/jdk1.8.0_45/bin/java | sed 
"s:bin/java::")
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_CONF_DIR=$HADOOP_INSTALL

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_OPTS="$HADOOP_OPTS 
-Djava.library.path=/usr/local/hadoop/lib/native"

export PATH=$PATH:/usr/local/spark/sbin
export PATH=$PATH:/usr/local/spark/bin
export 
LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:/usr/local/hadoop/lib/native/

export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH


Hope this helps.

Thanks,
- Phil

> Failed to connect to yarn / spark-submit --master yarn-client
> -
>
> Key: SPARK-9485
> URL: https://issues.apache.org/jira/browse/SPARK-9485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, YARN
>Affects Versions: 1.4.1
> Environment: DEV
>Reporter: Philip Adetiloye
>Priority: Minor
>
> Spark-submit throws an exception when connecting to yarn but it works when  
> used in standalone mode.
> I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but 
> got the same exception below.
> spark-submit --master yarn-client
> Here is a stack trace of the exception:
> 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler
> 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all 
> executors
> Exception in thread "Yarn application state monitor" 
> org.apache.spark.SparkException: Error asking standalone schedule
> r to shut down executors
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBacken
> d.scala:261)
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:2
> 66)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
> at 
> org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:
> 139)
> Caused by: java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java
> :1326)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal
> a:945)
> at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke

[jira] [Commented] (SPARK-967) start-slaves.sh uses local path from master on remote slave nodes

2015-07-30 Thread David Chin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648234#comment-14648234
 ] 

David Chin commented on SPARK-967:
--

I won't create a pull request unless asked to, but I have a solution for this. 
I am running Spark in standalone mode within a Univa Grid Engine cluster. As 
such, configs, logs, etc. should be specific to each UGE job, identified by an 
integer job ID.

Currently, environment variables set on the master are not passed along by the 
sbin/start-slaves.sh invocation of ssh. I put a fix into my local version, 
which works. However, this is still less than ideal in that UGE's job 
accounting cannot keep track of resource usage by jobs not under its process 
tree. I'm not sure yet what the correct solution is. I thought I saw a feature 
request to allow other remote shell programs besides ssh, but I can't find it 
now.

Please see my version of sbin/start-slaves.sh here: 
https://github.com/prehensilecode/spark/blob/master/sbin/start-slaves.sh

> start-slaves.sh uses local path from master on remote slave nodes
> -
>
> Key: SPARK-967
> URL: https://issues.apache.org/jira/browse/SPARK-967
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 0.8.0, 0.8.1, 0.9.0
>Reporter: Evgeniy Tsvigun
>Priority: Trivial
>  Labels: script, starter
>
> If a slave node has a home path different from the master's, start-slave.sh 
> fails to start a worker instance; for the other nodes it behaves as expected. 
> In my case: 
> $ ./bin/start-slaves.sh 
> node05.dev.vega.ru: bash: line 0: cd: /usr/home/etsvigun/spark/bin/..: No 
> such file or directory
> node04.dev.vega.ru: org.apache.spark.deploy.worker.Worker running as 
> process 4796. Stop it first.
> node03.dev.vega.ru: org.apache.spark.deploy.worker.Worker running as 
> process 61348. Stop it first.
> I don't mention /usr/home anywhere; the only environment variable I set is 
> $SPARK_HOME, relative to $HOME on every node, which makes me think some 
> script takes `pwd` on the master and tries to use it on the slaves. 
> Spark version: fb6875dd5c9334802580155464cef9ac4d4cc1f0
> OS:  FreeBSD 8.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9486) Add aliasing to data sources to allow external packages to register themselves with Spark

2015-07-30 Thread Joseph Batchik (JIRA)
Joseph Batchik created SPARK-9486:
-

 Summary: Add aliasing to data sources to allow external packages 
to register themselves with Spark
 Key: SPARK-9486
 URL: https://issues.apache.org/jira/browse/SPARK-9486
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Joseph Batchik
Priority: Minor


Currently Spark allows users to use external data sources like spark-avro, 
spark-csv, etc. by having them specify the full class name:

{code:java}
sqlContext.read.format("com.databricks.spark.avro").load(path)
{code}

Typing in a full class name is not ideal, so it would be nice to allow 
external packages to register themselves with Spark so that users can do 
something like:

{code:java}
sqlContext.read.format("avro").load(path)
{code}

This would make external data source packages follow the same convention as 
the built-in data sources (parquet, json, jdbc, etc.).

This could be accomplished by using a ServiceLoader.
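
A rough sketch of how that lookup could work (the trait and method names here 
are illustrative, not an existing Spark API): external packages implement a 
small registration trait, list the implementation under META-INF/services, and 
the resolver falls back to treating the input as a fully qualified class name.

{code:scala}
import java.util.ServiceLoader

import scala.collection.JavaConverters._

// Hypothetical SPI trait that external packages would implement and list in
// META-INF/services/<fully.qualified.TraitName>.
trait DataSourceRegister {
  def shortName(): String          // e.g. "avro"
  def providerClassName(): String  // e.g. "com.databricks.spark.avro.DefaultSource"
}

object DataSourceResolver {
  // Resolve an alias such as "avro" to a provider class name; unknown names
  // are assumed to already be fully qualified class names.
  def resolve(name: String): String = {
    ServiceLoader.load(classOf[DataSourceRegister]).asScala
      .find(_.shortName().equalsIgnoreCase(name))
      .map(_.providerClassName())
      .getOrElse(name)
  }
}
{code}

With something like that in place, format("avro") and 
format("com.databricks.spark.avro") would resolve to the same provider.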



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client

2015-07-30 Thread Philip Adetiloye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Adetiloye updated SPARK-9485:

Shepherd: MEN CHAMROEUN

> Failed to connect to yarn / spark-submit --master yarn-client
> -
>
> Key: SPARK-9485
> URL: https://issues.apache.org/jira/browse/SPARK-9485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, YARN
>Affects Versions: 1.4.1
> Environment: DEV
>Reporter: Philip Adetiloye
>Priority: Minor
>
> Spark-submit throws an exception when connecting to yarn but it works when  
> used in standalone mode.
> I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but 
> got the same exception below.
> spark-submit --master yarn-client
> Here is a stack trace of the exception:
> 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler
> 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all 
> executors
> Exception in thread "Yarn application state monitor" 
> org.apache.spark.SparkException: Error asking standalone schedule
> r to shut down executors
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBacken
> d.scala:261)
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:2
> 66)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
> at 
> org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:
> 139)
> Caused by: java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java
> :1326)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal
> a:945)
> at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> java.lang.NullPointerException
> at org.apache.spark.sql.SQLContext.(SQLContext.scala:193)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033)
> at $iwC$$iwC.(:9)
> at $iwC.(:18)
> at (:20)
> at .(:24)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 

[jira] [Commented] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client

2015-07-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648200#comment-14648200
 ] 

Sean Owen commented on SPARK-9485:
--

I don't think this is sufficient for a JIRA bug report; there's no detail for 
reproducing it. It also just appears to be some other kind of error at startup 
causing initialization to fail. Can you start on user@ please? And if there 
isn't guidance there, provide a consistent reproduction?

> Failed to connect to yarn / spark-submit --master yarn-client
> -
>
> Key: SPARK-9485
> URL: https://issues.apache.org/jira/browse/SPARK-9485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, YARN
>Affects Versions: 1.4.1
> Environment: DEV
>Reporter: Philip Adetiloye
>Priority: Minor
>
> Spark-submit throws an exception when connecting to yarn but it works when  
> used in standalone mode.
> I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but 
> got the same exception below.
> spark-submit --master yarn-client
> Here is a stack trace of the exception:
> 15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler
> 15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all 
> executors
> Exception in thread "Yarn application state monitor" 
> org.apache.spark.SparkException: Error asking standalone schedule
> r to shut down executors
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBacken
> d.scala:261)
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:2
> 66)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
> at 
> org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:
> 139)
> Caused by: java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java
> :1326)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal
> a:945)
> at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> java.lang.NullPointerException
> at org.apache.spark.sql.SQLContext.(SQLContext.scala:193)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033)
> at $iwC$$iwC.(:9)
> at $iwC.(:18)
> at (:20)
> at .(:24)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.Native

[jira] [Updated] (SPARK-9485) Failed to connect to yarn / spark-submit --master yarn-client

2015-07-30 Thread Philip Adetiloye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Adetiloye updated SPARK-9485:

Description: 
Spark-submit throws an exception when connecting to yarn but it works when  
used in standalone mode.

I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but 
got the same exception below.

spark-submit --master yarn-client

Here is a stack trace of the exception:

15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler
15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

java.lang.NullPointerException
at org.apache.spark.sql.SQLContext.(SQLContext.scala:193)
at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033)
at $iwC$$iwC.(:9)
at $iwC.(:18)
at (:20)
at .(:24)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at 

[jira] [Created] (SPARK-9485) Failed to connect to yarn

2015-07-30 Thread Philip Adetiloye (JIRA)
Philip Adetiloye created SPARK-9485:
---

 Summary: Failed to connect to yarn
 Key: SPARK-9485
 URL: https://issues.apache.org/jira/browse/SPARK-9485
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Spark Submit, YARN
Affects Versions: 1.4.1
 Environment: DEV
Reporter: Philip Adetiloye
Priority: Minor


Spark-submit throws an exception when connecting to yarn but it works when  
used in standalone mode.

I'm using spark-1.4.1-bin-hadoop2.6 and also tried compiling from source but 
got the same exception below.

Here is a stack trace of the exception:

15/07/29 17:32:15 INFO scheduler.DAGScheduler: Stopping DAGScheduler
15/07/29 17:32:15 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scal
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

java.lang.NullPointerException
at org.apache.spark.sql.SQLContext.(SQLContext.scala:193)
at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1033)
at $iwC$$iwC.(:9)
at $iwC.(:18)
at (:20)
at .(:24)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInt

[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark

2015-07-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648192#comment-14648192
 ] 

Joseph K. Bradley commented on SPARK-6227:
--

That's great you're interested.  Please read this for lots of helpful info: 
[https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark]

I would download the original source code from the Apache Spark website and 
install it natively, without using the VM.  There are instructions for that in 
the Spark docs and READMEs.  To get started, I recommend finding some small 
JIRAs which have been resolved already and looking at the PRs which solved 
them.  Those will give you an idea of the code structure.  Good luck!
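
For reference, the existing Scala API that this task would expose in PySpark 
looks roughly like this (a sketch against the spark.mllib RowMatrix API, 
assuming a SparkContext {{sc}} is in scope):

{code:scala}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Build a small distributed matrix from dense rows.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 7.0),
  Vectors.dense(2.0, 3.0, 5.0),
  Vectors.dense(4.0, 1.0, 2.0)))
val mat = new RowMatrix(rows)

// Top-2 singular values/vectors and top-2 principal components.
val svd = mat.computeSVD(2, computeU = true)
val pc = mat.computePrincipalComponents(2)

// Project the rows onto the principal-component space.
val projected = mat.multiply(pc)
{code}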

> PCA and SVD for PySpark
> ---
>
> Key: SPARK-6227
> URL: https://issues.apache.org/jira/browse/SPARK-6227
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 1.2.1
>Reporter: Julien Amelot
>
> The Dimensionality Reduction techniques are not available via Python (Scala + 
> Java only).
> * Principal component analysis (PCA)
> * Singular value decomposition (SVD)
> Doc:
> http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9478) Add class weights to Random Forest

2015-07-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648184#comment-14648184
 ] 

Joseph K. Bradley commented on SPARK-9478:
--

This sounds valuable.  Handling it by reweighting examples (as is being done 
for logreg) seems like the simplest solution for now.  I'll keep an eye on the 
ticket!
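
Until such an API exists, one workaround is to approximate class weights by 
resampling the training data before calling the forest trainer. A rough sketch 
(not a Spark API; the weight map and helper name are illustrative):

{code:scala}
import scala.util.Random

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Emit floor(w) copies of each example plus one more with probability
// frac(w), where w is the weight of the example's class. Weights below 1.0
// subsample the over-represented class instead of replicating it.
def reweightBySampling(
    data: RDD[LabeledPoint],
    classWeights: Map[Double, Double],
    seed: Long = 17L): RDD[LabeledPoint] = {
  data.mapPartitionsWithIndex { (idx, iter) =>
    val rng = new Random(seed + idx)
    iter.flatMap { point =>
      val w = classWeights.getOrElse(point.label, 1.0)
      val base = math.floor(w).toInt
      val extra = if (rng.nextDouble() < w - base) 1 else 0
      Seq.fill(base + extra)(point)
    }
  }
}
{code}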

> Add class weights to Random Forest
> --
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class 
> weights. Class weights are important when there is imbalanced training data 
> or the evaluation metric of a classifier is imbalanced (e.g. true positive 
> rate at some false positive threshold). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7583) User guide update for RegexTokenizer

2015-07-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648182#comment-14648182
 ] 

Joseph K. Bradley commented on SPARK-7583:
--

Yes, please!  This can go in after the feature freeze.
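
For the guide, a short Scala example along these lines could work (a sketch 
assuming the current spark.ml {{RegexTokenizer}} parameters and a 
{{sqlContext}} in scope; adjust to whatever the final API looks like):

{code:scala}
import org.apache.spark.ml.feature.RegexTokenizer

val sentences = sqlContext.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (1, "Logistic,regression,models,are,neat")
)).toDF("id", "sentence")

// Split on non-word characters rather than the simple whitespace rule.
val tokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")
  .setPattern("\\W")

tokenizer.transform(sentences).select("words").show()
{code}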

> User guide update for RegexTokenizer
> 
>
> Key: SPARK-7583
> URL: https://issues.apache.org/jira/browse/SPARK-7583
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Copied from [SPARK-7443]:
> {quote}
> Now that we have algorithms in spark.ml which are not in spark.mllib, we 
> should start making subsections for the spark.ml API as needed. We can follow 
> the structure of the spark.mllib user guide.
> * The spark.ml user guide can provide: (a) code examples and (b) info on 
> algorithms which do not exist in spark.mllib.
> * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
> still the primary API, we should provide links to the corresponding 
> algorithms in the spark.mllib user guide for more info.
> {quote}
> Note: I created a new subsection for links to spark.ml-specific guides in 
> this JIRA's PR: [SPARK-7557]. This transformer can go within the new 
> subsection. I'll try to get that PR merged ASAP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec

2015-07-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648180#comment-14648180
 ] 

Joseph K. Bradley commented on SPARK-5692:
--

This was not, but thanks for the reminder; it would be nice to add. I'll make 
and link a JIRA for it.

> Model import/export for Word2Vec
> 
>
> Key: SPARK-5692
> URL: https://issues.apache.org/jira/browse/SPARK-5692
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Manoj Kumar
> Fix For: 1.4.0
>
>
> Support save and load for Word2VecModel. We may want to discuss whether we 
> want to be compatible with the original Word2Vec model storage format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9484) Word2Vec import/export for original binary format

2015-07-30 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-9484:


 Summary: Word2Vec import/export for original binary format
 Key: SPARK-9484
 URL: https://issues.apache.org/jira/browse/SPARK-9484
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor


It would be nice to add model import/export for Word2Vec which handles the 
original binary format used by [https://code.google.com/p/word2vec/]
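
For reference, that binary format is an ASCII header of the form 
"vocabSize vectorSize\n", followed for each word by the token, a space, and 
vectorSize little-endian 4-byte floats. A self-contained Scala sketch of a 
reader (not a proposed Spark API):

{code:scala}
import java.io.{BufferedInputStream, DataInputStream, FileInputStream}

// Minimal reader for word2vec's -binary 1 output. Assumes ASCII tokens; a
// real implementation would decode UTF-8 and stream rather than build a Map.
def readWord2VecBinary(path: String): Map[String, Array[Float]] = {
  val in = new DataInputStream(new BufferedInputStream(new FileInputStream(path)))
  try {
    // Read bytes until the stop character; trim drops the newline that
    // terminates the previous record.
    def readToken(stop: Int): String = {
      val sb = new StringBuilder
      var b = in.read()
      while (b != -1 && b != stop) { sb.append(b.toChar); b = in.read() }
      sb.toString.trim
    }
    val vocabSize = readToken(' ').toInt
    val vectorSize = readToken('\n').toInt
    (0 until vocabSize).map { _ =>
      val word = readToken(' ')
      val vector = Array.fill(vectorSize) {
        // DataInputStream reads big-endian, so reverse the bytes of each float.
        java.lang.Float.intBitsToFloat(Integer.reverseBytes(in.readInt()))
      }
      word -> vector
    }.toMap
  } finally {
    in.close()
  }
}
{code}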



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9483) UTF8String.getPrefix only works in little-endian order

2015-07-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9483:
---
Description: There are two bit-masking operations and a byte reversal that 
should probably be handled differently on big-endian platforms. 
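
A minimal sketch of the issue (illustrative only, not the actual UTF8String 
code): a prefix read as a native-order machine word only compares 
lexicographically, as an unsigned long, after a byte reversal on little-endian 
platforms, while big-endian platforms need no reversal.

{code:scala}
import java.nio.{ByteBuffer, ByteOrder}

// Compute an 8-byte sort prefix for a UTF8 byte array. The word is read in
// the platform's native byte order, so little-endian machines must reverse
// the bytes so that an unsigned comparison matches byte-wise order.
def sortPrefix(bytes: Array[Byte]): Long = {
  val padded = java.util.Arrays.copyOf(bytes, 8) // zero-pad strings shorter than 8 bytes
  val word = ByteBuffer.wrap(padded).order(ByteOrder.nativeOrder()).getLong
  if (ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN) {
    java.lang.Long.reverseBytes(word)
  } else {
    word
  }
}
{code}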

> UTF8String.getPrefix only works in little-endian order
> --
>
> Key: SPARK-9483
> URL: https://issues.apache.org/jira/browse/SPARK-9483
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> There are two bit-masking operations and a byte reversal that should 
> probably be handled differently on big-endian platforms. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6805) MLlib + SparkR integration for 1.5

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6805:
-
Description: 
--SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API 
in SparkR. The implementation should be similar to the pipeline API 
implementation in Python.--

We limited the scope of this JIRA to MLlib + SparkR integration for 1.5.

For Spark 1.5, we want to support linear/logistic regression in SparkR, with 
basic support for R formula and elastic-net regularization. The design doc can 
be viewed at 
https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing

  was:
~~SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API 
in SparkR. The implementation should be similar to the pipeline API 
implementation in Python.~~

We limited the scope of this JIRA to MLlib + SparkR integration for 1.5.

For Spark 1.5, we want to support linear/logistic regression in SparkR, with 
basic support for R formula and elastic-net regularization. The design doc can 
be viewed at 
https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing


> MLlib + SparkR integration for 1.5
> --
>
> Key: SPARK-6805
> URL: https://issues.apache.org/jira/browse/SPARK-6805
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Eric Liang
>Priority: Critical
>
> --SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API 
> in SparkR. The implementation should be similar to the pipeline API 
> implementation in Python.--
> We limited the scope of this JIRA to MLlib + SparkR integration for 1.5.
> For Spark 1.5, we want to support linear/logistic regression in SparkR, with 
> basic support for R formula and elastic-net regularization. The design doc 
> can be viewed at 
> https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6805) MLlib + SparkR integration for 1.5

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6805:
-
Description: 
~~SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API 
in SparkR. The implementation should be similar to the pipeline API 
implementation in Python.~~

We limited the scope of this JIRA to MLlib + SparkR integration for 1.5.

For Spark 1.5, we want to support linear/logistic regression in SparkR, with 
basic support for R formula and elastic-net regularization. The design doc can 
be viewed at 
https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing

  was:
SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API in 
SparkR. The implementation should be similar to the pipeline API implementation 
in Python.

For Spark 1.5, we want to support linear/logistic regression in SparkR, with 
basic support for R formula and elastic-net regularization. The design doc can 
be viewed at 
https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing


> MLlib + SparkR integration for 1.5
> --
>
> Key: SPARK-6805
> URL: https://issues.apache.org/jira/browse/SPARK-6805
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Eric Liang
>Priority: Critical
>
> ~~SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API 
> in SparkR. The implementation should be similar to the pipeline API 
> implementation in Python.~~
> We limited the scope of this JIRA to MLlib + SparkR integration for 1.5.
> For Spark 1.5, we want to support linear/logistic regression in SparkR, with 
> basic support for R formula and elastic-net regularization. The design doc 
> can be viewed at 
> https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6805) MLlib + SparkR integration for 1.5

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6805:
-
Assignee: Eric Liang

> MLlib + SparkR integration for 1.5
> --
>
> Key: SPARK-6805
> URL: https://issues.apache.org/jira/browse/SPARK-6805
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Eric Liang
>Priority: Critical
>
> SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API 
> in SparkR. The implementation should be similar to the pipeline API 
> implementation in Python.
> For Spark 1.5, we want to support linear/logistic regression in SparkR, with 
> basic support for R formula and elastic-net regularization. The design doc 
> can be viewed at 
> https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6805) MLlib + SparkR integration for 1.5

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6805:
-
Summary: MLlib + SparkR integration for 1.5  (was: ML Pipeline API in 
SparkR)

> MLlib + SparkR integration for 1.5
> --
>
> Key: SPARK-6805
> URL: https://issues.apache.org/jira/browse/SPARK-6805
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Priority: Critical
>
> SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API 
> in SparkR. The implementation should be similar to the pipeline API 
> implementation in Python.
> For Spark 1.5, we want to support linear/logistic regression in SparkR, with 
> basic support for R formula and elastic-net regularization. The design doc 
> can be viewed at 
> https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9463) Expose model coefficients with names in SparkR RFormula

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9463:
-
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-6805

> Expose model coefficients with names in SparkR RFormula
> ---
>
> Key: SPARK-9463
> URL: https://issues.apache.org/jira/browse/SPARK-9463
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Eric Liang
>Assignee: Eric Liang
>
> Currently you cannot retrieve model statistics from the R side, we should at 
> least allow showing the coefficients for 1.5
> Design doc from umbrella task: 
> https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9483) UTF8String.getPrefix only works in little-endian order

2015-07-30 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-9483:
--

 Summary: UTF8String.getPrefix only works in little-endian order
 Key: SPARK-9483
 URL: https://issues.apache.org/jira/browse/SPARK-9483
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9463) Expose model coefficients with names in SparkR RFormula

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9463:
-
Assignee: Eric Liang

> Expose model coefficients with names in SparkR RFormula
> ---
>
> Key: SPARK-9463
> URL: https://issues.apache.org/jira/browse/SPARK-9463
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Eric Liang
>Assignee: Eric Liang
>
> Currently you cannot retrieve model statistics from the R side, we should at 
> least allow showing the coefficients for 1.5
> Design doc from umbrella task: 
> https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9482) flaky test: org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.semijoin

2015-07-30 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-9482:


 Summary: flaky test: 
org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.semijoin
 Key: SPARK-9482
 URL: https://issues.apache.org/jira/browse/SPARK-9482
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Yin Huai
Priority: Critical


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39059/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/semijoin/

{code}
Regression

org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.semijoin

Failing for the past 1 build (Since Failed#39059 )
Took 7.7 sec.
Error Message

Results do not match for semijoin:

== Parsed Logical Plan ==
'Sort ['a.key ASC], false
 'Project [unresolvedalias('a.key)]
  'Join RightOuter, Some(('a.key = 'c.key))
   'Join LeftSemi, Some(('a.key = 'b.key))
    'UnresolvedRelation [t3], Some(a)
    'UnresolvedRelation [t2], Some(b)
   'UnresolvedRelation [t1], Some(c)

== Analyzed Logical Plan ==
key: int
Sort [key#176228 ASC], false
 Project [key#176228]
  Join RightOuter, Some((key#176228 = key#176232))
   Join LeftSemi, Some((key#176228 = key#176230))
    MetastoreRelation default, t3, Some(a)
    MetastoreRelation default, t2, Some(b)
   MetastoreRelation default, t1, Some(c)

== Optimized Logical Plan ==
Sort [key#176228 ASC], false
 Project [key#176228]
  Join RightOuter, Some((key#176228 = key#176232))
   Project [key#176228]
    Join LeftSemi, Some((key#176228 = key#176230))
     Project [key#176228]
      MetastoreRelation default, t3, Some(a)
     Project [key#176230]
      MetastoreRelation default, t2, Some(b)
   Project [key#176232]
    MetastoreRelation default, t1, Some(c)

== Physical Plan ==
ExternalSort [key#176228 ASC], false
 Project [key#176228]
  ConvertToSafe
   BroadcastHashOuterJoin [key#176228], [key#176232], RightOuter, None
    ConvertToUnsafe
     Project [key#176228]
      ConvertToSafe
       BroadcastLeftSemiJoinHash [key#176228], [key#176230], None
        ConvertToUnsafe
         HiveTableScan [key#176228], (MetastoreRelation default, t3, Some(a))
        ConvertToUnsafe
         HiveTableScan [key#176230], (MetastoreRelation default, t2, Some(b))
    ConvertToUnsafe
     HiveTableScan [key#176232], (MetastoreRelation default, t1, Some(c))

Code Generation: true
== RDD ==
key
!== HIVE - 31 row(s) ==   == CATALYST - 30 row(s) ==
 0                         0
 [matching 0 and 10 rows collapsed by the original line wrapping]
 10                        10
!4                         8
!4                         8
!8                         NULL
!8                         NULL
 NULL                      NULL
!NULL
Stacktrace

sbt.ForkMain$ForkError: 
Results do not match for semijoin:
== Parsed Logical Plan ==
'Sort ['a.key ASC], false
 'Project [unresolvedalias('a.key)]
  'Join RightOuter, Some(('a.key = 'c.key))
   'Join LeftSemi, Some(('a.key = 'b.key))
'UnresolvedRelation [t3], Some(a)
'UnresolvedRelation [t2], Some(b)
   'UnresolvedRelation [t1], Some(c)

== Analyzed Logical Plan ==
key: int
Sort [key#176228 ASC], false
 Project [key#176228]
  Join RightOuter, Some((key#176228 = key#176232))
   Join LeftSemi, Some((key#176228 = key#176230))
MetastoreRelation default, t3, Some(a)
MetastoreRelation default, t2, Some(b)
   MetastoreRelation default, t1, Some(c)

== Optimized Logical Plan ==
Sort [key#176228 ASC], false
 Project [key#176228]
  Join RightOuter, Some((key#176228 = key#176232))
   Project [key#176228]
Join LeftSemi, Some((key#176228 = key#176230))
 Project [key#176228]
  MetastoreRelation default, t3, Some(a)
 Project [key#176230]
  MetastoreRelation default, t2, Some(b)
   Project [key#176232]
MetastoreRelation default, t1, Some(c)

== Physical Plan ==
ExternalSort [key#176228 ASC], false
 Project [key#176228]
  ConvertToSafe
   BroadcastHashOuterJoin [key#176228], [key#176232], RightOuter, None
ConvertToUnsafe
 Project [key#176228]
  ConvertToSafe
   BroadcastLeftSemiJoinHash [key#176228], [key#176230], None
ConvertToUnsafe
 HiveTableScan [key#176228], (MetastoreRelation default, t3, Some(a))
C

[jira] [Updated] (SPARK-9482) flaky test: org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.semijoin

2015-07-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9482:
-
Labels: flaky-test  (was: )

> flaky test: 
> org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.semijoin
> ---
>
> Key: SPARK-9482
> URL: https://issues.apache.org/jira/browse/SPARK-9482
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Yin Huai
>Priority: Critical
>  Labels: flaky-test
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39059/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/semijoin/
> {code}
> Regression
> org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.semijoin
> Failing for the past 1 build (Since Failed#39059 )
> Took 7.7 sec.
> Error Message
> Results do not match for semijoin:
>
> == Parsed Logical Plan ==
> 'Sort ['a.key ASC], false
>  'Project [unresolvedalias('a.key)]
>   'Join RightOuter, Some(('a.key = 'c.key))
>    'Join LeftSemi, Some(('a.key = 'b.key))
>     'UnresolvedRelation [t3], Some(a)
>     'UnresolvedRelation [t2], Some(b)
>    'UnresolvedRelation [t1], Some(c)
>
> == Analyzed Logical Plan ==
> key: int
> Sort [key#176228 ASC], false
>  Project [key#176228]
>   Join RightOuter, Some((key#176228 = key#176232))
>    Join LeftSemi, Some((key#176228 = key#176230))
>     MetastoreRelation default, t3, Some(a)
>     MetastoreRelation default, t2, Some(b)
>    MetastoreRelation default, t1, Some(c)
>
> == Optimized Logical Plan ==
> Sort [key#176228 ASC], false
>  Project [key#176228]
>   Join RightOuter, Some((key#176228 = key#176232))
>    Project [key#176228]
>     Join LeftSemi, Some((key#176228 = key#176230))
>      Project [key#176228]
>       MetastoreRelation default, t3, Some(a)
>      Project [key#176230]
>       MetastoreRelation default, t2, Some(b)
>    Project [key#176232]
>     MetastoreRelation default, t1, Some(c)
>
> == Physical Plan ==
> ExternalSort [key#176228 ASC], false
>  Project [key#176228]
>   ConvertToSafe
>    BroadcastHashOuterJoin [key#176228], [key#176232], RightOuter, None
>     ConvertToUnsafe
>      Project [key#176228]
>       ConvertToSafe
>        BroadcastLeftSemiJoinHash [key#176228], [key#176230], None
>         ConvertToUnsafe
>          HiveTableScan [key#176228], (MetastoreRelation default, t3, Some(a))
>         ConvertToUnsafe
>          HiveTableScan [key#176230], (MetastoreRelation default, t2, Some(b))
>     ConvertToUnsafe
>      HiveTableScan [key#176232], (MetastoreRelation default, t1, Some(c))
>
> Code Generation: true
> == RDD ==
> key
> !== HIVE - 31 row(s) ==   == CATALYST - 30 row(s) ==
>  0                         0
>  [matching 0 and 10 rows collapsed by the original line wrapping]
>  10                        10
> !4                         8
> !4                         8
> !8                         NULL
> !8                         NULL
>  NULL                      NULL
> !NULL
> Stacktrace
> sbt.ForkMain$ForkError: 
> Results do not match for semijoin:
> == Parsed Logical Plan ==
> 'Sort ['a.key ASC], false
>  'Project [unresolvedalias('a.key)]
>   'Join RightOuter, Some(('a.key = 'c.key))
>'Join LeftSemi, Some(('a.key = 'b.key))
> 'UnresolvedRelation [t3], Some(a)
> 'UnresolvedRelation [t2], Some(b)
>'UnresolvedRelation [t1], Some(c)
> == Analyzed Logical Plan ==
> key: int
> Sort [key#176228 ASC], false
>  Project [key#176228]
>   Join RightOuter, Some((key#176228 = key#176232))
>Join LeftSemi, Some((key#176228 = key#176230))
> MetastoreRelation default, t3, Some(a)
> MetastoreRelation default, t2, Some(b)
>MetastoreRelation default, t1, Some(c)
> == Optimized Logical Plan ==
> Sort [key#176228 ASC], false
>  Project [key#176228]
>   Join RightOuter, Some((key#176228 = key#176232))
>Project [key#176228]
> Join LeftSemi, Some((key#176228 = key#176230))
>  Project [key#176228]
>   MetastoreRelation default, t3, Some(a)
>  Project [key#176230]
>   MetastoreRelation default, t2, Some(b)
>Project [key#176232]
> MetastoreRelation default, t1, Some(c)
>

[jira] [Commented] (SPARK-8497) Graph Clique(Complete Connected Sub-graph) Discovery Algorithm

2015-07-30 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648126#comment-14648126
 ] 

Xiangrui Meng commented on SPARK-8497:
--

Please provide the algorithm you want to implement, which should be based on 
some published work for correctness. I don't know how to handle the exponential 
growth of the number of cliques. For example, if we have a clique of size 40, 
there will be (40 choose 20) cliques of size 20, which is more than 100 billion.
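
A quick sanity check of that count (a standalone arithmetic sketch, not Spark 
code):

{code}
# Illustrative only: the number of size-20 sub-cliques contained in a single
# clique of 40 vertices, i.e. C(40, 20).
from math import comb  # Python 3.8+

print(comb(40, 20))  # 137846528820 -- roughly 138 billion, i.e. more than 100 billion
{code}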

> Graph Clique(Complete Connected Sub-graph) Discovery Algorithm
> --
>
> Key: SPARK-8497
> URL: https://issues.apache.org/jira/browse/SPARK-8497
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, ML, MLlib, Spark Core
>Reporter: Fan Jiang
>Assignee: Fan Jiang
>  Labels: features
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In recent years, the social network industry has had a high demand for 
> complete connected sub-graph (clique) discovery, and so has telecom. Similar 
> to the connection graph from Twitter, the calls and other activities in the 
> telecom world form a huge social graph, and because of the nature of the 
> communication medium it captures the strongest inter-person relationships, so 
> graph-based analysis will reveal tremendous value from telecom connections. 
> We need an algorithm in Spark to find ALL the strongest completely connected 
> sub-graphs (called cliques here) for EVERY person in the network, which will 
> be one of the starting points for understanding users' social behaviour. 
> At Huawei, we have many real-world use cases that involve telecom social 
> graphs with tens of billions of edges and hundreds of millions of vertices, 
> and the number of cliques will also be in the tens of millions. The graph 
> changes quickly, which means we need to analyse the graph pattern very often 
> (one result per day/week for a moving time window spanning multiple months). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4449) specify port range in spark

2015-07-30 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648090#comment-14648090
 ] 

Neelesh Srinivas Salian commented on SPARK-4449:


I would like to pick this up and work on it.

Could you please assign the JIRA to me?

Thank you.


> specify port range in spark
> ---
>
> Key: SPARK-4449
> URL: https://issues.apache.org/jira/browse/SPARK-4449
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Fei Wang
>Priority: Minor
>
>  In some cases, we need to specify the port range used in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8174) date/time function: unix_timestamp

2015-07-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8174.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7644
[https://github.com/apache/spark/pull/7644]

> date/time function: unix_timestamp
> --
>
> Key: SPARK-8174
> URL: https://issues.apache.org/jira/browse/SPARK-8174
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Adrian Wang
>Priority: Blocker
> Fix For: 1.5.0
>
>
> 3 variants:
> {code}
> unix_timestamp(): long
> Gets current Unix timestamp in seconds.
> unix_timestamp(string|date): long
> Converts a time string in the format yyyy-MM-dd HH:mm:ss to a Unix timestamp 
> (in seconds), using the default timezone and the default locale; returns 0 on 
> failure: unix_timestamp('2009-03-20 11:30:01') = 1237573801
> unix_timestamp(string date, string pattern): long
> Converts a time string with the given pattern (see 
> [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) 
> to a Unix timestamp (in seconds); returns 0 on failure: 
> unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400.
> {code}
> See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
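
A minimal usage sketch of the three variants through Spark SQL (assumes the 
predefined {{sqlContext}} of a 1.5-era pyspark shell; illustrative only, not 
code from the pull request):

{code}
# Illustrative PySpark sketch; assumes the pyspark shell's predefined sqlContext.
row = sqlContext.sql("""
    SELECT unix_timestamp(),
           unix_timestamp('2009-03-20 11:30:01'),
           unix_timestamp('2009-03-20', 'yyyy-MM-dd')
""").first()
print(row)  # the last two values depend on the session/system timezone
{code}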



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8175) date/time function: from_unixtime

2015-07-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8175.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7644
[https://github.com/apache/spark/pull/7644]

> date/time function: from_unixtime
> -
>
> Key: SPARK-8175
> URL: https://issues.apache.org/jira/browse/SPARK-8175
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Adrian Wang
> Fix For: 1.5.0
>
>
> from_unixtime(bigint unixtime[, string format]): string
> Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a 
> string representing the timestamp of that moment in the current system time 
> zone in the format of "1970-01-01 00:00:00".
> See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
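
A matching usage sketch (same assumptions as the unix_timestamp example above: 
the pyspark shell's predefined {{sqlContext}}; illustrative only):

{code}
# Illustrative PySpark sketch; assumes the pyspark shell's predefined sqlContext.
print(sqlContext.sql(
    "SELECT from_unixtime(0), from_unixtime(0, 'yyyy-MM-dd')"
).first())
# formatted output depends on the current system timezone
{code}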



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-9282) Filter on Spark DataFrame with multiple columns

2015-07-30 Thread Sandeep Pal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal reopened SPARK-9282:


On using '&' instead of 'and', the following error occurs:

Py4JError Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 df1.filter(df1.age > 21 & df1.age < 45).show(10)

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py in 
_(self, other)
999 def _(self, other):
   1000 jc = other._jc if isinstance(other, Column) else other
-> 1001 njc = getattr(self._jc, name)(jc)
   1002 return Column(njc)
   1003 _.__doc__ = doc

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
 in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539 
540 for temp_arg in temp_args:

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
 in get_return_value(answer, gateway_client, target_id, name)
302 raise Py4JError(
303 'An error occurred while calling {0}{1}{2}. 
Trace:\n{3}\n'.
--> 304 format(target_id, '.', name, value))
305 else:
306 raise Py4JError(

Py4JError: An error occurred while calling o83.and. Trace:
py4j.Py4JException: Method and([class java.lang.Integer]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
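
The underlying cause is Python operator precedence: {{&}} binds more tightly 
than the comparison operators, so each comparison has to be parenthesized. A 
minimal sketch of the working form, reusing the df1 example from the 
description below:

{code}
# Parenthesize each comparison when combining Column expressions with & or |.
# Without parentheses Python evaluates `21 & df1.age` first, which is what
# triggers the "Method and([class java.lang.Integer]) does not exist" error above.
df1.filter((df1.age > 21) & (df1.age < 45)).show(10)
{code}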

> Filter on Spark DataFrame with multiple columns
> ---
>
> Key: SPARK-9282
> URL: https://issues.apache.org/jira/browse/SPARK-9282
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, SQL
>Affects Versions: 1.3.0
> Environment: CDH 5.0 on CentOS6
>Reporter: Sandeep Pal
>
> Filter on a DataFrame does not work if we have more than one column inside the 
> filter. However, it works on an RDD.
> Following is the example:
> df1.show()
> age coolid depid empname
> 23  7  1 sandeep
> 21  8  2 john   
> 24  9  1 cena   
> 45  12 3 bob
> 20  7  4 tanay  
> 12  8  5 gaurav 
> df1.filter(df1.age > 21 and df1.age < 45).show(10)
> 23  7  1 sandeep
> 21  8  2 john   <-
> 24  9  1 cena   
> 20  7  4 tanay <-
> 12  8  5 gaurav   <--



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9481) LocalLDAModel logLikelihood

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9481:
---

Assignee: (was: Apache Spark)

> LocalLDAModel logLikelihood
> ---
>
> Key: SPARK-9481
> URL: https://issues.apache.org/jira/browse/SPARK-9481
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Feynman Liang
>Priority: Trivial
>
> We already have a variational {{bound}} method so we should provide a public 
> {{logLikelihood}} that uses the model's parameters



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9481) LocalLDAModel logLikelihood

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9481:
---

Assignee: Apache Spark

> LocalLDAModel logLikelihood
> ---
>
> Key: SPARK-9481
> URL: https://issues.apache.org/jira/browse/SPARK-9481
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Feynman Liang
>Assignee: Apache Spark
>Priority: Trivial
>
> We already have a variational {{bound}} method so we should provide a public 
> {{logLikelihood}} that uses the model's parameters



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9481) LocalLDAModel logLikelihood

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648039#comment-14648039
 ] 

Apache Spark commented on SPARK-9481:
-

User 'feynmanliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/7801

> LocalLDAModel logLikelihood
> ---
>
> Key: SPARK-9481
> URL: https://issues.apache.org/jira/browse/SPARK-9481
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Feynman Liang
>Priority: Trivial
>
> We already have a variational {{bound}} method so we should provide a public 
> {{logLikelihood}} that uses the model's parameters



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9437) SizeEstimator overflows for primitive arrays

2015-07-30 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648025#comment-14648025
 ] 

Shivaram Venkataraman commented on SPARK-9437:
--

Resolved by https://github.com/apache/spark/pull/7750

> SizeEstimator overflows for primitive arrays
> 
>
> Key: SPARK-9437
> URL: https://issues.apache.org/jira/browse/SPARK-9437
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
> Fix For: 1.5.0
>
>
> {{SizeEstimator}} can overflow when dealing w/ large primitive arrays eg if 
> you have an {{Array[Double]}} of size 1 << 28.  This means that when you try 
> to broadcast a large primitive array, you get:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed: sizeInBytes was 
> negative: -2147483608
>at scala.Predef$.require(Predef.scala:233)
>at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
>at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:815)
>at 
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
> ...
> {noformat}
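
For reference, the arithmetic behind that negative size (a standalone sketch; 
the ~40-byte array/bookkeeping overhead is an assumption for illustration, not 
a value taken from SizeEstimator):

{code}
# Illustrative only: why a 32-bit size estimate wraps for an Array[Double] of 1 << 28 elements.
n = 1 << 28                             # 268,435,456 doubles
payload = 8 * n                         # 2,147,483,648 bytes, already > Int.MaxValue (2,147,483,647)
overhead = 40                           # assumed array header/bookkeeping bytes (hypothetical)
wrapped = (payload + overhead) - 2**32  # what a wrapped 32-bit signed counter would report
print(payload, wrapped)                 # 2147483648 -2147483608 -- matches the sizeInBytes above
{code}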



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9437) SizeEstimator overflows for primitive arrays

2015-07-30 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-9437.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

> SizeEstimator overflows for primitive arrays
> 
>
> Key: SPARK-9437
> URL: https://issues.apache.org/jira/browse/SPARK-9437
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
> Fix For: 1.5.0
>
>
> {{SizeEstimator}} can overflow when dealing w/ large primitive arrays eg if 
> you have an {{Array[Double]}} of size 1 << 28.  This means that when you try 
> to broadcast a large primitive array, you get:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed: sizeInBytes was 
> negative: -2147483608
>at scala.Predef$.require(Predef.scala:233)
>at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
>at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:815)
>at 
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8850) Turn unsafe mode on by default

2015-07-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8850.

   Resolution: Fixed
Fix Version/s: 1.5.0

> Turn unsafe mode on by default
> --
>
> Key: SPARK-8850
> URL: https://issues.apache.org/jira/browse/SPARK-8850
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Josh Rosen
> Fix For: 1.5.0
>
>
> Let's turn unsafe on and see what bugs we find in preparation for 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9388) Make log messages in ExecutorRunnable more readable

2015-07-30 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-9388.
---
   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 1.5.0

> Make log messages in ExecutorRunnable more readable
> ---
>
> Key: SPARK-9388
> URL: https://issues.apache.org/jira/browse/SPARK-9388
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.5.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Trivial
> Fix For: 1.5.0
>
>
> There's a couple of debug messages printed in ExecutorRunnable containing 
> information about the container being started. They're printed all in one 
> line, which makes them - especially the one containing the process's 
> environment - hard to read.
> We should make them nicer (like the similar one printed by Client.scala).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8297) Scheduler backend is not notified in case node fails in YARN

2015-07-30 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-8297.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

> Scheduler backend is not notified in case node fails in YARN
> 
>
> Key: SPARK-8297
> URL: https://issues.apache.org/jira/browse/SPARK-8297
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.0
> Environment: Spark on yarn - both client and cluster mode.
>Reporter: Mridul Muralidharan
>Assignee: Mridul Muralidharan
>Priority: Critical
> Fix For: 1.5.0
>
>
> When a node crashes, YARN detects the failure and notifies Spark - but this 
> information is not propagated to the scheduler backend (unlike in Mesos mode, 
> for example).
> This results in repeated re-execution of stages (due to FetchFailedException 
> on the shuffle side) and finally in application failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9361) Refactor new aggregation code to reduce the times of checking compatibility

2015-07-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-9361:

Assignee: Liang-Chi Hsieh

> Refactor new aggregation code to reduce the times of checking compatibility
> ---
>
> Key: SPARK-9361
> URL: https://issues.apache.org/jira/browse/SPARK-9361
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 1.5.0
>
>
> Currently, we call aggregate.Utils.tryConvert in many places to check whether 
> the logical.aggregate can be run with the new aggregation code. But it looks 
> like aggregate.Utils.tryConvert takes a long time to run. We should only call 
> tryConvert once, keep its value in logical.aggregate, and reuse it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9361) Refactor new aggregation code to reduce the times of checking compatibility

2015-07-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-9361.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7677
[https://github.com/apache/spark/pull/7677]

> Refactor new aggregation code to reduce the times of checking compatibility
> ---
>
> Key: SPARK-9361
> URL: https://issues.apache.org/jira/browse/SPARK-9361
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Liang-Chi Hsieh
> Fix For: 1.5.0
>
>
> Currently, we call aggregate.Utils.tryConvert in many places to check whether 
> the logical.aggregate can be run with the new aggregation code. But it looks 
> like aggregate.Utils.tryConvert takes a long time to run. We should only call 
> tryConvert once, keep its value in logical.aggregate, and reuse it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9361) Refactor new aggregation code to reduce the times of checking compatibility

2015-07-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-9361:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-4366

> Refactor new aggregation code to reduce the times of checking compatibility
> ---
>
> Key: SPARK-9361
> URL: https://issues.apache.org/jira/browse/SPARK-9361
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Currently, we call aggregate.Utils.tryConvert in many places to check whether 
> the logical.aggregate can be run with the new aggregation code. But it looks 
> like aggregate.Utils.tryConvert takes a long time to run. We should only call 
> tryConvert once, keep its value in logical.aggregate, and reuse it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9481) LocalLDAModel logLikelihood

2015-07-30 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-9481:
-
Issue Type: Improvement  (was: Sub-task)
Parent: (was: SPARK-5572)

> LocalLDAModel logLikelihood
> ---
>
> Key: SPARK-9481
> URL: https://issues.apache.org/jira/browse/SPARK-9481
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Feynman Liang
>Priority: Trivial
>
> We already have a variational {{bound}} method so we should provide a public 
> {{logLikelihood}} that uses the model's parameters



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9481) LocalLDAModel logLikelihood

2015-07-30 Thread Feynman Liang (JIRA)
Feynman Liang created SPARK-9481:


 Summary: LocalLDAModel logLikelihood
 Key: SPARK-9481
 URL: https://issues.apache.org/jira/browse/SPARK-9481
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Feynman Liang
Priority: Trivial


We already have a variational {{bound}} method so we should provide a public 
{{logLikelihood}} that uses the model's parameters



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9481) LocalLDAModel logLikelihood

2015-07-30 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647991#comment-14647991
 ] 

Feynman Liang commented on SPARK-9481:
--

Working on this

> LocalLDAModel logLikelihood
> ---
>
> Key: SPARK-9481
> URL: https://issues.apache.org/jira/browse/SPARK-9481
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Feynman Liang
>Priority: Trivial
>
> We already have a variational {{bound}} method so we should provide a public 
> {{logLikelihood}} that uses the model's parameters



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9267) Remove highly unnecessary accumulators stringify methods

2015-07-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9267:
-
Assignee: François Garillot

> Remove highly unnecessary accumulators stringify methods
> 
>
> Key: SPARK-9267
> URL: https://issues.apache.org/jira/browse/SPARK-9267
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: François Garillot
>Priority: Trivial
> Fix For: 1.5.0
>
>
> {code}
> def stringifyPartialValue(partialValue: Any): String = 
> "%s".format(partialValue)
> def stringifyValue(value: Any): String = "%s".format(value)
> {code}
> These are only used in 1 place (DAGScheduler). The level of indirection 
> actually makes the code harder to read without an editor. We should just 
> inline them...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9267) Remove highly unnecessary accumulators stringify methods

2015-07-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9267.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7678
[https://github.com/apache/spark/pull/7678]

> Remove highly unnecessary accumulators stringify methods
> 
>
> Key: SPARK-9267
> URL: https://issues.apache.org/jira/browse/SPARK-9267
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Priority: Trivial
> Fix For: 1.5.0
>
>
> {code}
> def stringifyPartialValue(partialValue: Any): String = 
> "%s".format(partialValue)
> def stringifyValue(value: Any): String = "%s".format(value)
> {code}
> These are only used in 1 place (DAGScheduler). The level of indirection 
> actually makes the code harder to read without an editor. We should just 
> inline them...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9390) Create an array abstract class ArrayData and a default implementation backed by Array[Object]

2015-07-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9390.

   Resolution: Fixed
Fix Version/s: 1.5.0

> Create an array abstract class ArrayData and a default implementation backed 
> by Array[Object]
> -
>
> Key: SPARK-9390
> URL: https://issues.apache.org/jira/browse/SPARK-9390
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
> Fix For: 1.5.0
>
>
> {code}
> interface ArrayData implements SpecializedGetters {
>   int numElements();
>   int sizeInBytes();
> }
> {code}
> We should also add to SpecializedGetters a method to get array, i.e.
> {code}
> interface SpecializedGetters {
>   ...
>   ArrayData getArray(int ordinal);
>   ...
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored

2015-07-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2089.
--
Resolution: Won't Fix

> With YARN, preferredNodeLocalityData isn't honored 
> ---
>
> Key: SPARK-2089
> URL: https://issues.apache.org/jira/browse/SPARK-2089
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>Priority: Critical
>
> When running in YARN cluster mode, apps can pass preferred locality data when 
> constructing a Spark context that will dictate where to request executor 
> containers.
> This is currently broken because of a race condition.  The Spark-YARN code 
> runs the user class and waits for it to start up a SparkContext.  During its 
> initialization, the SparkContext will create a YarnClusterScheduler, which 
> notifies a monitor in the Spark-YARN code that the SparkContext is ready.  The 
> Spark-YARN code then immediately fetches the preferredNodeLocationData from 
> the SparkContext and uses it to start requesting containers.
> But in the SparkContext constructor that takes the preferredNodeLocationData, 
> setting preferredNodeLocationData comes after the rest of the initialization, 
> so, if the Spark-YARN code comes around quickly enough after being notified, 
> the data that's fetched is the empty unset version.  This occurred during all 
> of my runs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9248) Closing curly-braces should always be on their own line

2015-07-30 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-9248:
-
Assignee: Yu Ishikawa

> Closing curly-braces should always be on their own line
> ---
>
> Key: SPARK-9248
> URL: https://issues.apache.org/jira/browse/SPARK-9248
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>Assignee: Yu Ishikawa
>Priority: Minor
> Fix For: 1.5.0
>
>
> Closing curly-braces should always be on their own line
> For example,
> {noformat}
> inst/tests/test_sparkSQL.R:606:3: style: Closing curly-braces should always 
> be on their own line, unless it's followed by an else.
>   }, error = function(err) {
>   ^
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9248) Closing curly-braces should always be on their own line

2015-07-30 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-9248.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7795
[https://github.com/apache/spark/pull/7795]

> Closing curly-braces should always be on their own line
> ---
>
> Key: SPARK-9248
> URL: https://issues.apache.org/jira/browse/SPARK-9248
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>Priority: Minor
> Fix For: 1.5.0
>
>
> Closing curly-braces should always be on their own line
> For example,
> {noformat}
> inst/tests/test_sparkSQL.R:606:3: style: Closing curly-braces should always 
> be on their own line, unless it's followed by an else.
>   }, error = function(err) {
>   ^
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9077) Improve error message for decision trees when numExamples < maxCategoriesPerFeature

2015-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647937#comment-14647937
 ] 

Apache Spark commented on SPARK-9077:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7800

> Improve error message for decision trees when numExamples < 
> maxCategoriesPerFeature
> ---
>
> Key: SPARK-9077
> URL: https://issues.apache.org/jira/browse/SPARK-9077
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> See [SPARK-9075]'s discussion for details.  We should improve the current 
> error message to recommend that the user remove the high-arity categorical 
> features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9077) Improve error message for decision trees when numExamples < maxCategoriesPerFeature

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9077:
---

Assignee: (was: Apache Spark)

> Improve error message for decision trees when numExamples < 
> maxCategoriesPerFeature
> ---
>
> Key: SPARK-9077
> URL: https://issues.apache.org/jira/browse/SPARK-9077
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Trivial
>  Labels: starter
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> See [SPARK-9075]'s discussion for details.  We should improve the current 
> error message to recommend that the user remove the high-arity categorical 
> features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9077) Improve error message for decision trees when numExamples < maxCategoriesPerFeature

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9077:
---

Assignee: Apache Spark

> Improve error message for decision trees when numExamples < 
> maxCategoriesPerFeature
> ---
>
> Key: SPARK-9077
> URL: https://issues.apache.org/jira/browse/SPARK-9077
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: starter
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> See [SPARK-9075]'s discussion for details.  We should improve the current 
> error message to recommend that the user remove the high-arity categorical 
> features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9477) Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (beside Yarn and Mesos).

2015-07-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647910#comment-14647910
 ] 

Sean Owen commented on SPARK-9477:
--

Seems reasonable to me -- anybody else have an opinion? If not, I'll update the 
wiki after a day or two.

> Adding IBM Platform Application Service Controller into Spark documentation 
> as a supported Cluster Manager (beside Yarn and Mesos). 
> 
>
> Key: SPARK-9477
> URL: https://issues.apache.org/jira/browse/SPARK-9477
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.4.1
>Reporter: Stacy Pedersen
>Priority: Minor
> Fix For: 1.4.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7583) User guide update for RegexTokenizer

2015-07-30 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647904#comment-14647904
 ] 

yuhao yang commented on SPARK-7583:
---

I'd like to give this a try if it is still needed.

> User guide update for RegexTokenizer
> 
>
> Key: SPARK-7583
> URL: https://issues.apache.org/jira/browse/SPARK-7583
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Copied from [SPARK-7443]:
> {quote}
> Now that we have algorithms in spark.ml which are not in spark.mllib, we 
> should start making subsections for the spark.ml API as needed. We can follow 
> the structure of the spark.mllib user guide.
> * The spark.ml user guide can provide: (a) code examples and (b) info on 
> algorithms which do not exist in spark.mllib.
> * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
> still the primary API, we should provide links to the corresponding 
> algorithms in the spark.mllib user guide for more info.
> {quote}
> Note: I created a new subsection for links to spark.ml-specific guides in 
> this JIRA's PR: [SPARK-7557]. This transformer can go within the new 
> subsection. I'll try to get that PR merged ASAP.
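
For context, a minimal sketch of the kind of spark.ml code example such a guide 
subsection could show (the pyspark.ml RegexTokenizer API of the 1.4/1.5 era is 
assumed; the column names and regex are illustrative, not taken from the actual 
guide):

{code}
# Minimal RegexTokenizer sketch; assumes the pyspark shell's predefined sqlContext.
from pyspark.ml.feature import RegexTokenizer

df = sqlContext.createDataFrame([(0, "Hi I heard about Spark")], ["id", "sentence"])
tokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W")
tokenizer.transform(df).select("words").show()
{code}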



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark

2015-07-30 Thread K S Sreenivasa Raghavan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647893#comment-14647893
 ] 

K S Sreenivasa Raghavan commented on SPARK-6227:


Hi,
I have worked with PySpark in EdX courses. The course coordinators distributed 
a Spark VM to all participants of the course. I am interested in developing 
this package, and I have also learnt Scala. I have a few questions:

1. Please give me the proper steps to install Spark on my Ubuntu desktop, as I 
have no idea how to modify the Spark code in the VM. I tried all the methods I 
found via a Google search, but they failed.
2. For PySpark, where should we write/modify the code?


> PCA and SVD for PySpark
> ---
>
> Key: SPARK-6227
> URL: https://issues.apache.org/jira/browse/SPARK-6227
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 1.2.1
>Reporter: Julien Amelot
>
> The Dimensionality Reduction techniques are not available via Python (Scala + 
> Java only).
> * Principal component analysis (PCA)
> * Singular value decomposition (SVD)
> Doc:
> http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9480) Create an map abstract class MapData and a default implementation backed by 2 ArrayData

2015-07-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9480:
---

Assignee: Apache Spark

> Create an map abstract class MapData and a default implementation backed by 2 
> ArrayData
> ---
>
> Key: SPARK-9480
> URL: https://issues.apache.org/jira/browse/SPARK-9480
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


