[jira] [Created] (SPARK-6286) Handle TASK_ERROR in TaskState

2015-03-11 Thread Iulian Dragos (JIRA)
Iulian Dragos created SPARK-6286:


 Summary: Handle TASK_ERROR in TaskState
 Key: SPARK-6286
 URL: https://issues.apache.org/jira/browse/SPARK-6286
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Iulian Dragos
Priority: Minor


Scala warning:

{code}
match may not be exhaustive. It would fail on the following input: TASK_ERROR
{code}
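The match in question converts Mesos task states into Spark's TaskState. A minimal sketch of an exhaustive version is below; mapping TASK_ERROR to LOST is an assumption made for illustration, not necessarily the change that will be merged.

{code}
import org.apache.mesos.Protos.{TaskState => MesosTaskState}
import org.apache.spark.TaskState

// Sketch only: add a case for TASK_ERROR so the match is exhaustive.
def fromMesosState(state: MesosTaskState): TaskState.TaskState = state match {
  case MesosTaskState.TASK_STAGING | MesosTaskState.TASK_STARTING => TaskState.LAUNCHING
  case MesosTaskState.TASK_RUNNING => TaskState.RUNNING
  case MesosTaskState.TASK_FINISHED => TaskState.FINISHED
  case MesosTaskState.TASK_FAILED => TaskState.FAILED
  case MesosTaskState.TASK_KILLED => TaskState.KILLED
  case MesosTaskState.TASK_LOST | MesosTaskState.TASK_ERROR => TaskState.LOST // assumed mapping
}
{code}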




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec

2015-03-11 Thread ANUPAM MEDIRATTA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357055#comment-14357055
 ] 

ANUPAM MEDIRATTA commented on SPARK-5692:
-

Manoj Kumar,

Not yet but plan to work on it over the weekend. Is that okay?


 Model import/export for Word2Vec
 

 Key: SPARK-5692
 URL: https://issues.apache.org/jira/browse/SPARK-5692
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: ANUPAM MEDIRATTA

 Support save and load for Word2VecModel. We may want to discuss whether we 
 want to be compatible with the original Word2Vec model storage format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec

2015-03-11 Thread ANUPAM MEDIRATTA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357053#comment-14357053
 ] 

ANUPAM MEDIRATTA commented on SPARK-5692:
-

Manoj Kumar,

Not yet but plan to work on it over the weekend. Is that okay?


 Model import/export for Word2Vec
 

 Key: SPARK-5692
 URL: https://issues.apache.org/jira/browse/SPARK-5692
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: ANUPAM MEDIRATTA

 Support save and load for Word2VecModel. We may want to discuss whether we 
 want to be compatible with the original Word2Vec model storage format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-5692) Model import/export for Word2Vec

2015-03-11 Thread ANUPAM MEDIRATTA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ANUPAM MEDIRATTA updated SPARK-5692:

Comment: was deleted

(was: Manoj Kumar,

Not yet but plan to work on it over the weekend. Is that okay?
)

 Model import/export for Word2Vec
 

 Key: SPARK-5692
 URL: https://issues.apache.org/jira/browse/SPARK-5692
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: ANUPAM MEDIRATTA

 Support save and load for Word2VecModel. We may want to discuss whether we 
 want to be compatible with the original Word2Vec model storage format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec

2015-03-11 Thread ANUPAM MEDIRATTA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357057#comment-14357057
 ] 

ANUPAM MEDIRATTA commented on SPARK-5692:
-

Manoj Kumar,

Not yet but plan to work on it over the weekend. Is that okay?


 Model import/export for Word2Vec
 

 Key: SPARK-5692
 URL: https://issues.apache.org/jira/browse/SPARK-5692
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: ANUPAM MEDIRATTA

 Support save and load for Word2VecModel. We may want to discuss whether we 
 want to be compatible with the original Word2Vec model storage format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6283) Add a CassandraInputDStream to stream from a C* table

2015-03-11 Thread Helena Edelson (JIRA)
Helena Edelson created SPARK-6283:
-

 Summary: Add a CassandraInputDStream to stream from a C* table
 Key: SPARK-6283
 URL: https://issues.apache.org/jira/browse/SPARK-6283
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Helena Edelson


Add support for streaming from Cassandra to Spark Streaming - external.

Related ticket: https://datastax-oss.atlassian.net/browse/SPARKC-40 

[~helena_e] is doing the work.
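One plausible shape for this is a custom receiver that polls a table and hands rows to Spark Streaming. The sketch below is illustrative only; the Cassandra-specific parts (in particular fetchNewRows) are placeholders, since the real implementation belongs in the spark-cassandra-connector (SPARKC-40).

{code}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CassandraTableReceiver(keyspace: String, table: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Poll the table from a background thread and push new rows into the stream.
    new Thread("cassandra-table-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          fetchNewRows(keyspace, table).foreach(row => store(row))
          Thread.sleep(1000)
        }
      }
    }.start()
  }

  def onStop(): Unit = { /* close connections / stop the polling thread here */ }

  // Placeholder: a real implementation would page through the table and
  // remember where it left off.
  private def fetchNewRows(ks: String, t: String): Seq[String] = Seq.empty
}
{code}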



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6286) Handle TASK_ERROR in TaskState

2015-03-11 Thread Iulian Dragos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Iulian Dragos updated SPARK-6286:
-
Labels: mesos  (was: )

 Handle TASK_ERROR in TaskState
 --

 Key: SPARK-6286
 URL: https://issues.apache.org/jira/browse/SPARK-6286
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Iulian Dragos
Priority: Minor
  Labels: mesos

 Scala warning:
 {code}
 match may not be exhaustive. It would fail on the following input: TASK_ERROR
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods

2015-03-11 Thread mgdadv (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357162#comment-14357162
 ] 

mgdadv commented on SPARK-6189:
---

The current behavior really is quite confusing, in particular with R datasets, 
where the period is often used.

I think it is not obvious what the correct thing to do is. The patch above 
replaces the period with an underscore. This fixes the problem, but it could be 
problematic if a different solution is wanted in the future, since scripts 
relying on this behavior would then have to be changed.

Alternatively one could just spit out a warning. The problem is that Spark is 
quite verbose and the warning might be missed. This would be the least 
intrusive solution I can think of.

Another possibility would be to raise an exception instead of just printing a 
warning.
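For comparison, the dot-to-underscore approach is easy to express on the Scala side as well. The report is about the Python pandas path, so this is only an illustrative workaround sketch, not the proposed fix:

{code}
import org.apache.spark.sql.DataFrame

// Rename any column containing a period before using the DSL.
def sanitizeColumnNames(df: DataFrame): DataFrame =
  df.columns.filter(_.contains(".")).foldLeft(df) { (d, name) =>
    d.withColumnRenamed(name, name.replace('.', '_'))
  }
{code}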

 Pandas to DataFrame conversion should check field names for periods
 ---

 Key: SPARK-6189
 URL: https://issues.apache.org/jira/browse/SPARK-6189
 Project: Spark
  Issue Type: Improvement
  Components: DataFrame, SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 Issue I ran into:  I imported an R dataset in CSV format into a Pandas 
 DataFrame and then used toDF() to convert that into a Spark DataFrame.  The R 
 dataset had a column with a period in it (column GNP.deflator in the 
 longley dataset).  When I tried to select it using the Spark DataFrame DSL, 
 I could not because the DSL thought the period was selecting a field within 
 GNP.
 Also, since GNP is another field's name, it gives an error which could be 
 obscure to users, complaining:
 {code}
 org.apache.spark.sql.AnalysisException: GetField is not valid on fields of 
 type DoubleType;
 {code}
 We should either handle periods in column names or check during loading and 
 warn/fail gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6277) Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf

2015-03-11 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357072#comment-14357072
 ] 

Marcelo Vanzin commented on SPARK-6277:
---

Sometimes I do miss being able to reference other configs and/or system 
properties / env variables from the config. It would allow some interesting use 
cases, at least from a distribution's point of view.

This should be pretty simple to achieve using the commons-config library, 
although I'd prefer avoiding the dependency since it would be more friendly to 
SPARK-4824.
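If the commons-config dependency is to be avoided, a dependency-free substitution pass over each value read from spark-defaults.conf might look roughly like this. It is a sketch, not existing Spark behavior; references like ${user.name} or ${USER} are resolved against system properties first, then environment variables, and left untouched otherwise:

{code}
import scala.util.matching.Regex

val pattern = """\$\{([^}]+)\}""".r

def interpolate(value: String): String =
  pattern.replaceAllIn(value, m =>
    Regex.quoteReplacement(
      sys.props.get(m.group(1))
        .orElse(sys.env.get(m.group(1)))
        .getOrElse(m.matched)))   // leave unknown references untouched

// e.g. interpolate("/home/${USER}/spark/tmp") expands ${USER} from the environment
{code}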

 Allow Hadoop configurations and env variables to be referenced in 
 spark-defaults.conf
 -

 Key: SPARK-6277
 URL: https://issues.apache.org/jira/browse/SPARK-6277
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.3.0, 1.2.1
Reporter: Jianshi Huang

 I need to set spark.local.dir to use user local home instead of /tmp, but 
 currently spark-defaults.conf can only allow constant values.
 What I want to do is to write:
 bq. spark.local.dir /home/${user.name}/spark/tmp
 or
 bq. spark.local.dir /home/${USER}/spark/tmp
 Otherwise I would have to hack bin/spark-class and pass the option through 
 -Dspark.local.dir
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6277) Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf

2015-03-11 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357072#comment-14357072
 ] 

Marcelo Vanzin edited comment on SPARK-6277 at 3/11/15 3:49 PM:


Sometimes I do miss being able to reference other configs and/or system 
properties / env variables from the config. It would allow some interesting use 
cases, at least from a distribution's point of view.

This should be pretty simple to achieve using the commons-config library, 
although I'd prefer avoiding the dependency since it would be more friendly to 
SPARK-4924.


was (Author: vanzin):
Sometimes I do miss being able to reference other configs and/or system 
properties / env variables from the config. It would allow some interesting use 
cases, at least from a distribution's point of view.

This should be pretty simple to achieve using the commons-config library, 
although I'd prefer avoiding the dependency since it would be more friendly to 
SPARK-4824.

 Allow Hadoop configurations and env variables to be referenced in 
 spark-defaults.conf
 -

 Key: SPARK-6277
 URL: https://issues.apache.org/jira/browse/SPARK-6277
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.3.0, 1.2.1
Reporter: Jianshi Huang

 I need to set spark.local.dir to use user local home instead of /tmp, but 
 currently spark-defaults.conf can only allow constant values.
 What I want to do is to write:
 bq. spark.local.dir /home/${user.name}/spark/tmp
 or
 bq. spark.local.dir /home/${USER}/spark/tmp
 Otherwise I would have to hack bin/spark-class and pass the option through 
 -Dspark.local.dir
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods

2015-03-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357086#comment-14357086
 ] 

Apache Spark commented on SPARK-6189:
-

User 'mgdadv' has created a pull request for this issue:
https://github.com/apache/spark/pull/4982

 Pandas to DataFrame conversion should check field names for periods
 ---

 Key: SPARK-6189
 URL: https://issues.apache.org/jira/browse/SPARK-6189
 Project: Spark
  Issue Type: Improvement
  Components: DataFrame, SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 Issue I ran into:  I imported an R dataset in CSV format into a Pandas 
 DataFrame and then used toDF() to convert that into a Spark DataFrame.  The R 
 dataset had a column with a period in it (column GNP.deflator in the 
 longley dataset).  When I tried to select it using the Spark DataFrame DSL, 
 I could not because the DSL thought the period was selecting a field within 
 GNP.
 Also, since GNP is another field's name, it gives an error which could be 
 obscure to users, complaining:
 {code}
 org.apache.spark.sql.AnalysisException: GetField is not valid on fields of 
 type DoubleType;
 {code}
 We should either handle periods in column names or check during loading and 
 warn/fail gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-5692) Model import/export for Word2Vec

2015-03-11 Thread ANUPAM MEDIRATTA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ANUPAM MEDIRATTA updated SPARK-5692:

Comment: was deleted

(was: Manoj Kumar,

Not yet but plan to work on it over the weekend. Is that okay?
)

 Model import/export for Word2Vec
 

 Key: SPARK-5692
 URL: https://issues.apache.org/jira/browse/SPARK-5692
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: ANUPAM MEDIRATTA

 Support save and load for Word2VecModel. We may want to discuss whether we 
 want to be compatible with the original Word2Vec model storage format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-5692) Model import/export for Word2Vec

2015-03-11 Thread ANUPAM MEDIRATTA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ANUPAM MEDIRATTA updated SPARK-5692:

Comment: was deleted

(was: Manoj Kumar,

Not yet but plan to work on it over the weekend. Is that okay?
)

 Model import/export for Word2Vec
 

 Key: SPARK-5692
 URL: https://issues.apache.org/jira/browse/SPARK-5692
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: ANUPAM MEDIRATTA

 Support save and load for Word2VecModel. We may want to discuss whether we 
 want to be compatible with the original Word2Vec model storage format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-5692) Model import/export for Word2Vec

2015-03-11 Thread ANUPAM MEDIRATTA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ANUPAM MEDIRATTA updated SPARK-5692:

Comment: was deleted

(was: Manoj Kumar,

Not yet but plan to work on it over the weekend. Is that okay?
)

 Model import/export for Word2Vec
 

 Key: SPARK-5692
 URL: https://issues.apache.org/jira/browse/SPARK-5692
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: ANUPAM MEDIRATTA

 Support save and load for Word2VecModel. We may want to discuss whether we 
 want to be compatible with the original Word2Vec model storage format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6284) Support framework authentication and role in Mesos framework

2015-03-11 Thread Timothy Chen (JIRA)
Timothy Chen created SPARK-6284:
---

 Summary: Support framework authentication and role in Mesos 
framework
 Key: SPARK-6284
 URL: https://issues.apache.org/jira/browse/SPARK-6284
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen


Support framework authentication and role in both coarse-grained and 
fine-grained mode.
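At the Mesos API level this mostly means populating the role and principal fields and passing a Credential when constructing the driver. The sketch below uses illustrative names, and the protobuf field shapes vary across Mesos versions, so treat it as an assumption rather than the planned patch:

{code}
import com.google.protobuf.ByteString
import org.apache.mesos.Protos.{Credential, FrameworkInfo}
import org.apache.mesos.{MesosSchedulerDriver, Scheduler}

// "spark", "spark-principal" and the secret are illustrative values only.
def createAuthenticatedDriver(scheduler: Scheduler, master: String): MesosSchedulerDriver = {
  val framework = FrameworkInfo.newBuilder()
    .setUser("")                       // let Mesos pick the current user
    .setName("Spark")
    .setRole("spark")                  // resource role to register under
    .setPrincipal("spark-principal")
    .build()
  val credential = Credential.newBuilder()
    .setPrincipal("spark-principal")
    .setSecret(ByteString.copyFromUtf8("framework-secret"))
    .build()
  new MesosSchedulerDriver(scheduler, framework, master, credential)
}
{code}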



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec

2015-03-11 Thread ANUPAM MEDIRATTA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357054#comment-14357054
 ] 

ANUPAM MEDIRATTA commented on SPARK-5692:
-

Manoj Kumar,

Not yet but plan to work on it over the weekend. Is that okay?


 Model import/export for Word2Vec
 

 Key: SPARK-5692
 URL: https://issues.apache.org/jira/browse/SPARK-5692
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: ANUPAM MEDIRATTA

 Support save and load for Word2VecModel. We may want to discuss whether we 
 want to be compatible with the original Word2Vec model storage format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec

2015-03-11 Thread ANUPAM MEDIRATTA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357056#comment-14357056
 ] 

ANUPAM MEDIRATTA commented on SPARK-5692:
-

Manoj Kumar,

Not yet but plan to work on it over the weekend. Is that okay?


 Model import/export for Word2Vec
 

 Key: SPARK-5692
 URL: https://issues.apache.org/jira/browse/SPARK-5692
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: ANUPAM MEDIRATTA

 Support save and load for Word2VecModel. We may want to discuss whether we 
 want to be compatible with the original Word2Vec model storage format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec

2015-03-11 Thread ANUPAM MEDIRATTA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357058#comment-14357058
 ] 

ANUPAM MEDIRATTA commented on SPARK-5692:
-

Manoj Kumar,

Not yet but plan to work on it over the weekend. Is that okay?


 Model import/export for Word2Vec
 

 Key: SPARK-5692
 URL: https://issues.apache.org/jira/browse/SPARK-5692
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: ANUPAM MEDIRATTA

 Support save and load for Word2VecModel. We may want to discuss whether we 
 want to be compatible with the original Word2Vec model storage format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6285) Duplicated code leads to errors

2015-03-11 Thread Iulian Dragos (JIRA)
Iulian Dragos created SPARK-6285:


 Summary: Duplicated code leads to errors
 Key: SPARK-6285
 URL: https://issues.apache.org/jira/browse/SPARK-6285
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Iulian Dragos


The following class is duplicated inside 
[ParquetTestData|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTestData.scala#L39]
 and 
[ParquetIOSuite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala#L44],
 with exact same code and fully qualified name:

{code}
org.apache.spark.sql.parquet.TestGroupWriteSupport
{code}

The second one was introduced in 
[3b395e10|https://github.com/apache/spark/commit/3b395e10510782474789c9098084503f98ca4830],
 but even though it mentions that `ParquetTestData` should be removed later, I 
couldn't find a corresponding Jira ticket.

This duplicate class causes the Eclipse builder to fail (since src/main and 
src/test are compiled together in Eclipse, unlike Sbt).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-5692) Model import/export for Word2Vec

2015-03-11 Thread ANUPAM MEDIRATTA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ANUPAM MEDIRATTA updated SPARK-5692:

Comment: was deleted

(was: Manoj Kumar,

Not yet but plan to work on it over the weekend. Is that okay?
)

 Model import/export for Word2Vec
 

 Key: SPARK-5692
 URL: https://issues.apache.org/jira/browse/SPARK-5692
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: ANUPAM MEDIRATTA

 Support save and load for Word2VecModel. We may want to discuss whether we 
 want to be compatible with the original Word2Vec model storage format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6284) Support framework authentication and role in Mesos framework

2015-03-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357181#comment-14357181
 ] 

Apache Spark commented on SPARK-6284:
-

User 'tnachen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4960

 Support framework authentication and role in Mesos framework
 

 Key: SPARK-6284
 URL: https://issues.apache.org/jira/browse/SPARK-6284
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen

 Support framework authentication and role in both coarse-grained and 
 fine-grained mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-03-11 Thread Jason Hubbard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357545#comment-14357545
 ] 

Jason Hubbard commented on SPARK-2243:
--

It also mentions changing the broadcast factory because it doesn't work 
properly with multiple spark contexts:
https://github.com/spark-jobserver/spark-jobserver/blob/5f3cadcc95465fe6d97fdffbf78e38ef5342ffa1/job-server/src/main/resources/application.conf#L29
I believe that is probably related to the JIRA already mentioned in this post, 
but is marked as won't fix: SPARK-3148

I've run multiple Spark contexts in a single JVM, and the problem I see is that 
the Akka actor on the worker tries to connect to the wrong driver.

 Support multiple SparkContexts in the same JVM
 --

 Key: SPARK-2243
 URL: https://issues.apache.org/jira/browse/SPARK-2243
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager, Spark Core
Affects Versions: 0.7.0, 1.0.0, 1.1.0
Reporter: Miguel Angel Fernandez Diaz

 We're developing a platform where we create several Spark contexts for 
 carrying out different calculations. Is there any restriction when using 
 several Spark contexts? We have two contexts, one for Spark calculations and 
 another one for Spark Streaming jobs. The next error arises when we first 
 execute a Spark calculation and, once the execution is finished, a Spark 
 Streaming job is launched:
 {code}
 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
   at 
 org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
   at 
 java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
   at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
 java.io.FileNotFoundException
 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 

[jira] [Created] (SPARK-6289) PySpark doesn't maintain SQL Types

2015-03-11 Thread Michael Nazario (JIRA)
Michael Nazario created SPARK-6289:
--

 Summary: PySpark doesn't maintain SQL Types
 Key: SPARK-6289
 URL: https://issues.apache.org/jira/browse/SPARK-6289
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.2.1
Reporter: Michael Nazario


For the TimestampType, Spark SQL requires a datetime.date in Python. However, 
if you collect a row based on that type, you'll end up with a returned value 
which is type datetime.datetime.

I have tried to reproduce this using the pyspark shell, but have been unable 
to. This is definitely a problem coming from pyrolite though:

https://github.com/irmen/Pyrolite/

Pyrolite is being used for datetime and date serialization, but it appears to 
map dates to datetime objects rather than to date objects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6289) PySpark doesn't maintain SQL date Types

2015-03-11 Thread Michael Nazario (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Nazario updated SPARK-6289:
---
Summary: PySpark doesn't maintain SQL date Types  (was: PySpark doesn't 
maintain SQL Types)

 PySpark doesn't maintain SQL date Types
 ---

 Key: SPARK-6289
 URL: https://issues.apache.org/jira/browse/SPARK-6289
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.2.1
Reporter: Michael Nazario

 For the TimestampType, Spark SQL requires a datetime.date in Python. However, 
 if you collect a row based on that type, you'll end up with a returned value 
 which is type datetime.datetime.
 I have tried to reproduce this using the pyspark shell, but have been unable 
 to. This is definitely a problem coming from pyrolite though:
 https://github.com/irmen/Pyrolite/
 Pyrolite is being used for datetime and date serialization, but it appears to 
 map dates to datetime objects rather than to date objects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6270) Standalone Master hangs when streaming job completes

2015-03-11 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357627#comment-14357627
 ] 

Josh Rosen commented on SPARK-6270:
---

In the long run, my preference is to remove HistoryServer-like responsibilities 
from the Master: the standalone Master is typically configured with a small 
amount of memory and risks OOMing when loading UIs, even if the UI loading is 
done asynchronously (right now it blocks the main event processing thread).

We might consider trying to add lazy loading as an intermediate stepping-stone 
to properly fixing this issue, but I'd like to argue against that approach: 
lazy loading inside of the Master is going to require mechanisms similar to 
what we have in the HistoryServer's loaderServlet, so we're either going to 
have to duplicate a bunch of code or change the HistoryServer code to be more 
modular so that we can reuse its components inside the Master.

Another consideration is firewall / port issues: currently, the master web UI and 
the Spark web UIs that it loads are served on the same port.  If we set up a 
new Jetty server for the UIs, whether in the same Master JVM or in a separate 
HistoryServer process, then the Spark UIs will be served at some different 
port, potentially breaking those links in environments where only the master 
web UI port is exposed.  I think it's going to be really painful to avoid this, 
though, and I don't think we should resort to solutions where we proxy the 
Spark UI through the master UI, since the responses could be huge and lead to 
OOMs in the proxy.

I think we should introduce a new configuration that completely disables the 
master's Spark UI serving feature, backport this to all maintenance branches, 
and mention this feature in the release notes.

For Spark 1.4, I think we should completely remove the web UI serving from the 
Master and provide the ability to configure the master with a HistoryServer 
address which will be used to generate links to UIs.  This runs into its own 
set of problems, though: the current HistoryServer FSHistoryProvider assumes 
that all applications' event logs are located in the same directory, whereas 
the Master can load event logs from any directory which is specified in the 
application description.  This means that we'll need a way to instruct the 
HistoryServer to load logs from an arbitrary path.  Therefore, maybe we should 
extend the HistoryServer's HTTP interface to allow requests to specify the 
event log location (falling back to the history server's default event log 
directory if no alternate log location was specified).  This could have 
security implications, though; we'd have to be careful to ensure that this 
doesn't allow arbitrary file reads.

 Standalone Master hangs when streaming job completes
 

 Key: SPARK-6270
 URL: https://issues.apache.org/jira/browse/SPARK-6270
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Streaming
Affects Versions: 1.2.0, 1.3.0, 1.2.1
Reporter: Tathagata Das
Priority: Critical

 If the event logging is enabled, the Spark Standalone Master tries to 
 recreate the web UI of a completed Spark application from its event logs. 
 However if this event log is huge (e.g. for a Spark Streaming application), 
 then the master hangs in its attempt to read and recreate the web ui. This 
 hang causes the whole standalone cluster to be unusable. 
 Workaround is to disable the event logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357497#comment-14357497
 ] 

Jeremy Freeman edited comment on SPARK-2429 at 3/11/15 8:10 PM:


Thanks for the update and contribution [~yuu.ishik...@gmail.com]! I think I 
agree with [~josephkb] that it is worth bringing this into MLlib, as the 
algorithm itself will translate to future uses, and many groups (including 
ours!) will find it useful now.

It might be worth adding to spark-packages, especially if we expect the review 
to take awhile. Those seem especially useful as a way to provide easy access to 
testing experimental pieces of functionality. But I'd probably prioritize just 
reviewing the patch.

Also agree with the others that we should start a new PR with the new 
algorithm, 1000x faster is a lot! It is worth incorporating some of the comments 
from the old PR if you haven't already, if relevant in the new version.

I'd be happy to go through the new PR as I'm quite familiar with the problem / 
algorithm, but it would help if you could say a little more about what you did 
so differently here, to help guide me as I look at the code.


was (Author: freeman-lab):
Thanks for the update and contribution [~yuu.ishik...@gmail.com]! I think I 
agree with [~josephkb] that it is worth bringing this into MLlib, as the 
algorithm itself will translate to future uses, and many groups (including 
ours!) will find it useful now.

It might be worth adding to spark-packages, especially if we expect the review 
to take awhile. Those seem especially useful as a way to provide easy access to 
testing new pieces of functionality. But I'd probably prioritize just reviewing 
the patch.

Also agree with the others that we should start a new PR with the new 
algorithm, 1000x faster is a lot! It is worth incorporating some of the comments 
from the old PR if you haven't already, if relevant in the new version.

I'd be happy to go through the new PR as I'm quite familiar with the problem / 
algorithm, but it would help if you could say a little more about what you did 
so differently here, to help guide me as I look at the code.

 Hierarchical Implementation of KMeans
 -

 Key: SPARK-2429
 URL: https://issues.apache.org/jira/browse/SPARK-2429
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
  Labels: clustering
 Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
 Result of Benchmarking a Hierarchical Clustering.pdf, 
 benchmark-result.2014-10-29.html, benchmark2.html


 Hierarchical clustering algorithms are widely used and would make a nice 
 addition to MLlib.  Clustering algorithms are useful for determining 
 relationships between clusters as well as offering faster assignment. 
 Discussion on the dev list suggested the following possible approaches:
 * Top down, recursive application of KMeans
 * Reuse DecisionTree implementation with different objective function
 * Hierarchical SVD
 It was also suggested that support for distance metrics other than Euclidean 
 such as negative dot or cosine are necessary.
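To make the first approach above concrete, here is a toy sketch of top-down recursive (bisecting) KMeans built on the existing MLlib KMeans; it is illustrative only and far less efficient than a real implementation would be:

{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Toy bisecting k-means: repeatedly split the largest cluster with k = 2.
def bisect(data: RDD[Vector], leaves: Int): Seq[RDD[Vector]] = {
  var clusters: Seq[RDD[Vector]] = Seq(data)
  while (clusters.size < leaves) {
    val largest = clusters.maxBy(_.count())          // naive: recounts each pass
    val model = KMeans.train(largest, 2, 20)         // k = 2, maxIterations = 20
    val split = (0 until 2).map(i => largest.filter(v => model.predict(v) == i))
    clusters = clusters.filterNot(_ eq largest) ++ split
  }
  clusters
}
{code}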



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357497#comment-14357497
 ] 

Jeremy Freeman commented on SPARK-2429:
---

Thanks for the update and contribution [~yuu.ishik...@gmail.com]! I think I 
agree with [~josephkb] that it is worth bringing this into MLlib, as the 
algorithm itself will translate to future uses, and many groups (including 
ours!) will find it useful now.

It might be worth adding to spark-packages, especially if we expect the review 
to take awhile. Those seem especially useful as a way to provide easy access to 
testing new pieces of functionality. But I'd probably prioritize just reviewing 
the patch.

Also agree with the others that we should start a new PR with the new 
algorithm, 1000x faster is a lot! It is worth incorporating some of the comments 
from the old PR if you haven't already, if relevant in the new version.

I'd be happy to go through the new PR as I'm quite familiar with the problem / 
algorithm, but it would help if you could say a little more about what you did 
so differently here, to help guide me as I look at the code.

 Hierarchical Implementation of KMeans
 -

 Key: SPARK-2429
 URL: https://issues.apache.org/jira/browse/SPARK-2429
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
  Labels: clustering
 Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
 Result of Benchmarking a Hierarchical Clustering.pdf, 
 benchmark-result.2014-10-29.html, benchmark2.html


 Hierarchical clustering algorithms are widely used and would make a nice 
 addition to MLlib.  Clustering algorithms are useful for determining 
 relationships between clusters as well as offering faster assignment. 
 Discussion on the dev list suggested the following possible approaches:
 * Top down, recursive application of KMeans
 * Reuse DecisionTree implementation with different objective function
 * Hierarchical SVD
 It was also suggested that support for distance metrics other than Euclidean 
 such as negative dot or cosine are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5987) Model import/export for GaussianMixtureModel

2015-03-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357502#comment-14357502
 ] 

Apache Spark commented on SPARK-5987:
-

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/4986

 Model import/export for GaussianMixtureModel
 

 Key: SPARK-5987
 URL: https://issues.apache.org/jira/browse/SPARK-5987
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar

 Support save/load for GaussianMixtureModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5050) Add unit test for sqdist

2015-03-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357616#comment-14357616
 ] 

Apache Spark commented on SPARK-5050:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/4985

 Add unit test for sqdist
 

 Key: SPARK-5050
 URL: https://issues.apache.org/jira/browse/SPARK-5050
 Project: Spark
  Issue Type: Test
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh
Priority: Minor
 Fix For: 1.3.0


 Related to #3643. Follow the previous suggestion to add unit test for sqdist 
 in VectorsSuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4797) Replace breezeSquaredDistance

2015-03-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357617#comment-14357617
 ] 

Apache Spark commented on SPARK-4797:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/4985

 Replace breezeSquaredDistance
 -

 Key: SPARK-4797
 URL: https://issues.apache.org/jira/browse/SPARK-4797
 Project: Spark
  Issue Type: Improvement
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh
Priority: Minor
 Fix For: 1.3.0


 This PR replaces slow breezeSquaredDistance. A simple calculation involving 
 4 squared distances between the vectors of 2 dims shows:
 * breezeSquaredDistance: ~12 secs
 * This PR: ~10.5 secs
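 The core of such a speedup is avoiding intermediate Breeze allocations; a minimal dense-vector sketch of the idea:
 {code}
 // Illustrative only: one fused pass over both arrays, no temporaries.
 def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
   require(a.length == b.length, "vectors must have the same dimension")
   var sum = 0.0
   var i = 0
   while (i < a.length) {
     val d = a(i) - b(i)
     sum += d * d
     i += 1
   }
   sum
 }
 {code}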



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6206) spark-ec2 script reporting SSL error?

2015-03-11 Thread Joe O (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357260#comment-14357260
 ] 

Joe O edited comment on SPARK-6206 at 3/11/15 5:38 PM:
---

Ok close this one out.

The problem was local configuration. 

I had installed the Google Cloud Services tools, which created a ~/.boto file.

The boto library that came packaged with Spark picked up on this and started 
using the settings in the file, which conflicted with what I was telling Spark 
to use.

Renaming the ~/.boto file temporarily caused the spark-ec2 script to start 
working again.

Documenting everything here in case someone else runs into this problem.


was (Author: joe6521):
Ok close this one out.

The problem was local configuration. 

I had installed the Google Cloud Services tools, which created a ~/.boto file.

The boto library that packaged with Spark picked up on this and started using 
the settings in the file, which conflicted with what I was telling Spark to use.

Renaming the ~/.boto file temporarily caused the spark-ec2 script to start 
working again.

Documenting everything here in case someone else runs into this problem.

 spark-ec2 script reporting SSL error?
 -

 Key: SPARK-6206
 URL: https://issues.apache.org/jira/browse/SPARK-6206
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.2.0
Reporter: Joe O

 I have been using the spark-ec2 script for several months with no problems.
 Recently, when executing a script to launch a cluster I got the following 
 error:
 {code}
 [Errno 185090050] _ssl.c:344: error:0B084002:x509 certificate 
 routines:X509_load_cert_crl_file:system lib
 {code}
 Nothing launches, the script exits.
 I am not sure if something on machine changed, this is a problem with EC2's 
 certs, or a problem with Python. 
 It occurs 100% of the time, and has been occurring over at least the last two 
 days. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6274) Add streaming examples showing integration with DataFrames and SQL

2015-03-11 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-6274.
--
   Resolution: Fixed
Fix Version/s: 1.3.1
   1.4.0

 Add streaming examples showing integration with DataFrames and SQL
 --

 Key: SPARK-6274
 URL: https://issues.apache.org/jira/browse/SPARK-6274
 Project: Spark
  Issue Type: Improvement
  Components: Examples, Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
 Fix For: 1.4.0, 1.3.1


 Self explanatory



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6227) PCA and SVD for PySpark

2015-03-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6227:
-
Priority: Major  (was: Minor)

 PCA and SVD for PySpark
 ---

 Key: SPARK-6227
 URL: https://issues.apache.org/jira/browse/SPARK-6227
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.2.1
Reporter: Julien Amelot

 The Dimensionality Reduction techniques are not available via Python (Scala + 
 Java only).
 * Principal component analysis (PCA)
 * Singular value decomposition (SVD)
 Doc:
 http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark

2015-03-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357354#comment-14357354
 ] 

Joseph K. Bradley commented on SPARK-6227:
--

I was about to---but then I realized this is a much bigger task than it appears 
since it will require writing Python wrappers for the various distributed 
matrix types.  I just made this a subtask of another JIRA.  Could you take a 
look at the parent JIRA and the distributed matrices code and figure out a good 
piece of the work to start with?  Hopefully we can break the work into pieces 
in a natural way.  Thanks!

 PCA and SVD for PySpark
 ---

 Key: SPARK-6227
 URL: https://issues.apache.org/jira/browse/SPARK-6227
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.2.1
Reporter: Julien Amelot

 The Dimensionality Reduction techniques are not available via Python (Scala + 
 Java only).
 * Principal component analysis (PCA)
 * Singular value decomposition (SVD)
 Doc:
 http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357367#comment-14357367
 ] 

RJ Nowling commented on SPARK-2429:
---

Hi [~yuu.ishik...@gmail.com]

I think the new implementation is great.  Did you change the algorithm?

I've spoken with [~srowen].  The hierarchical clustering would be valuable to 
the community -- I actually had a couple people reach out to me about it. 
However, Spark is currently undergoing the transition to the new ML API and as 
such, there is concern about accepting code into the older MLlib library.  With 
the announcement of Spark packages, there is also a move to encourage external 
libraries instead of large commits into Spark itself.

Would you be interested in publishing your hierarchical clustering 
implementation as an external library like [~derrickburns] did for the [KMeans 
Mini Batch 
implementation|https://github.com/derrickburns/generalized-kmeans-clustering]?  
 It could be listed in the [Spark packages index|http://spark-packages.org/] 
along with two other clustering packages so users can find it.

 Hierarchical Implementation of KMeans
 -

 Key: SPARK-2429
 URL: https://issues.apache.org/jira/browse/SPARK-2429
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
  Labels: clustering
 Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
 Result of Benchmarking a Hierarchical Clustering.pdf, 
 benchmark-result.2014-10-29.html, benchmark2.html


 Hierarchical clustering algorithms are widely used and would make a nice 
 addition to MLlib.  Clustering algorithms are useful for determining 
 relationships between clusters as well as offering faster assignment. 
 Discussion on the dev list suggested the following possible approaches:
 * Top down, recursive application of KMeans
 * Reuse DecisionTree implementation with different objective function
 * Hierarchical SVD
 It was also suggested that support for distance metrics other than Euclidean 
 such as negative dot or cosine are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6282) Strange Python import error when using random() in a lambda function

2015-03-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357382#comment-14357382
 ] 

Sean Owen commented on SPARK-6282:
--

http://stackoverflow.com/questions/11133506/importerror-while-importing-winreg-module-of-python

It sounds like something you are calling invokes a Windows-only Python library 
called winreg, but you're executing on Linux.
This doesn't sound Spark-related, as certainly Spark does not invoke this.

 Strange Python import error when using random() in a lambda function
 

 Key: SPARK-6282
 URL: https://issues.apache.org/jira/browse/SPARK-6282
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Kubuntu 14.04, Python 2.7.6
Reporter: Pavel Laskov
Priority: Minor

 Consider the example Python code below:
 {code}
 from random import random
 from pyspark.context import SparkContext
 from xval_mllib import read_csv_file_as_list

 if __name__ == "__main__":
     sc = SparkContext(appName="Random() bug test")
     data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv'))
     # data = sc.parallelize([1, 2, 3, 4, 5], 2)
     d = data.map(lambda x: (random(), x))
     print d.first()
 {code}
 Data is read from a large CSV file. Running this code results in a Python 
 import error:
 ImportError: No module named _winreg
 If I use 'import random' and 'random.random()' in the lambda function no 
 error occurs. Also no error occurs, for both kinds of import statements, for 
 a small artificial data set like the one shown in a commented line.  
 The full error trace, the source code of csv reading code (function 
 'read_csv_file_as_list' is my own) as well as a sample dataset (the original 
 dataset is about 8M large) can be provided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-03-11 Thread Peter Rudenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357591#comment-14357591
 ] 

Peter Rudenko commented on SPARK-2243:
--

Unfortunately it doesn't work in spark-jobserver (at least for version 1.2.+). 
Take a look at [this 
thread|https://groups.google.com/d/msg/spark-jobserver/f466U2vydMY/s3b0xPNn4U8J].

 Support multiple SparkContexts in the same JVM
 --

 Key: SPARK-2243
 URL: https://issues.apache.org/jira/browse/SPARK-2243
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager, Spark Core
Affects Versions: 0.7.0, 1.0.0, 1.1.0
Reporter: Miguel Angel Fernandez Diaz

 We're developing a platform where we create several Spark contexts for 
 carrying out different calculations. Is there any restriction when using 
 several Spark contexts? We have two contexts, one for Spark calculations and 
 another one for Spark Streaming jobs. The next error arises when we first 
 execute a Spark calculation and, once the execution is finished, a Spark 
 Streaming job is launched:
 {code}
 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
   at 
 org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
   at 
 java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
   at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
 java.io.FileNotFoundException
 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at 

[jira] [Commented] (SPARK-6206) spark-ec2 script reporting SSL error?

2015-03-11 Thread Joe O (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357260#comment-14357260
 ] 

Joe O commented on SPARK-6206:
--

OK, closing this one out.

The problem was local configuration. 

I had installed the Google Cloud Services tools, which created a ~/.boto file.

The boto library packaged with Spark picked up on this and started using the 
settings in the file, which conflicted with what I was telling Spark to use.

Temporarily renaming the ~/.boto file caused the spark-ec2 script to start 
working again.

Documenting everything here in case someone else runs into this problem.

 spark-ec2 script reporting SSL error?
 -

 Key: SPARK-6206
 URL: https://issues.apache.org/jira/browse/SPARK-6206
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.2.0
Reporter: Joe O

 I have been using the spark-ec2 script for several months with no problems.
 Recently, when executing a script to launch a cluster I got the following 
 error:
 {code}
 [Errno 185090050] _ssl.c:344: error:0B084002:x509 certificate 
 routines:X509_load_cert_crl_file:system lib
 {code}
 Nothing launches, the script exits.
 I am not sure if something on my machine changed, if this is a problem with 
 EC2's certs, or if it is a problem with Python. 
 It occurs 100% of the time, and has been occurring over at least the last two 
 days. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5186) Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size

2015-03-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357272#comment-14357272
 ] 

Apache Spark commented on SPARK-5186:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/4985

 Vector.equals  and Vector.hashCode are very inefficient and fail on 
 SparseVectors with large size
 -

 Key: SPARK-5186
 URL: https://issues.apache.org/jira/browse/SPARK-5186
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Derrick Burns
Assignee: yuhao yang
 Fix For: 1.3.0

   Original Estimate: 0.25h
  Remaining Estimate: 0.25h

 The implementations of Vector.equals and Vector.hashCode are correct but slow 
 for SparseVectors that are truly sparse.
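
To make the intent concrete, here is a minimal sketch (plain Scala, not the actual MLlib patch; the class and field names are illustrative) of a sparse-aware hashCode/equals that only walks the stored nonzeros instead of all `size` logical entries:

{code}
// Illustrative only: hash and compare by the stored (index, value) pairs.
// Assumes indices are sorted and no explicit zeros are stored.
final class SimpleSparseVector(val size: Int, val indices: Array[Int], val values: Array[Double]) {

  override def hashCode(): Int = {
    var h = 31 * 17 + size
    var i = 0
    while (i < indices.length) {
      val bits = java.lang.Double.doubleToLongBits(values(i))
      h = 31 * (31 * h + indices(i)) + (bits ^ (bits >>> 32)).toInt
      i += 1
    }
    h
  }

  override def equals(other: Any): Boolean = other match {
    case that: SimpleSparseVector =>
      size == that.size &&
        java.util.Arrays.equals(indices, that.indices) &&
        java.util.Arrays.equals(values, that.values)
    case _ => false
  }
}
{code}

The cost becomes proportional to the number of nonzeros rather than the declared vector size, which is the difference that matters for truly sparse vectors.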



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6288) Pyrolite calls hashCode to cache previously serialized objects

2015-03-11 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-6288:


 Summary: Pyrolite calls hashCode to cache previously serialized 
objects
 Key: SPARK-6288
 URL: https://issues.apache.org/jira/browse/SPARK-6288
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.2.1, 1.1.1, 1.0.2, 1.3.0
Reporter: Xiangrui Meng
Assignee: Josh Rosen


https://github.com/irmen/Pyrolite/blob/v2.0/java/src/net/razorvine/pickle/Pickler.java#L140

This operation could be quite expensive, compared to serializing the object 
directly, because hashCode usually needs to access all data stored in the 
object. Maybe we should disable this feature by default.
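
A rough illustration of the cost argument (a standalone sketch, not Pyrolite's code; Payload is a made-up class): hashing an object that wraps a large array already walks every element, so using hashCode as a memo key costs about as much as a full pass over the data.

{code}
// Sketch: hashCode over a large payload is O(n) in the data it holds, which is
// roughly the same order of work as serializing the object once.
final case class Payload(data: Array[Double]) {
  override def hashCode(): Int = java.util.Arrays.hashCode(data) // touches every element
}

object MemoCost {
  def main(args: Array[String]): Unit = {
    val p = Payload(Array.fill(5000000)(math.random))
    val t0 = System.nanoTime()
    val h = p.hashCode()
    val t1 = System.nanoTime()
    println(s"hashCode=$h took ${(t1 - t0) / 1e6} ms")
  }
}
{code}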



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357380#comment-14357380
 ] 

Joseph K. Bradley edited comment on SPARK-2429 at 3/11/15 6:47 PM:
---

But as far as the old vs. new algorithm, it sounds like it would make sense to 
replace the old one with the new one for this PR (though I have not yet had a 
chance to compare and understand them in detail).  Are there tradeoffs?


was (Author: josephkb):
But as far as the old vs. new algorithm, it sounds like it would make sense to 
replace the old one with the new one for this PR (though I have not yet had a 
chance to compare and understand them in detail).

 Hierarchical Implementation of KMeans
 -

 Key: SPARK-2429
 URL: https://issues.apache.org/jira/browse/SPARK-2429
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
  Labels: clustering
 Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
 Result of Benchmarking a Hierarchical Clustering.pdf, 
 benchmark-result.2014-10-29.html, benchmark2.html


 Hierarchical clustering algorithms are widely used and would make a nice 
 addition to MLlib.  Clustering algorithms are useful for determining 
 relationships between clusters as well as offering faster assignment. 
 Discussion on the dev list suggested the following possible approaches:
 * Top down, recursive application of KMeans
 * Reuse DecisionTree implementation with different objective function
 * Hierarchical SVD
 It was also suggested that support for distance metrics other than Euclidean 
 such as negative dot or cosine are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6286) Handle TASK_ERROR in TaskState

2015-03-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357385#comment-14357385
 ] 

Sean Owen commented on SPARK-6286:
--

Probably OK for 1.4.x. Do you know what to do with this case?
[~jongyoul] do you have an opinion? I think you made the update to 0.21.0.

 Handle TASK_ERROR in TaskState
 --

 Key: SPARK-6286
 URL: https://issues.apache.org/jira/browse/SPARK-6286
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Iulian Dragos
Priority: Minor
  Labels: mesos

 Scala warning:
 {code}
 match may not be exhaustive. It would fail on the following input: TASK_ERROR
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357407#comment-14357407
 ] 

Joseph K. Bradley commented on SPARK-2429:
--

I'll try to prioritize it, though the next week or so will be difficult because 
of the Spark Summit.  It will be valuable to have your input still since you're 
familiar with the PR.  Thanks for both of your efforts on this!

 Hierarchical Implementation of KMeans
 -

 Key: SPARK-2429
 URL: https://issues.apache.org/jira/browse/SPARK-2429
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
  Labels: clustering
 Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
 Result of Benchmarking a Hierarchical Clustering.pdf, 
 benchmark-result.2014-10-29.html, benchmark2.html


 Hierarchical clustering algorithms are widely used and would make a nice 
 addition to MLlib.  Clustering algorithms are useful for determining 
 relationships between clusters as well as offering faster assignment. 
 Discussion on the dev list suggested the following possible approaches:
 * Top down, recursive application of KMeans
 * Reuse DecisionTree implementation with different objective function
 * Hierarchical SVD
 It was also suggested that support for distance metrics other than Euclidean 
 such as negative dot or cosine are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2015-03-11 Thread Craig Lukasik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357463#comment-14357463
 ] 

Craig Lukasik commented on SPARK-2243:
--

The Spark Job Server uses multiple SparkContexts. I think the trick might be 
based on using a separate class loader. See line 250 here: 
https://github.com/ooyala/spark-jobserver/blob/master/job-server/src/spark.jobserver/JobManagerActor.scala
 . Sadly, I could not replicate this trick in Java (subsequent uses of the Spark 
API in my code resulted in ClassCastExceptions). I wonder if Scala is more 
forgiving?
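
For reference, the classloader-isolation idea looks roughly like the sketch below (illustrative only; spark-assembly.jar and com.example.MyJob are placeholders, and this is not how spark-jobserver itself is wired). The cast has to be to a JDK type such as Runnable, because types loaded by different classloaders are not assignment-compatible, which is likely where the ClassCastExceptions come from.

{code}
import java.net.{URL, URLClassLoader}

object IsolatedContexts {
  // Load the job (and the Spark classes it uses) from an isolated classloader so
  // Spark's static/singleton state is duplicated per loader rather than shared.
  def runInIsolation(assemblyJar: String, jobClass: String): Unit = {
    val loader = new URLClassLoader(Array(new URL(assemblyJar)), null) // null parent = isolated
    val job = loader.loadClass(jobClass).newInstance().asInstanceOf[Runnable] // assumes the job implements Runnable
    val previous = Thread.currentThread().getContextClassLoader
    Thread.currentThread().setContextClassLoader(loader)
    try job.run()
    finally Thread.currentThread().setContextClassLoader(previous)
  }
}

// Hypothetical usage, assuming a fat jar that contains both Spark and the job class:
// IsolatedContexts.runInIsolation("file:/path/to/spark-assembly.jar", "com.example.MyJob")
{code}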

 Support multiple SparkContexts in the same JVM
 --

 Key: SPARK-2243
 URL: https://issues.apache.org/jira/browse/SPARK-2243
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager, Spark Core
Affects Versions: 0.7.0, 1.0.0, 1.1.0
Reporter: Miguel Angel Fernandez Diaz

 We're developing a platform where we create several Spark contexts for 
 carrying out different calculations. Is there any restriction when using 
 several Spark contexts? We have two contexts, one for Spark calculations and 
 another one for Spark Streaming jobs. The next error arises when we first 
 execute a Spark calculation and, once the execution is finished, a Spark 
 Streaming job is launched:
 {code}
 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
   at 
 org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
   at 
 java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
   at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
 java.io.FileNotFoundException
 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 

[jira] [Commented] (SPARK-6287) Add support for dynamic allocation in the Mesos coarse-grained scheduler

2015-03-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357262#comment-14357262
 ] 

Apache Spark commented on SPARK-6287:
-

User 'dragos' has created a pull request for this issue:
https://github.com/apache/spark/pull/4984

 Add support for dynamic allocation in the Mesos coarse-grained scheduler
 

 Key: SPARK-6287
 URL: https://issues.apache.org/jira/browse/SPARK-6287
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Reporter: Iulian Dragos

 Add support inside the coarse-grained Mesos scheduler for dynamic allocation. 
 It amounts to implementing two methods that allow scaling up and down the 
 number of executors:
 {code}
 def doKillExecutors(executorIds: Seq[String])
 def doRequestTotalExecutors(requestedTotal: Int)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-03-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357335#comment-14357335
 ] 

Joseph K. Bradley commented on SPARK-5992:
--

I'm not sure what the most important or useful ones would be.  I've heard the 
most about random projections.  Are you familiar with the others?  Perhaps we 
can make a list of algorithms and their properties to try to get a reasonable 
coverage of use cases.
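
To anchor the discussion, here is a tiny sketch of the sign-random-projection (cosine) family, which is the one I have heard about most; plain Scala, not an API proposal:

{code}
import scala.util.Random

// Sign-random-projection LSH: each bit of the signature is the sign of a dot product
// with a random Gaussian vector. Vectors with high cosine similarity tend to share
// signatures. Requires numBits <= 31 for the Int signature used here.
class RandomProjectionLSH(dim: Int, numBits: Int, seed: Long = 42L) {
  private val rng = new Random(seed)
  private val planes: Array[Array[Double]] =
    Array.fill(numBits)(Array.fill(dim)(rng.nextGaussian()))

  private def dot(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  // Points whose signatures match land in the same bucket and become candidate neighbors.
  def hash(x: Array[Double]): Int = {
    var sig = 0
    var b = 0
    while (b < numBits) {
      if (dot(planes(b), x) >= 0) sig |= (1 << b)
      b += 1
    }
    sig
  }
}
{code}

Other families (minhash for Jaccard, p-stable projections for Euclidean distance) would slot into the same bucket-then-verify pattern, so listing candidate algorithms with their distance measures seems like the right next step.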

 Locality Sensitive Hashing (LSH) for MLlib
 --

 Key: SPARK-5992
 URL: https://issues.apache.org/jira/browse/SPARK-5992
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley

 Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
 great to discuss some possible algorithms here, choose an API, and make a PR 
 for an initial algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6285) Duplicated code leads to errors

2015-03-11 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357297#comment-14357297
 ] 

Iulian Dragos commented on SPARK-6285:
--

According to the git commit message that introduced the duplicate:

{quote}
 To avoid potential merge conflicts, old testing code are not removed yet. The 
following classes can be safely removed after most Parquet related PRs are 
handled:

- `ParquetQuerySuite`
- `ParquetTestData`
{quote}

I mentioned the Eclipse build problem in passing, but I can expand: the class 
*is* a duplicate definition, so the Scala compiler is correct in refusing it. It 
only compiles in Sbt/Maven because src/main and src/test are compiled in 
separate compiler runs, and scalac does not notice the duplicate name when 
it comes from bytecode. Eclipse builds src/main and src/test together, and when 
both classes originate from sources, scalac issues an error message.


 Duplicated code leads to errors
 ---

 Key: SPARK-6285
 URL: https://issues.apache.org/jira/browse/SPARK-6285
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Iulian Dragos

 The following class is duplicated inside 
 [ParquetTestData|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTestData.scala#L39]
  and 
 [ParquetIOSuite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala#L44],
  with exact same code and fully qualified name:
 {code}
 org.apache.spark.sql.parquet.TestGroupWriteSupport
 {code}
 The second one was introduced in 
 [3b395e10|https://github.com/apache/spark/commit/3b395e10510782474789c9098084503f98ca4830],
  but even though it mentions that `ParquetTestData` should be removed later, 
 I couldn't find a corresponding Jira ticket.
 This duplicate class causes the Eclipse builder to fail (since src/main and 
 src/test are compiled together in Eclipse, unlike Sbt).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6285) Duplicated code leads to errors

2015-03-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357388#comment-14357388
 ] 

Sean Owen commented on SPARK-6285:
--

[~lian cheng] do you think we can now remove these two classes in order to 
resolve the issue?

 Duplicated code leads to errors
 ---

 Key: SPARK-6285
 URL: https://issues.apache.org/jira/browse/SPARK-6285
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Iulian Dragos

 The following class is duplicated inside 
 [ParquetTestData|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTestData.scala#L39]
  and 
 [ParquetIOSuite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala#L44],
  with exact same code and fully qualified name:
 {code}
 org.apache.spark.sql.parquet.TestGroupWriteSupport
 {code}
 The second one was introduced in 
 [3b395e10|https://github.com/apache/spark/commit/3b395e10510782474789c9098084503f98ca4830],
  but even though it mentions that `ParquetTestData` should be removed later, 
 I couldn't find a corresponding Jira ticket.
 This duplicate class causes the Eclipse builder to fail (since src/main and 
 src/test are compiled together in Eclipse, unlike Sbt).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357389#comment-14357389
 ] 

RJ Nowling commented on SPARK-2429:
---

[~josephkb]  I think it would be great to get the new implementation into Spark 
but we need a champion for it.  [~yuu.ishik...@gmail.com] did some great work, 
and I've been trying to shepherd the work but we need a committer who wants to 
bring it in.  If you want to do that, then I can step back and let you and 
[~yuu.ishik...@gmail.com] bring this across the finish line.

 Hierarchical Implementation of KMeans
 -

 Key: SPARK-2429
 URL: https://issues.apache.org/jira/browse/SPARK-2429
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
  Labels: clustering
 Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
 Result of Benchmarking a Hierarchical Clustering.pdf, 
 benchmark-result.2014-10-29.html, benchmark2.html


 Hierarchical clustering algorithms are widely used and would make a nice 
 addition to MLlib.  Clustering algorithms are useful for determining 
 relationships between clusters as well as offering faster assignment. 
 Discussion on the dev list suggested the following possible approaches:
 * Top down, recursive application of KMeans
 * Reuse DecisionTree implementation with different objective function
 * Hierarchical SVD
 It was also suggested that support for distance metrics other than Euclidean 
 such as negative dot or cosine are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357410#comment-14357410
 ] 

RJ Nowling commented on SPARK-2429:
---

I'm familiar with the community interest but I'm not terribly familiar with the 
implementations (old or new).  [~freeman-lab] may be the appropriate person to 
ask for help -- the original implementation was based on his gist.

 Hierarchical Implementation of KMeans
 -

 Key: SPARK-2429
 URL: https://issues.apache.org/jira/browse/SPARK-2429
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
  Labels: clustering
 Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
 Result of Benchmarking a Hierarchical Clustering.pdf, 
 benchmark-result.2014-10-29.html, benchmark2.html


 Hierarchical clustering algorithms are widely used and would make a nice 
 addition to MLlib.  Clustering algorithms are useful for determining 
 relationships between clusters as well as offering faster assignment. 
 Discussion on the dev list suggested the following possible approaches:
 * Top down, recursive application of KMeans
 * Reuse DecisionTree implementation with different objective function
 * Hierarchical SVD
 It was also suggested that support for distance metrics other than Euclidean 
 such as negative dot or cosine are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-5186) Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size

2015-03-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reopened SPARK-5186:
--

Re-opened this issue for branch-1.2.

 Vector.equals  and Vector.hashCode are very inefficient and fail on 
 SparseVectors with large size
 -

 Key: SPARK-5186
 URL: https://issues.apache.org/jira/browse/SPARK-5186
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Derrick Burns
Assignee: yuhao yang
 Fix For: 1.3.0

   Original Estimate: 0.25h
  Remaining Estimate: 0.25h

 The implementations of Vector.equals and Vector.hashCode are correct but slow 
 for SparseVectors that are truly sparse.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6282) Strange Python import error when using random() in a lambda function

2015-03-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357364#comment-14357364
 ] 

Joseph K. Bradley commented on SPARK-6282:
--

Do you know where _winreg appears in the code you're running?  Is it being 
brought in by the read_csv_file_as_list method or its containing package?

 Strange Python import error when using random() in a lambda function
 

 Key: SPARK-6282
 URL: https://issues.apache.org/jira/browse/SPARK-6282
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Kubuntu 14.04, Python 2.7.6
Reporter: Pavel Laskov
Priority: Minor

 Consider the exemplary Python code below:
 from random import random
 from pyspark.context import SparkContext
 from xval_mllib import read_csv_file_as_list

 if __name__ == "__main__":
     sc = SparkContext(appName="Random() bug test")
     data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv'))
     # data = sc.parallelize([1, 2, 3, 4, 5], 2)
     d = data.map(lambda x: (random(), x))
     print d.first()
 Data is read from a large CSV file. Running this code results in a Python 
 import error:
 ImportError: No module named _winreg
 If I use 'import random' and 'random.random()' in the lambda function no 
 error occurs. Also no error occurs, for both kinds of import statements, for 
 a small artificial data set like the one shown in a commented line.  
 The full error trace, the source code of csv reading code (function 
 'read_csv_file_as_list' is my own) as well as a sample dataset (the original 
 dataset is about 8M large) can be provided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6286) Handle TASK_ERROR in TaskState

2015-03-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357240#comment-14357240
 ] 

Sean Owen commented on SPARK-6286:
--

I remember looking at this -- it's a Mesos enum right? I wondered if it were a 
new-ish value, and handling it would somehow make the code incompatible with 
older versions of Mesos, so I didn't want to touch it. But that is not based on 
any real knowledge.

If you know what to do in this state and can confirm that it's not a value 
specific to only some supported Mesos versions, then I'd go for it and add a 
handler.
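
For illustration, the kind of handler being discussed might look like the sketch below. Treating TASK_ERROR as another terminal failure state is an assumption on my part, not a statement of what the fix should do; the enum values are the Mesos TaskState ones.

{code}
import org.apache.mesos.Protos.{TaskState => MesosState}

// Sketch only: classify Mesos task states so TASK_ERROR no longer falls through
// the match. Whether TASK_ERROR should be treated like TASK_FAILED/TASK_LOST is
// exactly the open question in this ticket.
def isTerminal(state: MesosState): Boolean = state match {
  case MesosState.TASK_FINISHED | MesosState.TASK_FAILED |
       MesosState.TASK_KILLED | MesosState.TASK_LOST |
       MesosState.TASK_ERROR => true
  case _ => false // TASK_STAGING, TASK_STARTING, TASK_RUNNING, and anything newer
}
{code}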

 Handle TASK_ERROR in TaskState
 --

 Key: SPARK-6286
 URL: https://issues.apache.org/jira/browse/SPARK-6286
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Iulian Dragos
Priority: Minor
  Labels: mesos

 Scala warning:
 {code}
 match may not be exhaustive. It would fail on the following input: TASK_ERROR
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods

2015-03-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357248#comment-14357248
 ] 

Joseph K. Bradley commented on SPARK-6189:
--

 Maybe it's actually OK to allow periods in field names since SQL does.  We 
could be like SQL, where periods are OK and users just need to make sure to 
quote the field name so that SQL doesn't think the period is indicating a 
subfield.  I haven't tried this yet with DataFrames to check its behavior.

 Pandas to DataFrame conversion should check field names for periods
 ---

 Key: SPARK-6189
 URL: https://issues.apache.org/jira/browse/SPARK-6189
 Project: Spark
  Issue Type: Improvement
  Components: DataFrame, SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 Issue I ran into:  I imported an R dataset in CSV format into a Pandas 
 DataFrame and then used toDF() to convert that into a Spark DataFrame.  The R 
 dataset had a column with a period in it (column GNP.deflator in the 
 longley dataset).  When I tried to select it using the Spark DataFrame DSL, 
 I could not because the DSL thought the period was selecting a field within 
 GNP.
 Also, since GNP is another field's name, it gives an error which could be 
 obscure to users, complaining:
 {code}
 org.apache.spark.sql.AnalysisException: GetField is not valid on fields of 
 type DoubleType;
 {code}
 We should either handle periods in column names or check during loading and 
 warn/fail gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6206) spark-ec2 script reporting SSL error?

2015-03-11 Thread Joe O (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe O closed SPARK-6206.

Resolution: Invalid

 spark-ec2 script reporting SSL error?
 -

 Key: SPARK-6206
 URL: https://issues.apache.org/jira/browse/SPARK-6206
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.2.0
Reporter: Joe O

 I have been using the spark-ec2 script for several months with no problems.
 Recently, when executing a script to launch a cluster I got the following 
 error:
 {code}
 [Errno 185090050] _ssl.c:344: error:0B084002:x509 certificate 
 routines:X509_load_cert_crl_file:system lib
 {code}
 Nothing launches, the script exits.
 I am not sure if something on my machine changed, if this is a problem with 
 EC2's certs, or if it is a problem with Python. 
 It occurs 100% of the time, and has been occurring over at least the last two 
 days. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6287) Add support for dynamic allocation in the Mesos coarse-grained scheduler

2015-03-11 Thread Iulian Dragos (JIRA)
Iulian Dragos created SPARK-6287:


 Summary: Add support for dynamic allocation in the Mesos 
coarse-grained scheduler
 Key: SPARK-6287
 URL: https://issues.apache.org/jira/browse/SPARK-6287
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Reporter: Iulian Dragos


Add support inside the coarse-grained Mesos scheduler for dynamic allocation. 
It amounts to implementing two methods that allow scaling up and down the 
number of executors:

{code}
def doKillExecutors(executorIds: Seq[String])
def doRequestTotalExecutors(requestedTotal: Int)
{code}
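
As a rough outline of the state such a backend needs (schematic only, not the actual pull request; offer handling and the Mesos driver calls are omitted, and the names are illustrative):

{code}
// Schematic sketch: track a target executor count and a pending-kill set, and let
// the offer/status-update paths consult them.
class DynamicAllocationState {
  private var targetExecutors: Int = 0
  private val pendingKills = scala.collection.mutable.HashSet.empty[String]

  def doRequestTotalExecutors(requestedTotal: Int): Boolean = synchronized {
    targetExecutors = requestedTotal // accept offers until this many executors are running
    true
  }

  def doKillExecutors(executorIds: Seq[String]): Boolean = synchronized {
    pendingKills ++= executorIds // kill the corresponding Mesos tasks on the next callback
    true
  }

  def shouldLaunchMore(runningExecutors: Int): Boolean = synchronized {
    runningExecutors < targetExecutors
  }
}
{code}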



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357380#comment-14357380
 ] 

Joseph K. Bradley commented on SPARK-2429:
--

But as far as the old vs. new algorithm, it sounds like it would make sense to 
replace the old one with the new one for this PR (though I have not yet had a 
chance to compare and understand them in detail).

 Hierarchical Implementation of KMeans
 -

 Key: SPARK-2429
 URL: https://issues.apache.org/jira/browse/SPARK-2429
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
  Labels: clustering
 Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
 Result of Benchmarking a Hierarchical Clustering.pdf, 
 benchmark-result.2014-10-29.html, benchmark2.html


 Hierarchical clustering algorithms are widely used and would make a nice 
 addition to MLlib.  Clustering algorithms are useful for determining 
 relationships between clusters as well as offering faster assignment. 
 Discussion on the dev list suggested the following possible approaches:
 * Top down, recursive application of KMeans
 * Reuse DecisionTree implementation with different objective function
 * Hierarchical SVD
 It was also suggested that support for distance metrics other than Euclidean 
 such as negative dot or cosine are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357377#comment-14357377
 ] 

Joseph K. Bradley commented on SPARK-2429:
--

I don't think we should discourage contributions to the current MLlib package.  
It's true we're experimenting with spark.ml and figuring out a better long-term 
API, but the main work in JIRAs/PRs like this one is in designing and 
implementing the algorithm itself, which we can easily copy over to the new API 
when the time comes.  I'd vote for trying to get this into spark.mllib and then 
wrapping it for spark.ml when ready.  (It'd be fine to do a Spark package too, 
but I hope this can get into MLlib itself.)

 Hierarchical Implementation of KMeans
 -

 Key: SPARK-2429
 URL: https://issues.apache.org/jira/browse/SPARK-2429
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
  Labels: clustering
 Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
 Result of Benchmarking a Hierarchical Clustering.pdf, 
 benchmark-result.2014-10-29.html, benchmark2.html


 Hierarchical clustering algorithms are widely used and would make a nice 
 addition to MLlib.  Clustering algorithms are useful for determining 
 relationships between clusters as well as offering faster assignment. 
 Discussion on the dev list suggested the following possible approaches:
 * Top down, recursive application of KMeans
 * Reuse DecisionTree implementation with different objective function
 * Hierarchical SVD
 It was also suggested that support for distance metrics other than Euclidean 
 such as negative dot or cosine are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6290) spark.ml.param.Params.checkInputColumn bug upon error

2015-03-11 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6290:


 Summary: spark.ml.param.Params.checkInputColumn bug upon error
 Key: SPARK-6290
 URL: https://issues.apache.org/jira/browse/SPARK-6290
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor


In checkInputColumn, if data types do not match, it tries to print an error 
message with this in it:
{code}
Column param description: ${getParam(colName)}
{code}
However, getParam cannot be called on the string colName; it needs the 
parameter name, which this method is not given.  This causes a confusing 
secondary error that users may find hard to understand.
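
One way out (purely illustrative, not the committed fix; the signature is hypothetical) is to build the message only from information the method actually has, so no secondary lookup can fail:

{code}
import org.apache.spark.sql.types.{DataType, StructType}

// Sketch: validate an input column using the schema, the column name and the expected
// type. No Param lookup is attempted, so the real type mismatch is what gets reported.
def checkInputColumn(schema: StructType, colName: String, expectedType: DataType): Unit = {
  val field = schema.fields.find(_.name == colName).getOrElse(
    throw new IllegalArgumentException(s"Input column $colName does not exist."))
  require(field.dataType == expectedType,
    s"Input column $colName must be of type $expectedType but was ${field.dataType}.")
}
{code}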



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6297) EventLog permissions are always set to 770 which causes problems

2015-03-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357905#comment-14357905
 ] 

Apache Spark commented on SPARK-6297:
-

User 'lustefaniak' has created a pull request for this issue:
https://github.com/apache/spark/pull/4989

 EventLog permissions are always set to 770 which causes problems
 

 Key: SPARK-6297
 URL: https://issues.apache.org/jira/browse/SPARK-6297
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: All, tested in lxc running different users without a 
 common group
Reporter: Lukas Stefaniak
Priority: Trivial
  Labels: newbie

 In EventLoggingListener, event log file permissions are always set explicitly 
 to 770. There is no way to override this.
 The problem appears as an exception being thrown when the driver process and 
 the Spark master don't share the same user or group.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6229) Support encryption in network/common module

2015-03-11 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357838#comment-14357838
 ] 

Marcelo Vanzin commented on SPARK-6229:
---

/cc [~rxin] [~adav]

Hey Reynold, Aaron,

I started looking at this and it seems like there's no easy way to insert SASL 
encryption into the current channel pipeline due to the way the code is 
organized.

On the client side, {{TransportClientFactory.createClient}} seems like the best 
candidate, but it would have to be changed so that {{SaslClientBootstrap}} can 
add a handler to the channel pipeline to do the encryption.

On the server side, it seems more complicated. I couldn't find a place where 
the current SASL code has access to the channel, so it can't really set up a 
handler at the same layer as the client code. At first I thought I could just 
encrypt things in {{SaslRpcHandler}}, but that looks fishy since that handler 
is before all the framing and serialization handlers, and we kinda want to 
encrypt after those.

So I'd like to suggest a bigger change that I think would make the code easier 
to use and at the same time allow this change: internalize all SASL handling in 
the network library. 

Basically, let the network library decide whether to add SASL to the picture by 
looking at the TransportConf; code using the library wouldn't need to care 
about it anymore (aside from providing a TransportConf and a SecretKeyHolder).

So, instead of code like this (from StandaloneWorkerShuffleService):

{code}
  private val transportContext: TransportContext = {
    val handler =
      if (useSasl) new SaslRpcHandler(blockHandler, securityManager)
      else blockHandler
    new TransportContext(transportConf, handler)
  }
{code}

You'd have just:

{code}
  private val transportContext: TransportContext =
    new TransportContext(transportConf, securityManager, blockHandler)
{code}

And similarly in other places. What do you guys think?

BTW here's some code that does this and is totally transparent to the code 
creating the RPC server / client:

https://github.com/apache/hive/blob/trunk/spark-client/src/main/java/org/apache/hive/spark/client/rpc/SaslHandler.java

And the two implementations:
https://github.com/apache/hive/blob/trunk/spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java#L197
https://github.com/apache/hive/blob/trunk/spark-client/src/main/java/org/apache/hive/spark/client/rpc/Rpc.java#L390

That's just to give an idea of how it could be done internally in the network 
library, without the consumers of the library having to care about it.

But please let me know if I missed something obvious here.

 Support encryption in network/common module
 ---

 Key: SPARK-6229
 URL: https://issues.apache.org/jira/browse/SPARK-6229
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Marcelo Vanzin

 After SASL support has been added to network/common, supporting encryption 
 should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. 
 Since the latter requires a valid kerberos login to work (and so doesn't 
 really work with executors), encryption would require the use of DIGEST-MD5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6294) PySpark task may hang while call take() on in Java/Scala

2015-03-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357872#comment-14357872
 ] 

Apache Spark commented on SPARK-6294:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4987

 PySpark task may hang while call take() on in Java/Scala
 

 Key: SPARK-6294
 URL: https://issues.apache.org/jira/browse/SPARK-6294
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.0, 1.2.1
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical

 {code}
  rdd = sc.parallelize(range(120)).map(lambda x: str(x))
  rdd._jrdd.first()
 {code}
 There is the stacktrace while hanging:
 {code}
 Executor task launch worker-5 daemon prio=10 tid=0x7f8fd01a9800 
 nid=0x566 in Object.wait() [0x7f90481d7000]
java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on 0x000630929340 (a 
 org.apache.spark.api.python.PythonRDD$WriterThread)
   at java.lang.Thread.join(Thread.java:1281)
   - locked 0x000630929340 (a 
 org.apache.spark.api.python.PythonRDD$WriterThread)
   at java.lang.Thread.join(Thread.java:1355)
   at 
 org.apache.spark.api.python.PythonRDD$$anonfun$compute$1.apply(PythonRDD.scala:78)
   at 
 org.apache.spark.api.python.PythonRDD$$anonfun$compute$1.apply(PythonRDD.scala:76)
   at 
 org.apache.spark.TaskContextImpl$$anon$1.onTaskCompletion(TaskContextImpl.scala:49)
   at 
 org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68)
   at 
 org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66)
   at org.apache.spark.scheduler.Task.run(Task.scala:58)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6296) Add equals operator to Column (v1.3)

2015-03-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357894#comment-14357894
 ] 

Apache Spark commented on SPARK-6296:
-

User 'vlyubin' has created a pull request for this issue:
https://github.com/apache/spark/pull/4988

 Add equals operator to Column (v1.3)
 

 Key: SPARK-6296
 URL: https://issues.apache.org/jira/browse/SPARK-6296
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Volodymyr Lyubinets
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods

2015-03-11 Thread mgdadv (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357610#comment-14357610
 ] 

mgdadv edited comment on SPARK-6189 at 3/11/15 10:20 PM:
-

While the dot is legal in R and SQL, I don't think there is a nice way of 
making it
legal in python. So at least in the Spark python code, I think something should
be done about it.

I just realized that the automatic renaming can cause problems if that entry
already exists.  For example, what if GNP_deflator was already in the data set
and then GNP.deflator gets changed.

I think the best thing to do is to just warn the user by printing out a warning
message. I have changed the patch accordingly.

Here is some example code for pyspark:

import StringIO
import pandas as pd
df = pd.read_csv(StringIO.StringIO("a.b,a,c\n101,102,103\n201,202,203"))
spdf = sqlCtx.createDataFrame(df)
spdf.take(2)
spdf[spdf.a==102].take(2)

So far this works, but this fails:
spdf[spdf.a.b==101].take(2)

In pandas df.a.b doesn't work either, but the fields can be accessed via the 
string "a.b", i.e.:
df["a.b"]



was (Author: mgdadv):
While the dot is legal in R and SQL, I don't think there is a nice way of 
making it
legal in python. So at least in the Spark python code, I think something should
be done about it.

I just realized that the automatic renaming can cause problems if that entry
already exists.  For example, what if GNP_deflator was already in the data set
and then GNP.deflator gets changed.

I think the best thing to do is to just warn the user by printing out a warning
message. I have changed the patch accordingly.

Here is some example code for pyspark:

import pandas as pd
df = pd.read_csv(StringIO.StringIO("a.b,a,c\n101,102,103\n201,202,203"))
spdf = sqlCtx.createDataFrame(df)
spdf.take(2)
spdf[spdf.a==102].take(2)

So far this works, but this fails:
spdf[spdf.a.b==101].take(2)

In pandas df.a.b doesn't work either, but the fields can be accessed via the 
string "a.b", i.e.:
df["a.b"]


 Pandas to DataFrame conversion should check field names for periods
 ---

 Key: SPARK-6189
 URL: https://issues.apache.org/jira/browse/SPARK-6189
 Project: Spark
  Issue Type: Improvement
  Components: DataFrame, SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 Issue I ran into:  I imported an R dataset in CSV format into a Pandas 
 DataFrame and then used toDF() to convert that into a Spark DataFrame.  The R 
 dataset had a column with a period in it (column GNP.deflator in the 
 longley dataset).  When I tried to select it using the Spark DataFrame DSL, 
 I could not because the DSL thought the period was selecting a field within 
 GNP.
 Also, since GNP is another field's name, it gives an error which could be 
 obscure to users, complaining:
 {code}
 org.apache.spark.sql.AnalysisException: GetField is not valid on fields of 
 type DoubleType;
 {code}
 We should either handle periods in column names or check during loading and 
 warn/fail gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods

2015-03-11 Thread mgdadv (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357713#comment-14357713
 ] 

mgdadv commented on SPARK-6189:
---

To expand on the example, this works:
spdf[spdf["a"]==102].take(2) 

while this fails:
spdf[spdf["a.b"]==101].take(2)

 Pandas to DataFrame conversion should check field names for periods
 ---

 Key: SPARK-6189
 URL: https://issues.apache.org/jira/browse/SPARK-6189
 Project: Spark
  Issue Type: Improvement
  Components: DataFrame, SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 Issue I ran into:  I imported an R dataset in CSV format into a Pandas 
 DataFrame and then used toDF() to convert that into a Spark DataFrame.  The R 
 dataset had a column with a period in it (column GNP.deflator in the 
 longley dataset).  When I tried to select it using the Spark DataFrame DSL, 
 I could not because the DSL thought the period was selecting a field within 
 GNP.
 Also, since GNP is another field's name, it gives an error which could be 
 obscure to users, complaining:
 {code}
 org.apache.spark.sql.AnalysisException: GetField is not valid on fields of 
 type DoubleType;
 {code}
 We should either handle periods in column names or check during loading and 
 warn/fail gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6293) SQLContext.implicits should provide automatic conversion for RDD[Row]

2015-03-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6293:
-
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-6116

 SQLContext.implicits should provide automatic conversion for RDD[Row]
 -

 Key: SPARK-6293
 URL: https://issues.apache.org/jira/browse/SPARK-6293
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 When a DataFrame is converted to an RDD[Row], it should be easier to convert 
 it back to a DataFrame via toDF.  E.g.:
 {code}
 val df: DataFrame = myRDD.toDF("col1", "col2")  // This works for types like 
 RDD[scala.Tuple2[...]]
 val splits = df.rdd.randomSplit(...)
 val split0: RDD[Row] = splits(0)
 val df0 = split0.toDF("col1", "col2") // This fails
 {code}
 The failure happens because SQLContext.implicits does not provide an 
 automatic conversion for Rows.  (It does handle Products, but Row does not 
 implement Product.)
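
Until such an implicit exists, a workaround sketch (assuming a sqlContext is in scope and the original schema is still at hand) is to go through createDataFrame:

{code}
// Rebuild the DataFrame from an RDD[Row] plus the schema captured before dropping
// down to the RDD API.
val schema = df.schema
val splits = df.rdd.randomSplit(Array(0.8, 0.2))
val df0 = sqlContext.createDataFrame(splits(0), schema)
{code}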



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6295) spark.ml.Evaluator should have evaluate method not taking ParamMap

2015-03-11 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6295:


 Summary: spark.ml.Evaluator should have evaluate method not taking 
ParamMap
 Key: SPARK-6295
 URL: https://issues.apache.org/jira/browse/SPARK-6295
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor


spark.ml.Evaluator requires that the user pass a ParamMap, but it is not always 
necessary.  It should have a default implementation with no ParamMap (similar 
to fit() and transform() in Estimator and Transformer).
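
A sketch of the intended shape (a hypothetical trait, not the existing spark.ml class hierarchy): keep the ParamMap-taking method abstract and add a convenience overload that passes an empty map.

{code}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataFrame

trait EvaluatorLike {
  def evaluate(dataset: DataFrame, paramMap: ParamMap): Double

  // Convenience overload, mirroring the no-ParamMap variants of fit()/transform().
  def evaluate(dataset: DataFrame): Double = evaluate(dataset, ParamMap.empty)
}
{code}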



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6296) Add equals operator to Column (v1.3)

2015-03-11 Thread Volodymyr Lyubinets (JIRA)
Volodymyr Lyubinets created SPARK-6296:
--

 Summary: Add equals operator to Column (v1.3)
 Key: SPARK-6296
 URL: https://issues.apache.org/jira/browse/SPARK-6296
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Volodymyr Lyubinets
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6294) PySpark task may hang while call take() on in Java/Scala

2015-03-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6294:
-
Target Version/s: 1.2.2, 1.4.0, 1.3.1  (was: 1.2.2, 1.3.1)

 PySpark task may hang while call take() on in Java/Scala
 

 Key: SPARK-6294
 URL: https://issues.apache.org/jira/browse/SPARK-6294
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.0, 1.2.1
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical

 {code}
  rdd = sc.parallelize(range(120)).map(lambda x: str(x))
  rdd._jrdd.first()
 {code}
 There is the stacktrace while hanging:
 {code}
 Executor task launch worker-5 daemon prio=10 tid=0x7f8fd01a9800 
 nid=0x566 in Object.wait() [0x7f90481d7000]
java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on 0x000630929340 (a 
 org.apache.spark.api.python.PythonRDD$WriterThread)
   at java.lang.Thread.join(Thread.java:1281)
   - locked 0x000630929340 (a 
 org.apache.spark.api.python.PythonRDD$WriterThread)
   at java.lang.Thread.join(Thread.java:1355)
   at 
 org.apache.spark.api.python.PythonRDD$$anonfun$compute$1.apply(PythonRDD.scala:78)
   at 
 org.apache.spark.api.python.PythonRDD$$anonfun$compute$1.apply(PythonRDD.scala:76)
   at 
 org.apache.spark.TaskContextImpl$$anon$1.onTaskCompletion(TaskContextImpl.scala:49)
   at 
 org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68)
   at 
 org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66)
   at org.apache.spark.scheduler.Task.run(Task.scala:58)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356399#comment-14356399
 ] 

Yu Ishikawa commented on SPARK-2429:


[~rnowling]

I apologize for the delay in replying. I'm still working on this.
I will reply to the feedback from Freeman and Owen ASAP.

By the way, I have a question about the future course of action.
I implemented another algorithm which is more scalable and 1000 times faster 
than the current one. 
Should we continue with the existing PR, or replace it with the new implementation?

https://github.com/yu-iskw/more-scalable-hierarchical-clustering-with-spark

It is difficult to run the current one with a large number of clusters, such 
as 1, because the dividing processes are executed one-by-one. The new one 
divides clusters in parallel, which is why it is more scalable and much faster 
than the current one.
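
Schematically, the difference is between splitting clusters one at a time and splitting every cluster at a level in the same pass; a plain-Scala sketch, with the actual 2-means split abstracted away behind a function:

{code}
// Level-wise divisive clustering skeleton: all clusters at the current level are
// split in one (parallel) pass instead of being divided one-by-one.
def divideLevelwise[P](initial: Seq[Seq[P]],
                       split: Seq[P] => (Seq[P], Seq[P]),
                       levels: Int): Seq[Seq[P]] =
  (1 to levels).foldLeft(initial) { (clusters, _) =>
    clusters.par.flatMap { c =>
      if (c.size < 2) Seq(c)
      else { val (left, right) = split(c); Seq(left, right) }
    }.seq
  }
{code}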

thanks

 Hierarchical Implementation of KMeans
 -

 Key: SPARK-2429
 URL: https://issues.apache.org/jira/browse/SPARK-2429
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
  Labels: clustering
 Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
 Result of Benchmarking a Hierarchical Clustering.pdf, 
 benchmark-result.2014-10-29.html, benchmark2.html


 Hierarchical clustering algorithms are widely used and would make a nice 
 addition to MLlib.  Clustering algorithms are useful for determining 
 relationships between clusters as well as offering faster assignment. 
 Discussion on the dev list suggested the following possible approaches:
 * Top down, recursive application of KMeans
 * Reuse DecisionTree implementation with different objective function
 * Hierarchical SVD
 It was also suggested that support for distance metrics other than Euclidean, 
 such as negative dot product or cosine, is necessary.
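
As a point of reference, here is a rough, hedged sketch of the "top down, recursive application of KMeans" idea using only the existing MLlib API; the function name, the fixed k = 2 split, and the depth cutoff are illustrative assumptions, not the API proposed in the attached PR:
{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Recursively split the data into two clusters until the requested depth is
// reached, yielding up to 2^depth leaf clusters.
def bisect(data: RDD[Vector], depth: Int): Seq[RDD[Vector]] = {
  if (depth == 0 || data.count() < 2) {
    Seq(data)
  } else {
    val model = KMeans.train(data, 2, 20) // k = 2, 20 iterations per split
    val left  = data.filter(v => model.predict(v) == 0)
    val right = data.filter(v => model.predict(v) == 1)
    bisect(left, depth - 1) ++ bisect(right, depth - 1)
  }
}
{code}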



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark

2015-03-11 Thread Meethu Mathew (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356428#comment-14356428
 ] 

Meethu Mathew commented on SPARK-6227:
--

I am interested in working on this ticket. Could anyone assign it to me?

 PCA and SVD for PySpark
 ---

 Key: SPARK-6227
 URL: https://issues.apache.org/jira/browse/SPARK-6227
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 1.2.1
Reporter: Julien Amelot
Priority: Minor

 The Dimensionality Reduction techniques are not available via Python (Scala + 
 Java only).
 * Principal component analysis (PCA)
 * Singular value decomposition (SVD)
 Doc:
 http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html
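
For context, a minimal sketch of the existing Scala API that this ticket asks to expose in PySpark; the vectors are made-up sample data and a live SparkContext `sc` is assumed:
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)))
val mat = new RowMatrix(rows)

// Principal component analysis: top 2 components, then project the rows.
val pc = mat.computePrincipalComponents(2)
val projected = mat.multiply(pc)

// Singular value decomposition: truncated SVD with k = 2, also computing U.
val svd = mat.computeSVD(2, computeU = true)
println(svd.s)
{code}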



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6278) Mention the change of step size in the migration guide

2015-03-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356447#comment-14356447
 ] 

Apache Spark commented on SPARK-6278:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/4978

 Mention the change of step size in the migration guide
 --

 Key: SPARK-6278
 URL: https://issues.apache.org/jira/browse/SPARK-6278
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.3.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 We changed the objective from 1/n ..  to 1/(2n) ... in 1.3, so using the same 
 step size will lead to different results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6277) Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf

2015-03-11 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356478#comment-14356478
 ] 

Jianshi Huang commented on SPARK-6277:
--

I see. Not tying this to Hadoop config is fine; how about env variables? I quite 
often want to change one setting for particular tasks, and editing 
spark-defaults.conf every time is inconvenient. Env variables are a good fit here 
because of their dynamic scope.

Typesafe's Config has a similar feature for env variables, and it can even 
allow them to override previous settings.

  
https://github.com/typesafehub/config#optional-system-or-env-variable-overrides

Jianshi
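
For illustration, a small hedged sketch of the Typesafe Config behaviour referenced above; the key and path are placeholders, and the exact resolution semantics depend on the Config version (HOCON substitutions fall back to environment variables when the path is not defined in the config itself):
{code}
import com.typesafe.config.ConfigFactory

// ${USER} below is not defined in the config, so resolve() falls back to the
// environment variable of the same name.
val conf = ConfigFactory.parseString(
  "spark.local.dir = /home/${USER}/spark/tmp"
).resolve()

println(conf.getString("spark.local.dir")) // e.g. /home/jianshi/spark/tmp
{code}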

 Allow Hadoop configurations and env variables to be referenced in 
 spark-defaults.conf
 -

 Key: SPARK-6277
 URL: https://issues.apache.org/jira/browse/SPARK-6277
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.3.0, 1.2.1
Reporter: Jianshi Huang

 I need to set spark.local.dir to use user local home instead of /tmp, but 
 currently spark-defaults.conf can only allow constant values.
 What I want to do is to write:
 bq. spark.local.dir /home/${user.name}/spark/tmp
 or
 bq. spark.local.dir /home/${USER}/spark/tmp
 Otherwise I would have to hack bin/spark-class and pass the option through 
 -Dspark.local.dir
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6281) Support incremental updates for Graph

2015-03-11 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-6281:
---

 Summary: Support incremental updates for Graph
 Key: SPARK-6281
 URL: https://issues.apache.org/jira/browse/SPARK-6281
 Project: Spark
  Issue Type: New Feature
  Components: GraphX
Reporter: Takeshi Yamamuro
Priority: Minor


Add an API to efficiently append new vertices and edges to an existing Graph,
e.g., Graph#append(newVerts: RDD[(VertexId, VD)], newEdges: RDD[Edge[ED]], 
defaultVertexAttr: VD).

This is useful for time-evolving graphs: new vertices and edges are built from
streaming data through Spark Streaming and then incrementally appended
to an existing graph.
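
A rough sketch of what such a method could look like on top of today's GraphX API is shown below; the name `append` and the rebuild-from-union approach are assumptions for illustration, not the optimized incremental implementation this issue asks for:
{code}
import scala.reflect.ClassTag

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Naive version: union the new vertices/edges with the old ones and rebuild.
// An efficient implementation would instead patch the existing partitions.
def append[VD: ClassTag, ED: ClassTag](
    graph: Graph[VD, ED],
    newVerts: RDD[(VertexId, VD)],
    newEdges: RDD[Edge[ED]],
    defaultVertexAttr: VD): Graph[VD, ED] = {
  Graph(
    graph.vertices.union(newVerts),
    graph.edges.union(newEdges),
    defaultVertexAttr)
}
{code}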



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6276) beeline client class cast exception on partitioned table Spark SQL

2015-03-11 Thread Sunil (JIRA)
Sunil created SPARK-6276:


 Summary: beeline client class cast exception on partitioned table 
Spark SQL
 Key: SPARK-6276
 URL: https://issues.apache.org/jira/browse/SPARK-6276
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Sunil


When I access a partitioned Hive table using the Spark Thrift Server and 
Beeline, the following error is thrown on the partitioned column:

java.lang.RuntimeException: java.lang.ClassCastException

at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:84)

at 
org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)

Results are shown if I remove the partitioned column from the select.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6278) Mention the change of step size in the migration guide

2015-03-11 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-6278:


 Summary: Mention the change of step size in the migration guide
 Key: SPARK-6278
 URL: https://issues.apache.org/jira/browse/SPARK-6278
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.3.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


We changed the objective from 1/n ..  to 1/(2n) ... in 1.3, so using the same 
step size will lead to different results.
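
As a hedged illustration of why the same step size gives different results (ignoring any regularization term, which is handled separately): scaling the objective by 1/2 also scales its gradient by 1/2, so a gradient step of size `step` against the old objective corresponds to a step of size `2 * step` against the new one.
{code}
// Toy numbers only, to make the scaling explicit.
val n = 100.0
val sumOfLossGradients = 10.0

val oldGradient = sumOfLossGradients / n         // objective (1/n)    * sum(loss)
val newGradient = sumOfLossGradients / (2.0 * n) // objective (1/(2n)) * sum(loss)

val step = 1.0
assert(step * oldGradient == (2.0 * step) * newGradient)
{code}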



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6022) GraphX `diff` test incorrectly operating on values (not VertexId's)

2015-03-11 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356452#comment-14356452
 ] 

Takeshi Yamamuro commented on SPARK-6022:
-

Yeah, ISTM it'd be better to add set difference as Graph#minus.

 GraphX `diff` test incorrectly operating on values (not VertexId's)
 ---

 Key: SPARK-6022
 URL: https://issues.apache.org/jira/browse/SPARK-6022
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Brennon York

 The current GraphX {{diff}} test operates on values rather than the 
 VertexId's and, if {{diff}} were working properly (per 
 [SPARK-4600|https://issues.apache.org/jira/browse/SPARK-4600]), it should 
 fail this test. The code to test {{diff}} should look like the below as it 
 correctly generates {{VertexRDD}}'s with different {{VertexId}}'s to {{diff}} 
 against.
 {code}
 test("diff functionality with small concrete values") {
   withSpark { sc =>
     val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 2L).map(id => (id, id.toInt)))
     // setA := Set((0L, 0), (1L, 1))
     val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(1L until 3L).map(id => (id, id.toInt + 2)))
     // setB := Set((1L, 3), (2L, 4))
     val diff = setA.diff(setB)
     assert(diff.collect.toSet == Set((2L, 4)))
   }
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6244) Implement VectorSpace to easy create a complicated feature vector

2015-03-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356464#comment-14356464
 ] 

Sean Owen commented on SPARK-6244:
--

You can reuse this JIRA but you would have to modify the title and description. 
You should close your PR and make a new one.
At best we are talking about adding a few utility methods to {{Vectors}}, but 
what is the use case for these? Yes, I can imagine some, but would it refactor 
some repeated usages in the code?

I don't think MLlib intends to contain a full suite of vector utility methods 
as that is what Breeze is used for. I'd rather not add utility code just for 
its own sake. But if there's a clear common use for these things it could make 
sense.

 Implement VectorSpace to easy create a complicated feature vector
 -

 Key: SPARK-6244
 URL: https://issues.apache.org/jira/browse/SPARK-6244
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Kirill A. Korinskiy
Priority: Minor

 VectorSpace is a wrapper that implements three operations:
  - concat -- concatenate all vectors into a single vector
  - sum -- sum of the vectors
  - scaled -- multiply each vector by a scalar
  
 Example of usage:
 ```
 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.linalg.VectorSpace
 // Create a new vector space with one dense vector.
 val vs = VectorSpace.create(Vectors.dense(1.0, 0.0, 3.0))
 // Add to the vector space a scaled copy of itself
 val vs2 = vs.add(vs.scaled(-1d))
 // concat the vectors from the vector space, result: Vectors.dense(1.0, 0.0, 3.0, 
 -1.0, 0.0, -3.0)
 val concat = vs2.concat
 // take the sum over the vector space, result: Vectors.dense(0.0, 0.0, 0.0)
 val sum = vs2.sum
 ```
 This wrapper is very useful when creating a complicated feature vector from 
 structured objects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6280) Remove Akka systemName from Spark

2015-03-11 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-6280:
---

 Summary: Remove Akka systemName from Spark
 Key: SPARK-6280
 URL: https://issues.apache.org/jira/browse/SPARK-6280
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Shixiong Zhu


`systemName` is an Akka concept. An RPC implementation does not need to support 
it.

We can hard-code the system name in Spark and hide it in the internal Akka RPC 
implementation.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6277) Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf

2015-03-11 Thread Jianshi Huang (JIRA)
Jianshi Huang created SPARK-6277:


 Summary: Allow Hadoop configurations and env variables to be 
referenced in spark-defaults.conf
 Key: SPARK-6277
 URL: https://issues.apache.org/jira/browse/SPARK-6277
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.2.1, 1.3.0
Reporter: Jianshi Huang


I need to set spark.local.dir to use user local home instead of /tmp, but 
currently spark-defaults.conf can only allow constant values.

What I want to do is to write:

bq. spark.local.dir /home/${user.name}/spark/tmp
or
bq. spark.local.dir /home/${USER}/spark/tmp

Otherwise I would have to hack bin/spark-class and pass the option through 
-Dspark.local.dir

Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6185) Delete repeated TOKEN. TOK_CREATEFUNCTION has existed at Line 84;

2015-03-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6185.
--
Resolution: Duplicate

Rather than making a new JIRA, you could have expanded the scope of this one by 
changing its title and description. At least please close the old one if you 
open a new one for a different take on the same change.

   Delete repeated TOKEN. TOK_CREATEFUNCTION has existed at Line 84;
 -

 Key: SPARK-6185
 URL: https://issues.apache.org/jira/browse/SPARK-6185
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.1
Reporter: DoingDone9
Priority: Trivial

 TOK_CREATEFUNCTION already exists at Line 84:
 Line 84   TOK_CREATEFUNCTION,
 Line 85   TOK_DROPFUNCTION,
 Line 106  TOK_CREATEFUNCTION,



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6279) Miss expressions flag s at logging string

2015-03-11 Thread zzc (JIRA)
zzc created SPARK-6279:
--

 Summary: Miss expressions flag s at logging string 
 Key: SPARK-6279
 URL: https://issues.apache.org/jira/browse/SPARK-6279
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: zzc
Priority: Minor


In KafkaRDD.scala, the string interpolation prefix `s` is missing from a logging string.
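
To illustrate the problem, a self-contained toy example (not the actual KafkaRDD code):
{code}
object InterpolationExample {
  case class Part(fromOffset: Long)

  def main(args: Array[String]): Unit = {
    val part = Part(fromOffset = 111L)
    // Without the `s` prefix the placeholder is emitted literally:
    println("Beginning offset ${part.fromOffset} is the same as ending offset")
    // With the `s` interpolator the value 111 is substituted:
    println(s"Beginning offset ${part.fromOffset} is the same as ending offset")
  }
}
{code}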



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6279) Miss expressions flag s at logging string

2015-03-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356472#comment-14356472
 ] 

Sean Owen commented on SPARK-6279:
--

(Don't bother making a JIRA for trivial issues, where the description of the 
problem is largely identical to the fix.)

 Miss expressions flag s at logging string 
 

 Key: SPARK-6279
 URL: https://issues.apache.org/jira/browse/SPARK-6279
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: zzc
Priority: Minor

 In KafkaRDD.scala, the string interpolation prefix `s` is missing from a logging 
 string. As a result the log prints the literal text `Beginning offset 
 ${part.fromOffset} is the same as ending offset ...` instead of the interpolated 
 value, e.g. `Beginning offset 111 is the same as ending offset ...`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4820) Spark build encounters File name too long on some encrypted filesystems

2015-03-11 Thread Theodore Vasiloudis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356688#comment-14356688
 ] 

Theodore Vasiloudis commented on SPARK-4820:


Can confirm for 1.2.1 on encfs as well; had to build in /tmp instead.

 Spark build encounters File name too long on some encrypted filesystems
 -

 Key: SPARK-4820
 URL: https://issues.apache.org/jira/browse/SPARK-4820
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Patrick Wendell

 This was reported by Luchesar Cekov on github along with a proposed fix. The 
 fix has some potential downstream issues (it will modify the classnames) so 
 until we understand better how many users are affected we aren't going to 
 merge it. However, I'd like to include the issue and workaround here. If you 
 encounter this issue please comment on the JIRA so we can assess the 
 frequency.
 The issue produces this error:
 {code}
 [error] == Expanded type of tree ==
 [error] 
 [error] ConstantType(value = Constant(Throwable))
 [error] 
 [error] uncaught exception during compilation: java.io.IOException
 [error] File name too long
 [error] two errors found
 {code}
 The workaround is in maven under the compile options add: 
 {code}
 +  <arg>-Xmax-classfile-name</arg>
 +  <arg>128</arg>
 {code}
 In SBT add:
 {code}
 +scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5499) iterative computing with 1000 iterations causes stage failure

2015-03-11 Thread Tien-Dung LE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356699#comment-14356699
 ] 

Tien-Dung LE commented on SPARK-5499:
-

Many thanks for your reply. I did the same as your suggestion and it worked.

 iterative computing with 1000 iterations causes stage failure
 -

 Key: SPARK-5499
 URL: https://issues.apache.org/jira/browse/SPARK-5499
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Tien-Dung LE

 I got an error org.apache.spark.SparkException: Job aborted due to stage 
 failure: Task serialization failed: java.lang.StackOverflowError when 
 executing an action with 1000 transformations.
 Here is a code snippet to reproduce the error:
 {code}
 import org.apache.spark.rdd.RDD

 var pair: RDD[(Long, Long)] = sc.parallelize(Array((1L, 2L)))
 var newPair: RDD[(Long, Long)] = null
 for (i <- 1 to 1000) {
   newPair = pair.map(_.swap)
   pair = newPair
 }
 println("Count = " + pair.count())
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib

2015-03-11 Thread Beniamino (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356741#comment-14356741
 ] 

Beniamino commented on SPARK-2344:
--

Hi,

yes, the computation of the next centers is done on the fly, avoiding storing 
the membership matrix. 

The algorithm already works; the only thing that might be added is the runs 
parameter, as in the K-Means implementation.

I've already done the Fukuyama-Sugeno validity index computation too.

Beniamino

 Add Fuzzy C-Means algorithm to MLlib
 

 Key: SPARK-2344
 URL: https://issues.apache.org/jira/browse/SPARK-2344
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Alex
Priority: Minor
  Labels: clustering
   Original Estimate: 1m
  Remaining Estimate: 1m

 I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib.
 FCM is very similar to K-Means, which is already implemented; they differ only 
 in the degree of membership each point has with each cluster
 (in FCM the membership is in the range [0..1], whereas in K-Means it is 0/1).
 As part of the implementation I would like to:
 - create a base class for K-Means and FCM
 - implement the membership relationship for each algorithm differently (in its own class)
 I'd like this to be assigned to me.
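
For reference, a minimal local sketch of the standard FCM membership update (plain Scala arrays, fuzzifier m = 2.0); the names and structure are illustrative only and unrelated to any attached PR:
{code}
object FcmSketch {
  // Squared Euclidean distance between two points.
  private def sqdist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Standard FCM membership: u_j = 1 / sum_k (d_j / d_k)^(2/(m-1)).
  // (The degenerate case d_j == 0 is omitted for brevity.)
  def memberships(point: Array[Double],
                  centers: Array[Array[Double]],
                  m: Double = 2.0): Array[Double] = {
    val dists = centers.map(c => math.sqrt(sqdist(point, c)))
    dists.map { dj =>
      1.0 / dists.map(dk => math.pow(dj / dk, 2.0 / (m - 1.0))).sum
    }
  }

  def main(args: Array[String]): Unit = {
    val centers = Array(Array(0.0, 0.0), Array(4.0, 4.0))
    // (1, 1) is closer to the first center, so its membership there is higher.
    println(memberships(Array(1.0, 1.0), centers).mkString(", "))
  }
}
{code}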



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5499) iterative computing with 1000 iterations causes stage failure

2015-03-11 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE closed SPARK-5499.
---

checkpoint mechanism can solve the issue.
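
For readers hitting the same StackOverflowError, a hedged sketch of the checkpoint-based workaround, assuming a live SparkContext `sc`; the checkpoint directory and the interval of 100 iterations are arbitrary choices for illustration:
{code}
import org.apache.spark.rdd.RDD

sc.setCheckpointDir("/tmp/spark-checkpoints")

var pair: RDD[(Long, Long)] = sc.parallelize(Array((1L, 2L)))
for (i <- 1 to 1000) {
  pair = pair.map(_.swap)
  if (i % 100 == 0) {
    pair.checkpoint() // truncate the lineage so task serialization stays small
    pair.count()      // force the checkpoint to materialize
  }
}
println("Count = " + pair.count())
{code}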

 iterative computing with 1000 iterations causes stage failure
 -

 Key: SPARK-5499
 URL: https://issues.apache.org/jira/browse/SPARK-5499
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Tien-Dung LE

 I got an error org.apache.spark.SparkException: Job aborted due to stage 
 failure: Task serialization failed: java.lang.StackOverflowError when 
 executing an action with 1000 transformations.
 Here is a code snippet to reproduce the error:
 {code}
 import org.apache.spark.rdd.RDD

 var pair: RDD[(Long, Long)] = sc.parallelize(Array((1L, 2L)))
 var newPair: RDD[(Long, Long)] = null
 for (i <- 1 to 1000) {
   newPair = pair.map(_.swap)
   pair = newPair
 }
 println("Count = " + pair.count())
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6067) Spark sql hive dynamic partitions job will fail if task fails

2015-03-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356745#comment-14356745
 ] 

Apache Spark commented on SPARK-6067:
-

User 'baishuo' has created a pull request for this issue:
https://github.com/apache/spark/pull/4980

 Spark sql hive dynamic partitions job will fail if task fails
 -

 Key: SPARK-6067
 URL: https://issues.apache.org/jira/browse/SPARK-6067
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Jason Hubbard
Priority: Minor
 Attachments: job.log


 When inserting into a Hive table from Spark SQL while using dynamic 
 partitioning, a failed task will continue to fail on retry and 
 eventually fail the job:
 /mytable/.hive-staging_hive_2015-02-27_11-53-19_573_222-3/-ext-1/partition=2015-02-04/part-1
  for client ip already exists
 The retried task may need to clean up the output location left behind by 
 the previously failed attempt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-03-11 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356507#comment-14356507
 ] 

Yu Ishikawa commented on SPARK-5992:


What kind of LSH algorithms should we support at first?

- Bit sampling for Hamming distance
- Min-wise independent permutations
- Nilsimsa Hash
- Random projection
- Min Hash

 Locality Sensitive Hashing (LSH) for MLlib
 --

 Key: SPARK-5992
 URL: https://issues.apache.org/jira/browse/SPARK-5992
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley

 Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
 great to discuss some possible algorithms here, choose an API, and make a PR 
 for an initial algorithm.
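
As a concrete starting point for the discussion, a minimal, self-contained sketch of one of the candidate families listed above (sign random projection); the class name and parameters are made up for illustration:
{code}
import scala.util.Random

// Each hash bit is the sign of the dot product with one random hyperplane, so
// vectors with a small angle between them tend to share many bits.
class RandomProjectionLSH(dim: Int, numBits: Int, seed: Long = 42L) extends Serializable {
  private val rng = new Random(seed)
  private val planes: Array[Array[Double]] =
    Array.fill(numBits)(Array.fill(dim)(rng.nextGaussian()))

  def signature(v: Array[Double]): Array[Int] =
    planes.map { w =>
      val dot = w.zip(v).map { case (wi, vi) => wi * vi }.sum
      if (dot >= 0) 1 else 0
    }
}

object RandomProjectionLSH {
  def main(args: Array[String]): Unit = {
    val lsh = new RandomProjectionLSH(dim = 3, numBits = 16)
    println(lsh.signature(Array(1.0, 0.0, 3.0)).mkString)
    println(lsh.signature(Array(1.1, 0.1, 2.9)).mkString) // similar vector, mostly the same bits
  }
}
{code}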



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6277) Allow Hadoop configurations and env variables to be referenced in spark-defaults.conf

2015-03-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356450#comment-14356450
 ] 

Sean Owen commented on SPARK-6277:
--

Although it sounds nice, in practice this raises a number of issues, like 
having to then parse Hadoop config to parse the core Spark config, dealing with 
escapes, etc. Suddenly Hadoop config becomes required. For this particular 
problem, I don't think per-user home directories should be used, though if you 
must, you've already identified how to do that.

 Allow Hadoop configurations and env variables to be referenced in 
 spark-defaults.conf
 -

 Key: SPARK-6277
 URL: https://issues.apache.org/jira/browse/SPARK-6277
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.3.0, 1.2.1
Reporter: Jianshi Huang

 I need to set spark.local.dir to use user local home instead of /tmp, but 
 currently spark-defaults.conf can only allow constant values.
 What I want to do is to write:
 bq. spark.local.dir /home/${user.name}/spark/tmp
 or
 bq. spark.local.dir /home/${USER}/spark/tmp
 Otherwise I would have to hack bin/spark-class and pass the option through 
 -Dspark.local.dir
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5814) Remove JBLAS from runtime dependencies

2015-03-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356451#comment-14356451
 ] 

Apache Spark commented on SPARK-5814:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/4699

 Remove JBLAS from runtime dependencies
 --

 Key: SPARK-5814
 URL: https://issues.apache.org/jira/browse/SPARK-5814
 Project: Spark
  Issue Type: Dependency upgrade
  Components: GraphX, MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 We are using mixed breeze/netlib-java and jblas code in MLlib. They take 
 different approaches to utilizing native libraries, and we should keep only one 
 of them. netlib-java has a clear separation between the Java implementation and 
 the native JNI libraries, while JBLAS packs statically linked binaries that 
 cause license issues (SPARK-5669). So we want to remove JBLAS from the Spark 
 runtime.
 One issue with this approach is that we have JBLAS' DoubleMatrix exposed (by 
 mistake) in SVDPlusPlus of GraphX. We should deprecate it and replace 
 `DoubleMatrix` by `Array[Double]`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4924) Factor out code to launch Spark applications into a separate library

2015-03-11 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4924.

   Resolution: Fixed
Fix Version/s: 1.4.0

Glad to finally have this in. Thanks for all the hard work [~vanzin]!

 Factor out code to launch Spark applications into a separate library
 

 Key: SPARK-4924
 URL: https://issues.apache.org/jira/browse/SPARK-4924
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
 Fix For: 1.4.0

 Attachments: spark-launcher.txt


 One of the questions we run into rather commonly is "how to start a Spark 
 application from my Java/Scala program?". There currently isn't a good answer 
 to that:
 - Instantiating SparkContext has limitations (e.g., you can only have one 
 active context at the moment, plus you lose the ability to submit apps in 
 cluster mode)
 - Calling SparkSubmit directly is doable but you lose a lot of the logic 
 handled by the shell scripts
 - Calling the shell script directly is doable,  but sort of ugly from an API 
 point of view.
 I think it would be nice to have a small library that handles that for users. 
 On top of that, this library could be used by Spark itself to replace a lot 
 of the code in the current shell scripts, which have a lot of duplication.
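
As far as I know, the launcher library that came out of this work exposes a small builder API; a hedged usage sketch (jar path, class name, and master URL are placeholders) could look roughly like this:
{code}
import org.apache.spark.launcher.SparkLauncher

// Launch an application from another JVM program instead of shelling out to
// spark-submit; launch() returns a java.lang.Process that can be monitored.
val process = new SparkLauncher()
  .setAppResource("/path/to/my-app.jar")
  .setMainClass("com.example.MyApp")
  .setMaster("local[*]")
  .launch()

val exitCode = process.waitFor()
println(s"Application finished with exit code $exitCode")
{code}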



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3438) Support for accessing secured HDFS in Standalone Mode

2015-03-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3438.
--
Resolution: Duplicate

 Support for accessing secured HDFS in Standalone Mode
 -

 Key: SPARK-3438
 URL: https://issues.apache.org/jira/browse/SPARK-3438
 Project: Spark
  Issue Type: New Feature
  Components: Deploy, Spark Core
Affects Versions: 1.0.2
Reporter: Zhanfeng Huo

 Access to secured HDFS is currently supported in YARN mode using YARN's built-in 
 security mechanism. In YARN mode, a user application is authenticated when it 
 is submitted; it then acquires delegation tokens and ships them (via 
 YARN) securely to the workers.
 In Standalone mode, it would be nice to support a mechanism for 
 accessing HDFS where we rely on a single shared secret to authenticate 
 communication in the standalone cluster.
 1. A company is running a standalone cluster.
 2. They are fine if all Spark jobs in the cluster share a global secret, i.e. 
 all Spark jobs can trust one another.
 3. They are able to provide a Hadoop login on the driver node via a keytab or 
 kinit. They want tokens from this login to be distributed to the executors to 
 allow access to secure HDFS.
 4. They also don't want to trust the network on the cluster. I.e. don't want 
 to allow someone to fetch HDFS tokens easily over a known protocol, without 
 authentication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6279) Miss expressions flag s at logging string

2015-03-11 Thread zzc (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zzc updated SPARK-6279:
---
Description: 
In KafkaRDD.scala, the string interpolation prefix `s` is missing from a logging string.

As a result the log prints the literal text `Beginning offset ${part.fromOffset} is the 
same as ending offset ...` instead of the interpolated value, e.g. `Beginning offset 111 
is the same as ending offset ...`.

  was:In KafkaRDD.scala, Miss expressions flag s at logging string


 Miss expressions flag s at logging string 
 

 Key: SPARK-6279
 URL: https://issues.apache.org/jira/browse/SPARK-6279
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: zzc
Priority: Minor

 In KafkaRDD.scala, the string interpolation prefix `s` is missing from a logging 
 string. As a result the log prints the literal text `Beginning offset 
 ${part.fromOffset} is the same as ending offset ...` instead of the interpolated 
 value, e.g. `Beginning offset 111 is the same as ending offset ...`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7

2015-03-11 Thread Masayoshi TSUZUKI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348309#comment-14348309
 ] 

Masayoshi TSUZUKI edited comment on SPARK-5389 at 3/11/15 8:18 AM:
---

The crashed program findstr.exe in the screenshot does not seem to be the one in 
the C:\Windows\System32 directory.
I'm not sure, but I think C:\Windows\System32\findstr.exe in Windows 7 identifies 
itself as the (QGREP) utility, not the (grep) utility.
(Although I don't know the exact English name since I'm not using the English 
version of Windows.)

[~yanakad], [~s@r@v@n@n], and [SPARK-6084] seem to be reporting similar 
problems.
Their workarounds suggest that the cause might be a polluted %PATH%.
The collision of find.exe is a well-known phenomenon on Windows, but, as on 
Linux, the order of %PATH% controls which program is called.
If you face a similar problem, you can execute the command 
{{-whereas find-}} {{where find}} to check whether the proper find.exe is 
used.

Would you mind attaching the result of these commands?
{quote}
  where find
  where findstr
  echo %PATH%
{quote}


was (Author: tsudukim):
The crashed program findstr.exe in the screenshot seems not to be the one in 
the C:\Windows\System32 directory.
I'm not sure but I think C:\Windows\System32\findstr.exe in Windows 7 shows 
(QGREP) utililty but not (grep) utility.
(Although I don't know the exact English name since I'm not using English 
version of Windows.)

[~yanakad], [~s@r@v@n@n], and [SPARK-6084] seem to be reporting the similar 
problems.
Their workarounds show that the cause might be the polluted %PATH%.
The collision of find.exe is well known phenomenon in Windows, but like 
Linux, the order of %PATH% can control which program is called.
If you face the similar problem, you can check by executing the command 
{{whereas find}} to check if the proper program find.exe is used.

Would you mind attaching the result of these commands?
{quote}
  where find
  where findstr
  echo %PATH%
{quote}

 spark-shell.cmd does not run from DOS Windows 7
 ---

 Key: SPARK-5389
 URL: https://issues.apache.org/jira/browse/SPARK-5389
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.0
 Environment: Windows 7
Reporter: Yana Kadiyska
 Attachments: SparkShell_Win7.JPG


 spark-shell.cmd crashes in the DOS prompt on Windows 7 but works fine under PowerShell. 
 spark-shell.cmd works fine for me in v1.1, so this is new in Spark 1.2.
 Marking as trivial since calling spark-shell2.cmd also works fine.
 Attaching a screenshot since the error isn't very useful:
 {code}
 spark-1.2.0-bin-cdh4>bin\spark-shell.cmd
 else was unexpected at this time.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6279) Miss expressions flag s at logging string

2015-03-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356473#comment-14356473
 ] 

Apache Spark commented on SPARK-6279:
-

User 'zzcclp' has created a pull request for this issue:
https://github.com/apache/spark/pull/4979

 Miss expressions flag s at logging string 
 

 Key: SPARK-6279
 URL: https://issues.apache.org/jira/browse/SPARK-6279
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: zzc
Priority: Minor

 In KafkaRDD.scala, the string interpolation prefix `s` is missing from a logging 
 string. As a result the log prints the literal text `Beginning offset 
 ${part.fromOffset} is the same as ending offset ...` instead of the interpolated 
 value, e.g. `Beginning offset 111 is the same as ending offset ...`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6225) Resolve most build warnings, 1.3.0 edition

2015-03-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6225.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4950
[https://github.com/apache/spark/pull/4950]

 Resolve most build warnings, 1.3.0 edition
 --

 Key: SPARK-6225
 URL: https://issues.apache.org/jira/browse/SPARK-6225
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, Spark Core, SQL, Streaming
Affects Versions: 1.3.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.4.0


 Post-1.3.0, I think it would be a good exercise to resolve a number of build 
 warnings that have accumulated recently.
 See for example efforts begun at
 https://github.com/apache/spark/pull/4948
 https://github.com/apache/spark/pull/4900



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6228) Provide SASL support in network/common module

2015-03-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6228.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4953
[https://github.com/apache/spark/pull/4953]

 Provide SASL support in network/common module
 -

 Key: SPARK-6228
 URL: https://issues.apache.org/jira/browse/SPARK-6228
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Marcelo Vanzin
 Fix For: 1.4.0


 Currently, there's support for SASL in network/shuffle, but not in 
 network/common. Moving the SASL code to network/common would enable other 
 applications using that code to also support secure authentication and, 
 later, encryption.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6228) Provide SASL support in network/common module

2015-03-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6228:
-
Priority: Minor  (was: Major)
Assignee: Marcelo Vanzin

 Provide SASL support in network/common module
 -

 Key: SPARK-6228
 URL: https://issues.apache.org/jira/browse/SPARK-6228
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
Priority: Minor
 Fix For: 1.4.0


 Currently, there's support for SASL in network/shuffle, but not in 
 network/common. Moving the SASL code to network/common would enable other 
 applications using that code to also support secure authentication and, 
 later, encryption.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4423) Improve foreach() documentation to avoid confusion between local- and cluster-mode behavior

2015-03-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4423.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4696
[https://github.com/apache/spark/pull/4696]

 Improve foreach() documentation to avoid confusion between local- and 
 cluster-mode behavior
 ---

 Key: SPARK-4423
 URL: https://issues.apache.org/jira/browse/SPARK-4423
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Josh Rosen
Assignee: Ilya Ganelin
 Fix For: 1.4.0


 {{foreach}} seems to be a common source of confusion for new users: in 
 {{local}} mode, {{foreach}} can be used to update local variables on the 
 driver, but programs that do this will not work properly when executed on 
 clusters, since the {{foreach}} will update per-executor variables (note that 
 this _will_ work correctly for accumulators, but not for other types of 
 mutable objects).
 Similarly, I've seen users become confused when {{.foreach(println)}} doesn't 
 print to the driver's standard output.
 At a minimum, we should improve the documentation to warn users against 
 unsafe uses of {{foreach}} that won't work properly when transitioning from 
 local mode to a real cluster.
 We might also consider changes to local mode so that its behavior more 
 closely matches the cluster modes; this will require some discussion, though, 
 since any change of behavior here would technically be a user-visible 
 backwards-incompatible change (I don't think that we made any explicit 
 guarantees about the current local-mode behavior, but someone might be 
 relying on the current implicit behavior).
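
A hedged example of the pitfall described above (assuming a live SparkContext `sc`; the accumulator variant uses the Spark 1.x API):
{code}
// Looks like it sums into `counter`, and may even appear to work in local mode,
// but on a cluster each executor mutates its own copy of the closure variable
// and the driver-side `counter` stays 0.
var counter = 0
sc.parallelize(1 to 100).foreach(x => counter += x)
println(counter)

// The accumulator-based version behaves the same way in local and cluster mode.
val acc = sc.accumulator(0)
sc.parallelize(1 to 100).foreach(x => acc += x)
println(acc.value) // 5050
{code}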



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


