[jira] [Commented] (SPARK-3147) Implement A/B testing

2015-02-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333086#comment-14333086
 ] 

Xiangrui Meng commented on SPARK-3147:
--

Done:)

> Implement A/B testing
> -
>
> Key: SPARK-3147
> URL: https://issues.apache.org/jira/browse/SPARK-3147
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, Streaming
>Reporter: Xiangrui Meng
>Assignee: Feynman Liang
>
> A/B testing is widely used to compare online models. We can implement A/B 
> testing in MLlib and integrate it with Spark Streaming. For example, we have 
> a PairDStream[String, Double], whose keys are model ids and values are 
> observations (click or not, or revenue associated with the event). With A/B 
> testing, we can tell whether one model is significantly better than another 
> at a certain time. There are some caveats. For example, we should avoid 
> multiple testing and support A/A testing as a sanity check.  
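To make the idea concrete, here is a minimal sketch of such a comparison: per-batch sufficient statistics per model id, plus a Welch's t-test on two fixed ids "A" and "B". Everything below (class names, the socket source, the choice of test) is illustrative and is not a proposed MLlib API.

{code}
// Illustrative sketch only; not a proposed MLlib interface.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object ABTestSketch {
  // Sufficient statistics for one model within one batch.
  case class Stats(n: Long, sum: Double, sumSq: Double) {
    def mean: Double = sum / n
    def variance: Double = (sumSq - n * mean * mean) / (n - 1)  // assumes n > 1
  }

  // Welch's t-statistic for two independent samples.
  def welchT(a: Stats, b: Stats): Double =
    (a.mean - b.mean) / math.sqrt(a.variance / a.n + b.variance / b.n)

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("ab-test-sketch"), Seconds(10))

    // Assumed input: "modelId,observation" lines, e.g. "A,1.0" for a click on model A.
    val observations: DStream[(String, Double)] =
      ssc.socketTextStream("localhost", 9999).map { line =>
        val Array(modelId, value) = line.split(",")
        (modelId, value.toDouble)
      }

    observations
      .mapValues(v => Stats(1L, v, v * v))
      .reduceByKey((x, y) => Stats(x.n + y.n, x.sum + y.sum, x.sumSq + y.sumSq))
      .foreachRDD { rdd =>
        val stats = rdd.collectAsMap()
        for (a <- stats.get("A"); b <- stats.get("B")) {
          // |t| > ~1.96 suggests a difference at roughly the 5% level (large-sample
          // approximation; multiple testing is deliberately ignored in this sketch).
          println(s"Welch t-statistic for A vs B: ${welchT(a, b)}")
        }
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}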






[jira] [Updated] (SPARK-3147) Implement A/B testing

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3147:
-
Assignee: Feynman Liang

> Implement A/B testing
> -
>
> Key: SPARK-3147
> URL: https://issues.apache.org/jira/browse/SPARK-3147
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, Streaming
>Reporter: Xiangrui Meng
>Assignee: Feynman Liang
>
> A/B testing is widely used to compare online models. We can implement A/B 
> testing in MLlib and integrate it with Spark Streaming. For example, we have 
> a PairDStream[String, Double], whose keys are model ids and values are 
> observations (click or not, or revenue associated with the event). With A/B 
> testing, we can tell whether one model is significantly better than another 
> at a certain time. There are some caveats. For example, we should avoid 
> multiple testing and support A/A testing as a sanity check.  






[jira] [Commented] (SPARK-4289) Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.

2015-02-23 Thread Alexander Bezzubov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333118#comment-14333118
 ] 

Alexander Bezzubov commented on SPARK-4289:
---

Could you please tell how exactly the ":silent" workaround looks in the 
spark-shell, in the context of the initial example?

> Creating an instance of Hadoop Job fails in the Spark shell when toString() 
> is called on the instance.
> --
>
> Key: SPARK-4289
> URL: https://issues.apache.org/jira/browse/SPARK-4289
> Project: Spark
>  Issue Type: Bug
>Reporter: Corey J. Nolet
>
> This one is easy to reproduce.
> val job = new Job(sc.hadoopConfiguration)
> I'm not sure what the solution would be off hand as it's happening when the 
> shell is calling toString() on the instance of Job. The problem is, because 
> of the failure, the instance is never actually assigned to the job val.
> java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
>   at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
>   at org.apache.hadoop.mapreduce.Job.toString(Job.java:452)
>   at 
> scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
>   at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
>   at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
>   at .(:10)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
>   at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
>   at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
>   at org.apache.spark.repl.Main$.main(Main.scala:31)
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)






[jira] [Commented] (SPARK-4289) Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.

2015-02-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333127#comment-14333127
 ] 

Sean Owen commented on SPARK-4289:
--

[~bzz] Just type {{:silent}} into the shell at the start.
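For concreteness, the suggested workaround would be entered like this at the start of a spark-shell session (an illustrative transcript; whether it actually suppresses the failing Job.toString() call is what the follow-up comments discuss):

{code}
scala> :silent
scala> val job = new org.apache.hadoop.mapreduce.Job(sc.hadoopConfiguration)
// with automatic result printing toggled off, the REPL should not call toString() on job
{code}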

> Creating an instance of Hadoop Job fails in the Spark shell when toString() 
> is called on the instance.
> --
>
> Key: SPARK-4289
> URL: https://issues.apache.org/jira/browse/SPARK-4289
> Project: Spark
>  Issue Type: Bug
>Reporter: Corey J. Nolet
>
> This one is easy to reproduce.
> val job = new Job(sc.hadoopConfiguration)
> I'm not sure what the solution would be off hand as it's happening when the 
> shell is calling toString() on the instance of Job. The problem is, because 
> of the failure, the instance is never actually assigned to the job val.
> java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
>   at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
>   at org.apache.hadoop.mapreduce.Job.toString(Job.java:452)
>   at 
> scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
>   at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
>   at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
>   at .(:10)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
>   at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
>   at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
>   at org.apache.spark.repl.Main$.main(Main.scala:31)
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)






[jira] [Commented] (SPARK-4289) Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.

2015-02-23 Thread Alexander Bezzubov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333142#comment-14333142
 ] 

Alexander Bezzubov commented on SPARK-4289:
---

[~sowen] Thanks, that's what I thought, but it results in exactly the same exception 
and does not mute any interpreter output on a freshly built Spark master with 
-Phadoop2.4.
Am I doing something wrong?

Looks like another workaround is not to assign it to any variable, as advised 
in [@user|http://markmail.org/message/x77s57w47homqn6x] 

> Creating an instance of Hadoop Job fails in the Spark shell when toString() 
> is called on the instance.
> --
>
> Key: SPARK-4289
> URL: https://issues.apache.org/jira/browse/SPARK-4289
> Project: Spark
>  Issue Type: Bug
>Reporter: Corey J. Nolet
>
> This one is easy to reproduce.
> val job = new Job(sc.hadoopConfiguration)
> I'm not sure what the solution would be off hand as it's happening when the 
> shell is calling toString() on the instance of Job. The problem is, because 
> of the failure, the instance is never actually assigned to the job val.
> java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
>   at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
>   at org.apache.hadoop.mapreduce.Job.toString(Job.java:452)
>   at 
> scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
>   at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
>   at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
>   at .(:10)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
>   at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
>   at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
>   at org.apache.spark.repl.Main$.main(Main.scala:31)
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)






[jira] [Comment Edited] (SPARK-4289) Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.

2015-02-23 Thread Alexander Bezzubov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333142#comment-14333142
 ] 

Alexander Bezzubov edited comment on SPARK-4289 at 2/23/15 9:26 AM:


[~sowen] Thanks, that's what I thought, but it results in exactly the same exception 
and does not mute any interpreter output on a freshly built Spark master with 
-Phadoop2.4.
Am I doing something wrong?

Looks like another workaround is not to assign it to any variable, as advised 
in [@user thread|http://markmail.org/message/x77s57w47homqn6x] 


was (Author: bzz):
[~sowen] Thanks, that's what I thought, but it results in exactly the same exception 
and does not mute any interpreter output on a freshly built Spark master with 
-Phadoop2.4.
Am I doing something wrong?

Looks like another workaround is not to assign it to any variable, as advised 
in [@user|http://markmail.org/message/x77s57w47homqn6x] 

> Creating an instance of Hadoop Job fails in the Spark shell when toString() 
> is called on the instance.
> --
>
> Key: SPARK-4289
> URL: https://issues.apache.org/jira/browse/SPARK-4289
> Project: Spark
>  Issue Type: Bug
>Reporter: Corey J. Nolet
>
> This one is easy to reproduce.
> val job = new Job(sc.hadoopConfiguration)
> I'm not sure what the solution would be off hand as it's happening when the 
> shell is calling toString() on the instance of Job. The problem is, because 
> of the failure, the instance is never actually assigned to the job val.
> java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
>   at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
>   at org.apache.hadoop.mapreduce.Job.toString(Job.java:452)
>   at 
> scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
>   at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
>   at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
>   at .(:10)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
>   at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
>   at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
>   at org.apache.spark.repl.Main$.main(Main.scala:31)
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)






[jira] [Comment Edited] (SPARK-4289) Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.

2015-02-23 Thread Alexander Bezzubov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333142#comment-14333142
 ] 

Alexander Bezzubov edited comment on SPARK-4289 at 2/23/15 9:39 AM:


[~sowen] Thanks, that's what I thought, but it results in exactly the same exception 
and does not mute any interpreter output on a freshly built Spark master with 
-Phadoop2.4.
Am I doing something wrong?

Uneducated guess: is that something to do with [SparkILoop . 
verbosity()|https://github.com/apache/spark/blob/16687651f05bde8ff2e2fcef100383168958bf7f/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L796]
 impl?

Looks like another workaround is not to assign it to any variable, as advised 
in [@user thread|http://markmail.org/message/x77s57w47homqn6x] 


was (Author: bzz):
[~sowen] Thanks, that's what I thought, but it results in exactly the same exception 
and does not mute any interpreter output on a freshly built Spark master with 
-Phadoop2.4.
Am I doing something wrong?

Looks like another workaround is not to assign it to any variable, as advised 
in [@user thread|http://markmail.org/message/x77s57w47homqn6x] 

> Creating an instance of Hadoop Job fails in the Spark shell when toString() 
> is called on the instance.
> --
>
> Key: SPARK-4289
> URL: https://issues.apache.org/jira/browse/SPARK-4289
> Project: Spark
>  Issue Type: Bug
>Reporter: Corey J. Nolet
>
> This one is easy to reproduce.
> val job = new Job(sc.hadoopConfiguration)
> I'm not sure what the solution would be off hand as it's happening when the 
> shell is calling toString() on the instance of Job. The problem is, because 
> of the failure, the instance is never actually assigned to the job val.
> java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
>   at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
>   at org.apache.hadoop.mapreduce.Job.toString(Job.java:452)
>   at 
> scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
>   at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
>   at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
>   at .(:10)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
>   at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
>   at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
>   at org.apache.spark.repl.Main$.main(Main.scala:31)
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)





[jira] [Comment Edited] (SPARK-4289) Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.

2015-02-23 Thread Alexander Bezzubov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333142#comment-14333142
 ] 

Alexander Bezzubov edited comment on SPARK-4289 at 2/23/15 9:39 AM:


[~sowen] Thanks, that's what I thought, but it results in exactly the same exception 
and does not mute any interpreter output on a freshly built Spark master with 
-Phadoop2.4.
Am I doing something wrong?

Uneducated guess: is that something to do with 
[SparkILoop.verbosity()|https://github.com/apache/spark/blob/16687651f05bde8ff2e2fcef100383168958bf7f/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L796]
 impl?

Looks like another workaround is not to assign it to any variable, as advised 
in [@user thread|http://markmail.org/message/x77s57w47homqn6x] 


was (Author: bzz):
[~sowen] Thanks, that's what I thought, but it results in exactly the same exception 
and does not mute any interpreter output on a freshly built Spark master with 
-Phadoop2.4.
Am I doing something wrong?

Uneducated guess: is that something to do with [SparkILoop . 
verbosity()|https://github.com/apache/spark/blob/16687651f05bde8ff2e2fcef100383168958bf7f/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L796]
 impl?

Looks like another workaround is not to assign it to any variable, as advised 
in [@user thread|http://markmail.org/message/x77s57w47homqn6x] 

> Creating an instance of Hadoop Job fails in the Spark shell when toString() 
> is called on the instance.
> --
>
> Key: SPARK-4289
> URL: https://issues.apache.org/jira/browse/SPARK-4289
> Project: Spark
>  Issue Type: Bug
>Reporter: Corey J. Nolet
>
> This one is easy to reproduce.
> val job = new Job(sc.hadoopConfiguration)
> I'm not sure what the solution would be off hand as it's happening when the 
> shell is calling toString() on the instance of Job. The problem is, because 
> of the failure, the instance is never actually assigned to the job val.
> java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
>   at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
>   at org.apache.hadoop.mapreduce.Job.toString(Job.java:452)
>   at 
> scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
>   at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
>   at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
>   at .(:10)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
>   at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
>   at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
>   at org.apache.spark.repl.Main$.main(Main.scala:31)
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:32

[jira] [Comment Edited] (SPARK-4289) Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.

2015-02-23 Thread Alexander Bezzubov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333142#comment-14333142
 ] 

Alexander Bezzubov edited comment on SPARK-4289 at 2/23/15 9:39 AM:


[~sowen] Thanks, that's what I thought, but it results in exactly the same exception 
and does not mute any interpreter output on a freshly built Spark master with 
-Phadoop2.4.
Am I doing something wrong?

Uneducated guess: is that something to do with 
[SparkILoop.verbosity()|https://github.com/apache/spark/blob/16687651f05bde8ff2e2fcef100383168958bf7f/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L796]
 implementation?

Looks like another workaround is not to assign it to any variable, as advised 
in [@user thread|http://markmail.org/message/x77s57w47homqn6x] 


was (Author: bzz):
[~sowen] Thanks, that's what I thought, but it results in exactly the same exception 
and does not mute any interpreter output on a freshly built Spark master with 
-Phadoop2.4.
Am I doing something wrong?

Uneducated guess: is that something to do with 
[SparkILoop.verbosity()|https://github.com/apache/spark/blob/16687651f05bde8ff2e2fcef100383168958bf7f/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L796]
 impl?

Looks like another workaround is not to assign it to any variable, as advised 
in [@user thread|http://markmail.org/message/x77s57w47homqn6x] 

> Creating an instance of Hadoop Job fails in the Spark shell when toString() 
> is called on the instance.
> --
>
> Key: SPARK-4289
> URL: https://issues.apache.org/jira/browse/SPARK-4289
> Project: Spark
>  Issue Type: Bug
>Reporter: Corey J. Nolet
>
> This one is easy to reproduce.
> val job = new Job(sc.hadoopConfiguration)
> I'm not sure what the solution would be off hand as it's happening when the 
> shell is calling toString() on the instance of Job. The problem is, because 
> of the failure, the instance is never actually assigned to the job val.
> java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
>   at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
>   at org.apache.hadoop.mapreduce.Job.toString(Job.java:452)
>   at 
> scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
>   at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
>   at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
>   at .(:10)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
>   at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
>   at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
>   at org.apache.spark.repl.Main$.main(Main.scala:31)
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.

[jira] [Resolved] (SPARK-5943) Update the API to remove several warns in building for Spark Streaming

2015-02-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5943.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4722
[https://github.com/apache/spark/pull/4722]

> Update the API to remove several warns in building for Spark Streaming
> --
>
> Key: SPARK-5943
> URL: https://issues.apache.org/jira/browse/SPARK-5943
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.0
>Reporter: Saisai Shao
>Priority: Minor
> Fix For: 1.3.0
>
>
> The old {{awaitTermination(timeout: Long)}} is deprecated and replaced by 
> {{awaitTerminationOrTimeout(timeout: Long): Boolean}} in version 1.3; this change 
> updates the related code to reduce the deprecation warnings emitted while compiling.
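A minimal before/after sketch of that migration; the StreamingContext setup and the 10-second timeout are illustrative, and the two method names are the ones quoted above.

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("example"), Seconds(1))

// Before (deprecated in 1.3): block for at most 10 seconds.
// ssc.awaitTermination(10000L)

// After: same timeout; the return value says whether the context stopped in time.
val stopped: Boolean = ssc.awaitTerminationOrTimeout(10000L)
{code}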






[jira] [Updated] (SPARK-5943) Update the API to remove several warns in building for Spark Streaming

2015-02-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5943:
-
Affects Version/s: (was: 1.3.0)
 Assignee: Saisai Shao

> Update the API to remove several warns in building for Spark Streaming
> --
>
> Key: SPARK-5943
> URL: https://issues.apache.org/jira/browse/SPARK-5943
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 1.3.0
>
>
> The old {{awaitTermination(timeout: Long)}} is deprecated and replaced by 
> {{awaitTerminationOrTimeout(timeout: Long): Boolean}} in version 1.3; this change 
> updates the related code to reduce the deprecation warnings emitted while compiling.






[jira] [Resolved] (SPARK-5724) misconfiguration in Akka system

2015-02-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5724.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4512
[https://github.com/apache/spark/pull/4512]

> misconfiguration in Akka system
> ---
>
> Key: SPARK-5724
> URL: https://issues.apache.org/jira/browse/SPARK-5724
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Nan Zhu
> Fix For: 1.4.0
>
>
> In AkkaUtils, we set several failure-detector-related parameters as follows:
> {code:title=AkkaUtils.scala|borderStyle=solid}
> val akkaConf = ConfigFactory.parseMap(conf.getAkkaConf.toMap[String, String])
>   .withFallback(akkaSslConfig).withFallback(ConfigFactory.parseString(
>   s"""
>   |akka.daemonic = on
>   |akka.loggers = [""akka.event.slf4j.Slf4jLogger""]
>   |akka.stdout-loglevel = "ERROR"
>   |akka.jvm-exit-on-fatal-error = off
>   |akka.remote.require-cookie = "$requireCookie"
>   |akka.remote.secure-cookie = "$secureCookie"
>   |akka.remote.transport-failure-detector.heartbeat-interval = 
> $akkaHeartBeatInterval s
>   |akka.remote.transport-failure-detector.acceptable-heartbeat-pause = 
> $akkaHeartBeatPauses s
>   |akka.remote.transport-failure-detector.threshold = $akkaFailureDetector
>   |akka.actor.provider = "akka.remote.RemoteActorRefProvider"
>   |akka.remote.netty.tcp.transport-class = 
> "akka.remote.transport.netty.NettyTransport"
>   |akka.remote.netty.tcp.hostname = "$host"
>   |akka.remote.netty.tcp.port = $port
>   |akka.remote.netty.tcp.tcp-nodelay = on
>   |akka.remote.netty.tcp.connection-timeout = $akkaTimeout s
>   |akka.remote.netty.tcp.maximum-frame-size = ${akkaFrameSize}B
>   |akka.remote.netty.tcp.execution-pool-size = $akkaThreads
>   |akka.actor.default-dispatcher.throughput = $akkaBatchSize
>   |akka.log-config-on-start = $logAkkaConfig
>   |akka.remote.log-remote-lifecycle-events = $lifecycleEvents
>   |akka.log-dead-letters = $lifecycleEvents
>   |akka.log-dead-letters-during-shutdown = $lifecycleEvents
>   """.stripMargin))
> {code}
> Actually, there is no parameter named 
> "akka.remote.transport-failure-detector.threshold"
> (see http://doc.akka.io/docs/akka/2.3.4/general/configuration.html);
> what we have is "akka.remote.watch-failure-detector.threshold".
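For reference, the documented key would look like this (an illustrative fragment only; the threshold value is made up, and the linked pull request should be consulted for what the actual fix does):

{code}
import com.typesafe.config.ConfigFactory

// The Akka 2.3.x reference documents akka.remote.watch-failure-detector.threshold;
// the transport-failure-detector block has no "threshold" entry, so the original
// line is silently ignored rather than rejected.
val threshold = 300.0  // illustrative value; Spark derives this from its own config
val akkaConf = ConfigFactory.parseString(
  s"""
     |akka.remote.watch-failure-detector.threshold = $threshold
   """.stripMargin)
{code}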






[jira] [Updated] (SPARK-5724) misconfiguration in Akka system

2015-02-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5724:
-
Priority: Minor  (was: Major)
Target Version/s:   (was: 1.3.0)
Assignee: Nan Zhu

> misconfiguration in Akka system
> ---
>
> Key: SPARK-5724
> URL: https://issues.apache.org/jira/browse/SPARK-5724
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Nan Zhu
>Assignee: Nan Zhu
>Priority: Minor
> Fix For: 1.4.0
>
>
> In AkkaUtils, we set several failure-detector-related parameters as follows:
> {code:title=AkkaUtils.scala|borderStyle=solid}
> val akkaConf = ConfigFactory.parseMap(conf.getAkkaConf.toMap[String, String])
>   .withFallback(akkaSslConfig).withFallback(ConfigFactory.parseString(
>   s"""
>   |akka.daemonic = on
>   |akka.loggers = [""akka.event.slf4j.Slf4jLogger""]
>   |akka.stdout-loglevel = "ERROR"
>   |akka.jvm-exit-on-fatal-error = off
>   |akka.remote.require-cookie = "$requireCookie"
>   |akka.remote.secure-cookie = "$secureCookie"
>   |akka.remote.transport-failure-detector.heartbeat-interval = 
> $akkaHeartBeatInterval s
>   |akka.remote.transport-failure-detector.acceptable-heartbeat-pause = 
> $akkaHeartBeatPauses s
>   |akka.remote.transport-failure-detector.threshold = $akkaFailureDetector
>   |akka.actor.provider = "akka.remote.RemoteActorRefProvider"
>   |akka.remote.netty.tcp.transport-class = 
> "akka.remote.transport.netty.NettyTransport"
>   |akka.remote.netty.tcp.hostname = "$host"
>   |akka.remote.netty.tcp.port = $port
>   |akka.remote.netty.tcp.tcp-nodelay = on
>   |akka.remote.netty.tcp.connection-timeout = $akkaTimeout s
>   |akka.remote.netty.tcp.maximum-frame-size = ${akkaFrameSize}B
>   |akka.remote.netty.tcp.execution-pool-size = $akkaThreads
>   |akka.actor.default-dispatcher.throughput = $akkaBatchSize
>   |akka.log-config-on-start = $logAkkaConfig
>   |akka.remote.log-remote-lifecycle-events = $lifecycleEvents
>   |akka.log-dead-letters = $lifecycleEvents
>   |akka.log-dead-letters-during-shutdown = $lifecycleEvents
>   """.stripMargin))
> {code}
> Actually, there is no parameter named 
> "akka.remote.transport-failure-detector.threshold"
> (see http://doc.akka.io/docs/akka/2.3.4/general/configuration.html);
> what we have is "akka.remote.watch-failure-detector.threshold".






[jira] [Created] (SPARK-5947) First class partitioning support in data sources API

2015-02-23 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-5947:
-

 Summary: First class partitioning support in data sources API
 Key: SPARK-5947
 URL: https://issues.apache.org/jira/browse/SPARK-5947
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Lian


For file-system-based data sources, implementing Hive-style partitioning 
support can be complex and error-prone. To be specific, partitioning support 
includes:

# Partition discovery: given a directory laid out similarly to Hive partitions, 
discover the directory structure and partitioning information automatically, 
including partition column names, data types, and values.
# Reading from partitioned tables
# Writing to partitioned tables

It would be good to have first-class partitioning support in the data sources 
API. For example, add a {{FileBasedScan}} trait with callbacks and default 
implementations for these features.
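A rough sketch of what such a trait could look like; {{FileBasedScan}} is the name suggested above, and every other name in the snippet is hypothetical and does not exist in Spark.

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Hypothetical representation of one Hive-style partition (e.g. .../year=2015/month=02).
case class PartitionSpec(columns: StructType, values: Map[String, Any], path: String)

// Hypothetical shape of first-class partitioning support in the data sources API.
trait FileBasedScan {
  /** Root directory of the (possibly partitioned) data set. */
  def rootPath: String

  /** 1. Partition discovery: walk the key=value directory layout and recover
    * partition column names, data types, and values. */
  def discoverPartitions(): Seq[PartitionSpec]

  /** 2. Reading: scan only the partitions matching the requested values. */
  def buildScan(requiredColumns: Seq[String], partitions: Seq[PartitionSpec]): RDD[Row]

  /** 3. Writing: append a new partition under the Hive-style layout. */
  def insertPartition(partition: PartitionSpec, data: RDD[Row]): Unit
}
{code}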






[jira] [Created] (SPARK-5948) Support writing to partitioned table for the Parquet data source

2015-02-23 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-5948:
-

 Summary: Support writing to partitioned table for the Parquet data 
source
 Key: SPARK-5948
 URL: https://issues.apache.org/jira/browse/SPARK-5948
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Lian


In 1.3.0, we added support for reading partitioned tables declared in the Hive 
metastore for the Parquet data source. However, writing to partitioned tables 
is not supported yet. This feature should probably be built upon SPARK-5947.






[jira] [Commented] (SPARK-5940) Graph Loader: refactor + add more formats

2015-02-23 Thread lukovnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333265#comment-14333265
 ] 

lukovnikov commented on SPARK-5940:
---

Probably it's better to involve core Spark/GraphX developers here as well, 
especially for the refactoring part.

Anyway, for my part, I'd love to have loaders (and savers) for different RDF 
formats and (sparse) matrix/tensor formats (for scipy/...).

As for the refactoring, I'd guess it would make sense to have a loader 
interface with a load() method and a facade combining different loader 
interface implementations (so it would have methods like loadNT(), loadEdges(), 
loadTTL(),...).
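A rough sketch of the interface-plus-facade shape described above; apart from GraphLoader.edgeListFile, all names are hypothetical.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.graphx.Graph

// Hypothetical loader interface: one implementation per input format.
trait FormatGraphLoader {
  def load(sc: SparkContext, path: String): Graph[Int, Int]
}

// Existing edge-list format, wrapped in the interface.
object EdgeListLoader extends FormatGraphLoader {
  def load(sc: SparkContext, path: String): Graph[Int, Int] =
    org.apache.spark.graphx.GraphLoader.edgeListFile(sc, path)
}

// Facade exposing one method per format and delegating to the implementations.
object GraphLoaders {
  def loadEdges(sc: SparkContext, path: String): Graph[Int, Int] =
    EdgeListLoader.load(sc, path)
  // loadNT(), loadTTL(), ... would delegate to RDF loaders in the same way.
}
{code}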

> Graph Loader: refactor + add more formats
> -
>
> Key: SPARK-5940
> URL: https://issues.apache.org/jira/browse/SPARK-5940
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: lukovnikov
>Priority: Minor
>
> Currently, the only graph loader is GraphLoader.edgeListFile. [SPARK-5280] 
> adds an RDF graph loader.
> However, as Takeshi Yamamuro suggested on GitHub for [SPARK-5280] 
> (https://github.com/apache/spark/pull/4650), it might be interesting to make 
> GraphLoader an interface with several implementations for different formats. 
> And maybe it's good to make a façade graph loader that provides a unified 
> interface to all loaders.






[jira] [Created] (SPARK-5949) Driver program has to register roaring bitmap classes used by spark with Kryo when number of partitions is greater than 2000

2015-02-23 Thread Peter Torok (JIRA)
Peter Torok created SPARK-5949:
--

 Summary: Driver program has to register roaring bitmap classes 
used by spark with Kryo when number of partitions is greater than 2000
 Key: SPARK-5949
 URL: https://issues.apache.org/jira/browse/SPARK-5949
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Peter Torok


When more than 2000 partitions are being used with Kryo, the following classes 
need to be registered by the driver program:
- org.apache.spark.scheduler.HighlyCompressedMapStatus
- org.roaringbitmap.RoaringBitmap
- org.roaringbitmap.RoaringArray
- org.roaringbitmap.ArrayContainer
- org.roaringbitmap.RoaringArray$Element
- org.roaringbitmap.RoaringArray$Element[]

Our project doesn't have a dependency on RoaringBitmap, and 
HighlyCompressedMapStatus is intended for internal Spark usage. Spark should 
take care of this registration when Kryo is used.
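Until Spark registers these classes itself, an application-side workaround could look roughly like this (a sketch based on the class list above; registering by name avoids referring to Spark's package-private internals at compile time, and the exact list may vary by Spark version):

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Registers the classes reported above by name.
class WorkaroundRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    Seq(
      "org.apache.spark.scheduler.HighlyCompressedMapStatus",
      "org.roaringbitmap.RoaringBitmap",
      "org.roaringbitmap.RoaringArray",
      "org.roaringbitmap.ArrayContainer",
      "org.roaringbitmap.RoaringArray$Element",
      "[Lorg.roaringbitmap.RoaringArray$Element;",  // RoaringArray$Element[]
      "[S"                                          // short[]
    ).foreach(name => kryo.register(Class.forName(name)))
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[WorkaroundRegistrator].getName)
{code}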






[jira] [Updated] (SPARK-5949) Driver program has to register roaring bitmap classes used by spark with Kryo when number of partitions is greater than 2000

2015-02-23 Thread Peter Torok (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Torok updated SPARK-5949:
---
Description: 
When more than 2000 partitions are being used with Kryo, the following classes 
need to be registered by the driver program:
- org.apache.spark.scheduler.HighlyCompressedMapStatus
- org.roaringbitmap.RoaringBitmap
- org.roaringbitmap.RoaringArray
- org.roaringbitmap.ArrayContainer
- org.roaringbitmap.RoaringArray$Element
- org.roaringbitmap.RoaringArray$Element[]
- short[]

Our project doesn't have a dependency on RoaringBitmap, and 
HighlyCompressedMapStatus is intended for internal Spark usage. Spark should 
take care of this registration when Kryo is used.

  was:
When more than 2000 partitions are being used with Kryo, the following classes 
need to be registered by the driver program:
- org.apache.spark.scheduler.HighlyCompressedMapStatus
- org.roaringbitmap.RoaringBitmap
- org.roaringbitmap.RoaringArray
- org.roaringbitmap.ArrayContainer
- org.roaringbitmap.RoaringArray$Element
- org.roaringbitmap.RoaringArray$Element[]

Our project doesn't have a dependency on RoaringBitmap, and 
HighlyCompressedMapStatus is intended for internal Spark usage. Spark should 
take care of this registration when Kryo is used.


> Driver program has to register roaring bitmap classes used by spark with Kryo 
> when number of partitions is greater than 2000
> 
>
> Key: SPARK-5949
> URL: https://issues.apache.org/jira/browse/SPARK-5949
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Peter Torok
>  Labels: kryo, partitioning, serialization
>
> When more than 2000 partitions are being used with Kryo, the following 
> classes need to be registered by the driver program:
> - org.apache.spark.scheduler.HighlyCompressedMapStatus
> - org.roaringbitmap.RoaringBitmap
> - org.roaringbitmap.RoaringArray
> - org.roaringbitmap.ArrayContainer
> - org.roaringbitmap.RoaringArray$Element
> - org.roaringbitmap.RoaringArray$Element[]
> - short[]
> Our project doesn't have a dependency on RoaringBitmap, and 
> HighlyCompressedMapStatus is intended for internal Spark usage. Spark should 
> take care of this registration when Kryo is used.






[jira] [Commented] (SPARK-5905) Improve RowMatrix user guide and doc.

2015-02-23 Thread Mike Beyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333417#comment-14333417
 ] 

Mike Beyer commented on SPARK-5905:
---

OK, then we have the same understanding on naming the matrix - simply mentioning 
"(rows x columns)" would make this predominant convention explicit. I would not 
assume that most people reading this with the intention of just applying ML 
have it in mind - in particular if you are on RowMatrix and read so many 
statements about n and k. Every piece that is formulated unambiguously helps 
to build a clearer understanding.

> Improve RowMatrix user guide and doc.
> -
>
> Key: SPARK-5905
> URL: https://issues.apache.org/jira/browse/SPARK-5905
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> From mbofb's comment in PR https://github.com/apache/spark/pull/4680:
> {code}
> The description of RowMatrix.computeSVD and 
> mllib-dimensionality-reduction.html should be more precise/explicit regarding 
> the m x n matrix. In the current description I would conclude that n refers 
> to the rows. According to 
> http://math.stackexchange.com/questions/191711/how-many-rows-and-columns-are-in-an-m-x-n-matrix
>  this way of describing a matrix is only used in particular domains. As a 
> reader interested in applying SVD, I would rather prefer the more common m x n 
> way of rows x columns (e.g. 
> http://en.wikipedia.org/wiki/Matrix_%28mathematics%29 ) which is also used in 
> http://en.wikipedia.org/wiki/Latent_semantic_analysis (and also within the 
> ARPACK manual:
> “
> N Integer. (INPUT) - Dimension of the eigenproblem. 
> NEV Integer. (INPUT) - Number of eigenvalues of OP to be computed. 0 < NEV < 
> N. 
> NCV Integer. (INPUT) - Number of columns of the matrix V (less than or equal 
> to N).
> “
> ).
> description of RowMatrix.computeSVD and mllib-dimensionality-reduction.html:
> "We assume n is smaller than m." Is this just a recommendation or a hard 
> requirement. This condition seems not to be checked and causing an 
> IllegalArgumentException – the processing finishes even though the vectors 
> have a higher dimension than the number of vectors.
> description of RowMatrix. computePrincipalComponents or RowMatrix in general:
> I got a Exception.
> java.lang.IllegalArgumentException: Argument with more than 65535 cols: 
> 7949273
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.checkNumColumns(RowMatrix.scala:131)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:318)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponents(RowMatrix.scala:373)
> It would be nice if this 65535-column restriction were mentioned in the doc (if it 
> still applies in 1.3).
> {code}
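To make the rows x columns convention concrete (the RowMatrix calls are real MLlib methods; the data and the choice of k are made up):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

def example(sc: SparkContext): Unit = {
  // m = number of rows (RDD entries), n = size of each row vector (columns).
  val rows = sc.parallelize(Seq(
    Vectors.dense(1.0, 2.0, 3.0),
    Vectors.dense(4.0, 5.0, 6.0),
    Vectors.dense(7.0, 8.0, 9.0),
    Vectors.dense(10.0, 11.0, 12.0)))          // m = 4, n = 3
  val mat = new RowMatrix(rows)
  println(s"m = ${mat.numRows()}, n = ${mat.numCols()}")

  // computeSVD assumes n (columns) is the small dimension, and
  // computePrincipalComponents additionally requires n <= 65535.
  val svd = mat.computeSVD(2, computeU = true)
  println(svd.s)  // the k = 2 leading singular values
}
{code}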






[jira] [Updated] (SPARK-5939) Make FPGrowth example app take parameters

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5939:
-
Assignee: Jacky Li

> Make FPGrowth example app take parameters
> -
>
> Key: SPARK-5939
> URL: https://issues.apache.org/jira/browse/SPARK-5939
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.1
>Reporter: Jacky Li
>Assignee: Jacky Li
>Priority: Minor
> Fix For: 1.3.0
>
>







[jira] [Resolved] (SPARK-5939) Make FPGrowth example app take parameters

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5939.
--
Resolution: Fixed

Issue resolved by pull request 4714
[https://github.com/apache/spark/pull/4714]

> Make FPGrowth example app take parameters
> -
>
> Key: SPARK-5939
> URL: https://issues.apache.org/jira/browse/SPARK-5939
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.1
>Reporter: Jacky Li
>Priority: Minor
> Fix For: 1.3.0
>
>







[jira] [Commented] (SPARK-5944) Python release docs say SNAPSHOT + Author is missing

2015-02-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333496#comment-14333496
 ] 

Nicholas Chammas commented on SPARK-5944:
-

I'm not sure, but I think [here in the root 
POM|https://github.com/apache/spark/blob/242d49584c6aa21d928db2552033661950f760a5/pom.xml#L29]
 is where you can programmatically fetch the release version. (cc [~srowen] for 
verification)

Also, we should update the [release 
checklist|https://cwiki.apache.org/confluence/display/SPARK/Preparing+Spark+Releases#PreparingSparkReleases-PreparingSparkforRelease]
 so this isn't missed again.

Maybe this is something that goes in [this audit 
script|https://github.com/apache/spark/blob/master/dev/audit-release/audit_release.py]?
 (cc [~pwendell])

> Python release docs say SNAPSHOT + Author is missing
> 
>
> Key: SPARK-5944
> URL: https://issues.apache.org/jira/browse/SPARK-5944
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.2.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> http://spark.apache.org/docs/latest/api/python/index.html
> As of Feb 2015, that link says PySpark 1.2-SNAPSHOT. It should probably say 
> 1.2.1.
> Furthermore, in the footer it says "Copyright 2014, Author." It should 
> probably say something else or be removed altogether.






[jira] [Updated] (SPARK-5944) Python release docs say SNAPSHOT + Author is missing

2015-02-23 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-5944:

Target Version/s: 1.2.2

> Python release docs say SNAPSHOT + Author is missing
> 
>
> Key: SPARK-5944
> URL: https://issues.apache.org/jira/browse/SPARK-5944
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.2.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> http://spark.apache.org/docs/latest/api/python/index.html
> As of Feb 2015, that link says PySpark 1.2-SNAPSHOT. It should probably say 
> 1.2.1.
> Furthermore, in the footer it says "Copyright 2014, Author." It should 
> probably say something else or be removed altogether.






[jira] [Updated] (SPARK-5944) Python release docs say SNAPSHOT + Author is missing

2015-02-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-5944:
--
Target Version/s: 1.3.0, 1.2.2  (was: 1.2.2)

> Python release docs say SNAPSHOT + Author is missing
> 
>
> Key: SPARK-5944
> URL: https://issues.apache.org/jira/browse/SPARK-5944
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.2.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> http://spark.apache.org/docs/latest/api/python/index.html
> As of Feb 2015, that link says PySpark 1.2-SNAPSHOT. It should probably say 
> 1.2.1.
> Furthermore, in the footer it says "Copyright 2014, Author." It should 
> probably say something else or be removed altogether.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5950) Insert array into table saved as parquet should work when using datasource api

2015-02-23 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-5950:
--

 Summary: Insert array into table saved as parquet should work when 
using datasource api
 Key: SPARK-5950
 URL: https://issues.apache.org/jira/browse/SPARK-5950
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5950) Insert array into table saved as parquet should work when using datasource api

2015-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333532#comment-14333532
 ] 

Apache Spark commented on SPARK-5950:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/4729

> Insert array into table saved as parquet should work when using datasource api
> --
>
> Key: SPARK-5950
> URL: https://issues.apache.org/jira/browse/SPARK-5950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5951) Remove unreachable driver memory properties in yarn client mode (YarnClientSchedulerBackend)

2015-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333568#comment-14333568
 ] 

Apache Spark commented on SPARK-5951:
-

User 'zuxqoj' has created a pull request for this issue:
https://github.com/apache/spark/pull/4730

> Remove unreachable driver memory properties in yarn client mode 
> (YarnClientSchedulerBackend)
> 
>
> Key: SPARK-5951
> URL: https://issues.apache.org/jira/browse/SPARK-5951
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0
> Environment: yarn
>Reporter: Shekhar Bansal
>Priority: Trivial
> Fix For: 1.3.0
>
>
> In SPARK-4730 a warning for deprecated configs was added, and in SPARK-1953 
> driver memory configs were removed in yarn client mode.
> During that integration, spark.master.memory and SPARK_MASTER_MEMORY were not 
> removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5951) Remove unreachable driver memory properties in yarn client mode (YarnClientSchedulerBackend)

2015-02-23 Thread Shekhar Bansal (JIRA)
Shekhar Bansal created SPARK-5951:
-

 Summary: Remove unreachable driver memory properties in yarn 
client mode (YarnClientSchedulerBackend)
 Key: SPARK-5951
 URL: https://issues.apache.org/jira/browse/SPARK-5951
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.0
 Environment: yarn
Reporter: Shekhar Bansal
Priority: Trivial
 Fix For: 1.3.0


In SPARK-4730 a warning for deprecated configs was added, and in SPARK-1953 
driver memory configs were removed in yarn client mode.

During that integration, spark.master.memory and SPARK_MASTER_MEMORY were not removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3511) Create a RELEASE-NOTES.txt file in the repo

2015-02-23 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3511.

Resolution: Won't Fix

Never ended up doing this. It's stale so I'm just gonna remove it.

> Create a RELEASE-NOTES.txt file in the repo
> ---
>
> Key: SPARK-3511
> URL: https://issues.apache.org/jira/browse/SPARK-3511
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Blocker
>
> There are a few different things we need to do a better job of tracking. This 
> file would allow us to track things:
> 1. When we want to give credit to secondary people for contributing to a patch
> 2. Changes to default configuration values w/ how to restore legacy options
> 3. New features that are disabled by default
> 4. Known API breaks (if any) along w/ explanation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5463) Fix Parquet filter push-down

2015-02-23 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5463:
---
Priority: Critical  (was: Blocker)

> Fix Parquet filter push-down
> 
>
> Key: SPARK-5463
> URL: https://issues.apache.org/jira/browse/SPARK-5463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1, 1.2.2
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3650) Triangle Count handles reverse edges incorrectly

2015-02-23 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3650:
---
Priority: Critical  (was: Blocker)

> Triangle Count handles reverse edges incorrectly
> 
>
> Key: SPARK-3650
> URL: https://issues.apache.org/jira/browse/SPARK-3650
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Joseph E. Gonzalez
>Priority: Critical
>
> The triangle count implementation assumes that edges are aligned in a 
> canonical direction.  As stated in the documentation:
> bq. Note that the input graph should have its edges in canonical direction 
> (i.e. the `sourceId` less than `destId`)
> However the TriangleCount algorithm does not verify that this condition holds 
> and indeed even the unit tests exploits this functionality:
> {code:scala}
> val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++
>   Array(0L -> -1L, -1L -> -2L, -2L -> 0L)
> val rawEdges = sc.parallelize(triangles, 2)
> val graph = Graph.fromEdgeTuples(rawEdges, true).cache()
> val triangleCount = graph.triangleCount()
> val verts = triangleCount.vertices
> verts.collect().foreach { case (vid, count) =>
>   if (vid == 0) {
>     assert(count === 4)  // <-- Should be 2
>   } else {
>     assert(count === 2)  // <-- Should be 1
>   }
> }
> {code}
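
A minimal sketch of the canonicalization the documentation asks for, assuming the {{sc}} and edge tuples from the snippet above (variable names are illustrative, not part of the original report):

{code:scala}
import org.apache.spark.graphx.Graph

// Put every edge into canonical direction (sourceId < destId) and drop
// duplicates before counting triangles, as the TriangleCount docs require.
val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++
  Array(0L -> -1L, -1L -> -2L, -2L -> 0L)
val canonicalEdges = sc.parallelize(triangles, 2)
  .map { case (src, dst) => if (src < dst) (src, dst) else (dst, src) }
  .distinct()
val graph = Graph.fromEdgeTuples(canonicalEdges, true).cache()
// With canonical edges, vertex 0 should report 2 triangles and the others 1.
val counts = graph.triangleCount().vertices.collect()
{code}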



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5904) DataFrame methods with varargs do not work in Java

2015-02-23 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5904.

   Resolution: Fixed
Fix Version/s: 1.3.0

I think rxin just forgot to close this. It was merged several days ago.

> DataFrame methods with varargs do not work in Java
> --
>
> Key: SPARK-5904
> URL: https://issues.apache.org/jira/browse/SPARK-5904
> Project: Spark
>  Issue Type: Sub-task
>  Components: Java API, SQL
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Reynold Xin
>Priority: Blocker
>  Labels: DataFrame
> Fix For: 1.3.0
>
>
> DataFrame methods with varargs fail when called from Java due to a bug in 
> Scala.
> This can be produced by, e.g., modifying the end of the example 
> ml.JavaSimpleParamsExample in the master branch:
> {code}
> DataFrame results = model2.transform(test);
> results.printSchema(); // works
> results.collect(); // works
> results.filter("label > 0.0").count(); // works
> for (Row r: results.select("features", "label", "myProbability", 
> "prediction").collect()) { // fails on select
>   System.out.println("(" + r.get(0) + ", " + r.get(1) + ") -> prob=" + 
> r.get(2)
>   + ", prediction=" + r.get(3));
> }
> {code}
> I have also tried groupBy and found that it fails too.
> The error looks like this:
> {code}
> Exception in thread "main" java.lang.AbstractMethodError: 
> org.apache.spark.sql.DataFrameImpl.groupBy(Ljava/lang/String;[Ljava/lang/String;)Lorg/apache/spark/sql/GroupedData;
>   at 
> org.apache.spark.examples.ml.JavaSimpleParamsExample.main(JavaSimpleParamsExample.java:108)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> The error appears to be from this Scala bug with using varargs in an abstract 
> method:
> [https://issues.scala-lang.org/browse/SI-9013]
> My current plan is to move the implementations of the methods with varargs 
> from DataFrameImpl to DataFrame.
> However, this may cause issues with IncomputableColumn---feedback??
> Thanks to [~joshrosen] for figuring the bug and fix out!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5463) Fix Parquet filter push-down

2015-02-23 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333608#comment-14333608
 ] 

Patrick Wendell commented on SPARK-5463:


Bumping to critical. Per our offline discussion last week we probably won't 
hold the release for this.

> Fix Parquet filter push-down
> 
>
> Key: SPARK-5463
> URL: https://issues.apache.org/jira/browse/SPARK-5463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1, 1.2.2
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-765) Test suite should run Spark example programs

2015-02-23 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333682#comment-14333682
 ] 

Josh Rosen commented on SPARK-765:
--

Yep, this still needs to be done.  It's more of an issue for the Python 
examples than the Scala / Java ones, since at least the JVM ones are guaranteed 
to compile.  It's unlikely that the examples are broken, though, given that we 
have compatibility guarantees, so this is probably a lower priority relative to 
other test automation tasks.

> Test suite should run Spark example programs
> 
>
> Key: SPARK-765
> URL: https://issues.apache.org/jira/browse/SPARK-765
> Project: Spark
>  Issue Type: New Feature
>  Components: Examples
>Reporter: Josh Rosen
>
> The Spark test suite should also run each of the Spark example programs (the 
> PySpark suite should do the same).  This should be done through a shell 
> script or other mechanism to simulate the environment setup used by end users 
> that run those scripts.
> This would prevent problems like SPARK-764 from making it into releases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5750) Document that ordering of elements in shuffled partitions is not deterministic across runs

2015-02-23 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333685#comment-14333685
 ] 

Ilya Ganelin commented on SPARK-5750:
-

Hi Josh - I can knock this out. Thanks.

> Document that ordering of elements in shuffled partitions is not 
> deterministic across runs
> --
>
> Key: SPARK-5750
> URL: https://issues.apache.org/jira/browse/SPARK-5750
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Josh Rosen
>
> The ordering of elements in shuffled partitions is not deterministic across 
> runs.  For instance, consider the following example:
> {code}
> val largeFiles = sc.textFile(...)
> val airlines = largeFiles.repartition(2000).cache()
> println(airlines.first)
> {code}
> If this code is run twice, then each run will output a different result.  
> There is non-determinism in the shuffle read code that accounts for this:
> Spark's shuffle read path processes blocks as soon as they are fetched: Spark 
> uses 
> [ShuffleBlockFetcherIterator|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala]
>  to fetch shuffle data from mappers.  In this code, requests for multiple 
> blocks from the same host are batched together, so nondeterminism in where 
> tasks are run means that the set of requests can vary across runs.  In 
> addition, there's an [explicit 
> call|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L256]
>  to randomize the order of the batched fetch requests.  As a result, shuffle 
> operations cannot be guaranteed to produce the same ordering of the elements 
> in their partitions.
> Therefore, Spark should update its docs to clarify that the ordering of 
> elements in shuffle RDDs' partitions is non-deterministic.  Note, however, 
> that the _set_ of elements in each partition will be deterministic: if we 
> used {{mapPartitions}} to sort each partition, then the {{first()}} call 
> above would produce a deterministic result.
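
As a concrete illustration of that last point, a minimal sketch (assuming the {{largeFiles}} RDD of strings from the example above; a sketch, not an official recommendation):

{code:scala}
// Sorting within each partition fixes the element order, so first() becomes
// deterministic even though shuffle blocks arrive in a random order.
val airlines = largeFiles
  .repartition(2000)
  .mapPartitions(iter => iter.toArray.sorted.iterator, preservesPartitioning = true)
  .cache()
println(airlines.first())  // same result on every run
{code}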



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5845) Time to cleanup intermediate shuffle files not included in shuffle write time

2015-02-23 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333691#comment-14333691
 ] 

Ilya Ganelin commented on SPARK-5845:
-

Hi Kay - I can knock this one out. Thanks. 

> Time to cleanup intermediate shuffle files not included in shuffle write time
> -
>
> Key: SPARK-5845
> URL: https://issues.apache.org/jira/browse/SPARK-5845
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Kay Ousterhout
>Priority: Minor
>
> When the disk is contended, I've observed cases when it takes as long as 7 
> seconds to clean up all of the intermediate spill files for a shuffle (when 
> using the sort based shuffle, but bypassing merging because there are <=200 
> shuffle partitions).  This is even when the shuffle data is non-huge (152MB 
> written from one of the tasks where I observed this).  This is effectively 
> part of the shuffle write time (because it's a necessary side effect of 
> writing data to disk) so should be added to the shuffle write time to 
> facilitate debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5944) Python release docs say SNAPSHOT + Author is missing

2015-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333699#comment-14333699
 ] 

Apache Spark commented on SPARK-5944:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4731

> Python release docs say SNAPSHOT + Author is missing
> 
>
> Key: SPARK-5944
> URL: https://issues.apache.org/jira/browse/SPARK-5944
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.2.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> http://spark.apache.org/docs/latest/api/python/index.html
> As of Feb 2015, that link says PySpark 1.2-SNAPSHOT. It should probably say 
> 1.2.1.
> Furthermore, in the footer it says "Copyright 2014, Author." It should 
> probably say something else or be removed altogether.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5845) Time to cleanup intermediate shuffle files not included in shuffle write time

2015-02-23 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-5845:
--
Assignee: Ilya Ganelin

> Time to cleanup intermediate shuffle files not included in shuffle write time
> -
>
> Key: SPARK-5845
> URL: https://issues.apache.org/jira/browse/SPARK-5845
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Kay Ousterhout
>Assignee: Ilya Ganelin
>Priority: Minor
>
> When the disk is contended, I've observed cases when it takes as long as 7 
> seconds to clean up all of the intermediate spill files for a shuffle (when 
> using the sort based shuffle, but bypassing merging because there are <=200 
> shuffle partitions).  This is even when the shuffle data is non-huge (152MB 
> written from one of the tasks where I observed this).  This is effectively 
> part of the shuffle write time (because it's a necessary side effect of 
> writing data to disk) so should be added to the shuffle write time to 
> facilitate debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5912) Programming guide for feature selection

2015-02-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-5912.
--
   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Alexander Ulanov

> Programming guide for feature selection
> ---
>
> Key: SPARK-5912
> URL: https://issues.apache.org/jira/browse/SPARK-5912
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Alexander Ulanov
> Fix For: 1.3.0
>
>
> The new ChiSqSelector for feature selection should have a section in the 
> Programming Guide.  It should probably be under the feature extraction and 
> transformation section as a new subsection for feature selection.
> If we get more feature selection methods later on, we could expand it to a 
> larger section of the guide.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5079) Detect failed jobs / batches in Spark Streaming unit tests

2015-02-23 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333743#comment-14333743
 ] 

Ilya Ganelin commented on SPARK-5079:
-

Hi [~joshrosen] - I'm trying to wrap my head around the unit tests and find some 
specific tests where this is a problem, as a baseline. If you could highlight a 
couple of examples as a starting point, that would help a lot. Thanks!

> Detect failed jobs / batches in Spark Streaming unit tests
> --
>
> Key: SPARK-5079
> URL: https://issues.apache.org/jira/browse/SPARK-5079
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Josh Rosen
>Assignee: Ilya Ganelin
>
> Currently, it is possible to write Spark Streaming unit tests where Spark 
> jobs fail but the streaming tests succeed because we rely on wall-clock time 
> plus output comparison in order to check whether a test has passed, and 
> hence may miss cases where errors occurred if they didn't affect these 
> results.  We should strengthen the tests to check that no job failures 
> occurred while processing batches.
> See https://github.com/apache/spark/pull/3832#issuecomment-68580794 for 
> additional context.
> The StreamingTestWaiter in https://github.com/apache/spark/pull/3801 might 
> also fix this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5952) Failure to lock metastore client in tableExists()

2015-02-23 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-5952:
---

 Summary: Failure to lock metastore client in tableExists()
 Key: SPARK-5952
 URL: https://issues.apache.org/jira/browse/SPARK-5952
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.3.0
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5953) NoSuchMethodException with a Kafka input stream and custom decoder in Scala

2015-02-23 Thread Aleksandar Stojadinovic (JIRA)
Aleksandar Stojadinovic created SPARK-5953:
--

 Summary: NoSuchMethodException with a Kafka input stream and 
custom decoder in Scala
 Key: SPARK-5953
 URL: https://issues.apache.org/jira/browse/SPARK-5953
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, Spark Core
Affects Versions: 1.2.1, 1.2.0
 Environment: Xubuntu 14.04, Kafka 0.8.2, Scala 2.10.4, Scala 2.11.5
Reporter: Aleksandar Stojadinovic


When using a Kafka input stream, and setting a custom Kafka Decoder, Spark 
throws an exception upon starting:
{noformat}
ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting 
receiver 0 - java.lang.NoSuchMethodException: 
UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
at java.lang.Class.getConstructor0(Class.java:2971)
at java.lang.Class.getConstructor(Class.java:1812)
at 
org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

15/02/23 21:37:31 ERROR ReceiverSupervisorImpl: Stopped executor with error: 
java.lang.NoSuchMethodException: 
UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
15/02/23 21:37:31 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NoSuchMethodException: 
UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
at java.lang.Class.getConstructor0(Class.java:2971)
at java.lang.Class.getConstructor(Class.java:1812)
at 
org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}

The stream is initialized with:

{code:title=Bar.java|borderStyle=solid}
 val locationsAndKeys = KafkaUtils.createStream[String, Array[Byte], 
kafka.serializer.StringDecoder, UserLocationEventDecoder] (ssc, kafkaParams, 
topicMap, StorageLevel.MEMORY_AND_DISK);
{code}

The decoder:
{code:title=Bar.java|borderStyle=solid}

import kafka.serializer.Decoder
class UserLocationEventDecoder extends Decoder[UserLocationEvent] {

  val kryo = new Kryo()

  override def fromBytes(bytes: Array[Byte]): UserLocationEvent = {
val input: Input = new Input(new ByteArrayInputStream(bytes))
val userLocationEvent: UserLocationEvent = 
kryo.readClassAndObject(input).asInstanceOf[UserLocationEvent]
input.close()
return userLocationEvent
  }
}
{code}

The input stream (and my code overall) works fine if initialized with the 
kafka.serializer.DefaultDecoder, and content is manually deserialized. 
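
For reference, the stack trace names a constructor taking {{kafka.utils.VerifiableProperties}} (the {{<init>}} appears to have been swallowed by the markup), which suggests Kafka instantiates decoders reflectively through that signature. Below is a hedged sketch of the decoder with such a constructor added; whether this alone resolves the error is unverified, and {{UserLocationEvent}} is the reporter's own class:

{code:title=UserLocationEventDecoder.scala|borderStyle=solid}
import java.io.ByteArrayInputStream

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Input
import kafka.serializer.Decoder
import kafka.utils.VerifiableProperties

// Declaring a constructor that accepts VerifiableProperties matches the
// signature named in the NoSuchMethodException above.
class UserLocationEventDecoder(props: VerifiableProperties = null)
    extends Decoder[UserLocationEvent] {

  val kryo = new Kryo()

  override def fromBytes(bytes: Array[Byte]): UserLocationEvent = {
    val input = new Input(new ByteArrayInputStream(bytes))
    val event = kryo.readClassAndObject(input).asInstanceOf[UserLocationEvent]
    input.close()
    event
  }
}
{code}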



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-5953) NoSuchMethodException with a Kafka input stream and custom decoder in Scala

2015-02-23 Thread Aleksandar Stojadinovic (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksandar Stojadinovic updated SPARK-5953:
---
Description: 
When using a Kafka input stream, and setting a custom Kafka Decoder, Spark 
throws an exception upon starting:
{noformat}
ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting 
receiver 0 - java.lang.NoSuchMethodException: 
UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
at java.lang.Class.getConstructor0(Class.java:2971)
at java.lang.Class.getConstructor(Class.java:1812)
at 
org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

15/02/23 21:37:31 ERROR ReceiverSupervisorImpl: Stopped executor with error: 
java.lang.NoSuchMethodException: 
UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
15/02/23 21:37:31 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NoSuchMethodException: 
UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
at java.lang.Class.getConstructor0(Class.java:2971)
at java.lang.Class.getConstructor(Class.java:1812)
at 
org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}

The stream is initialized with:

{code:title=Main.scala|borderStyle=solid}
 val locationsAndKeys = KafkaUtils.createStream[String, Array[Byte], 
kafka.serializer.StringDecoder, UserLocationEventDecoder] (ssc, kafkaParams, 
topicMap, StorageLevel.MEMORY_AND_DISK);
{code}

The decoder:
{code:title=UserLocationEventDecoder.scala|borderStyle=solid}

import kafka.serializer.Decoder
class UserLocationEventDecoder extends Decoder[UserLocationEvent] {

  val kryo = new Kryo()

  override def fromBytes(bytes: Array[Byte]): UserLocationEvent = {
val input: Input = new Input(new ByteArrayInputStream(bytes))
val userLocationEvent: UserLocationEvent = 
kryo.readClassAndObject(input).asInstanceOf[UserLocationEvent]
input.close()
return userLocationEvent
  }
}
{code}

The input stream (and my code overall) works fine if initialized with the 
kafka.serializer.DefaultDecoder, and content is manually deserialized. 

  was:
When using a Kafka input stream, and setting a custom Kafka Decoder, Spark 
throws an exception upon starting:
{noformat}
ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting 
receiver 0 - java.lang.NoSuchMethodException: 
UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
at java.lang.Class.getConstructor0(Class.java:2971)
at java.lang.Class.getConstructor(Class.java:1812)
at 
org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
at 
org.a

[jira] [Updated] (SPARK-5953) NoSuchMethodException with a Kafka input stream and custom decoder in Scala

2015-02-23 Thread Aleksandar Stojadinovic (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksandar Stojadinovic updated SPARK-5953:
---
Description: 
When using a Kafka input stream, and setting a custom Kafka Decoder, Spark 
throws an exception upon starting:
{noformat}
ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting 
receiver 0 - java.lang.NoSuchMethodException: 
UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
at java.lang.Class.getConstructor0(Class.java:2971)
at java.lang.Class.getConstructor(Class.java:1812)
at 
org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

15/02/23 21:37:31 ERROR ReceiverSupervisorImpl: Stopped executor with error: 
java.lang.NoSuchMethodException: 
UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
15/02/23 21:37:31 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NoSuchMethodException: 
UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
at java.lang.Class.getConstructor0(Class.java:2971)
at java.lang.Class.getConstructor(Class.java:1812)
at 
org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}

The stream is initialized with:

{code:borderStyle=solid}
 val locationsAndKeys = KafkaUtils.createStream[String, Array[Byte], 
kafka.serializer.StringDecoder, UserLocationEventDecoder] (ssc, kafkaParams, 
topicMap, StorageLevel.MEMORY_AND_DISK);
{code}

The decoder:
{code:borderStyle=solid}

import kafka.serializer.Decoder
class UserLocationEventDecoder extends Decoder[UserLocationEvent] {

  val kryo = new Kryo()

  override def fromBytes(bytes: Array[Byte]): UserLocationEvent = {
val input: Input = new Input(new ByteArrayInputStream(bytes))
val userLocationEvent: UserLocationEvent = 
kryo.readClassAndObject(input).asInstanceOf[UserLocationEvent]
input.close()
return userLocationEvent
  }
}
{code}

The input stream (and my code overall) works fine if initialized with the 
kafka.serializer.DefaultDecoder, and content is manually deserialized. 

  was:
When using a Kafka input stream, and setting a custom Kafka Decoder, Spark 
throws an exception upon starting:
{noformat}
ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting 
receiver 0 - java.lang.NoSuchMethodException: 
UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
at java.lang.Class.getConstructor0(Class.java:2971)
at java.lang.Class.getConstructor(Class.java:1812)
at 
org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.star

[jira] [Commented] (SPARK-5953) NoSuchMethodException with a Kafka input stream and custom decoder in Scala

2015-02-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333799#comment-14333799
 ] 

Sean Owen commented on SPARK-5953:
--

Dumb question: is it on the classpath, i.e. in your app JAR? I'm wondering whether 
Kafka can see your user classpath.

> NoSuchMethodException with a Kafka input stream and custom decoder in Scala
> ---
>
> Key: SPARK-5953
> URL: https://issues.apache.org/jira/browse/SPARK-5953
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 1.2.0, 1.2.1
> Environment: Xubuntu 14.04, Kafka 0.8.2, Scala 2.10.4, Scala 2.11.5
>Reporter: Aleksandar Stojadinovic
>
> When using a Kafka input stream, and setting a custom Kafka Decoder, Spark 
> throws an exception upon starting:
> {noformat}
> ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting 
> receiver 0 - java.lang.NoSuchMethodException: 
> UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
>   at java.lang.Class.getConstructor0(Class.java:2971)
>   at java.lang.Class.getConstructor(Class.java:1812)
>   at 
> org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 15/02/23 21:37:31 ERROR ReceiverSupervisorImpl: Stopped executor with error: 
> java.lang.NoSuchMethodException: 
> UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
> 15/02/23 21:37:31 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NoSuchMethodException: 
> UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
>   at java.lang.Class.getConstructor0(Class.java:2971)
>   at java.lang.Class.getConstructor(Class.java:1812)
>   at 
> org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The stream is initialized with:
> {code:title=Main.scala|borderStyle=solid}
>  val locationsAndKeys = KafkaUtils.createStream[String, Array[Byte], 
> kafka.serializer.StringDecoder, UserLocationEventDecoder] (ssc, kafkaParams, 
> topicMap, StorageLevel.MEMORY_AND_DISK);
> {code}
> The decoder:
> {code:title=UserLocationEventDecoder.scala|borderStyle=solid}
> import kafka.serializer.Decoder
> class UserLocationEventDecoder extends Decoder[UserLocationEvent] {
>   val kryo = new Kryo()
>   override def fromBytes(bytes: Array[Byte]): UserLocationEvent = {
> val input: Input = new Input(new ByteArrayInputStream(bytes))
> val userLocationEvent: UserLocationEvent = 
> kryo.readClassAndObject(input).asInstanceOf[UserLocationEvent]
> in

[jira] [Updated] (SPARK-5953) NoSuchMethodException with a Kafka input stream and custom decoder in Scala

2015-02-23 Thread Aleksandar Stojadinovic (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksandar Stojadinovic updated SPARK-5953:
---
Description: 
When using a Kafka input stream, and setting a custom Kafka Decoder, Spark 
throws an exception upon starting:
{noformat}
ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting 
receiver 0 - java.lang.NoSuchMethodException: 
UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
at java.lang.Class.getConstructor0(Class.java:2971)
at java.lang.Class.getConstructor(Class.java:1812)
at 
org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

15/02/23 21:37:31 ERROR ReceiverSupervisorImpl: Stopped executor with error: 
java.lang.NoSuchMethodException: 
UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
15/02/23 21:37:31 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NoSuchMethodException: 
UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
at java.lang.Class.getConstructor0(Class.java:2971)
at java.lang.Class.getConstructor(Class.java:1812)
at 
org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}

The stream is initialized with:

{code:title=Main.scala|borderStyle=solid}
 val locationsAndKeys = KafkaUtils.createStream[String, Array[Byte], 
kafka.serializer.StringDecoder, UserLocationEventDecoder] (ssc, kafkaParams, 
topicMap, StorageLevel.MEMORY_AND_DISK);
{code}

The decoder:
{code:title=UserLocationEventDecoder.scala|borderStyle=solid}

import kafka.serializer.Decoder
class UserLocationEventDecoder extends Decoder[UserLocationEvent] {

  val kryo = new Kryo()

  override def fromBytes(bytes: Array[Byte]): UserLocationEvent = {
val input: Input = new Input(new ByteArrayInputStream(bytes))
val userLocationEvent: UserLocationEvent = 
kryo.readClassAndObject(input).asInstanceOf[UserLocationEvent]
input.close()
return userLocationEvent
  }
}
{code}

{code:title=build.sbt|borderStyle=solid}
scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.2.1"

libraryDependencies += "com.spatial4j" % "spatial4j" % "0.4.1"

libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.2.1"

libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.10" % 
"1.2.1"

libraryDependencies += "com.twitter" % "chill_2.10" % "0.5.2"
{code}
The input stream (and my code overall) works fine if initialized with the 
kafka.serializer.DefaultDecoder, and content is manually deserialized. 

  was:
When using a Kafka input stream, and setting a custom Kafka Decoder, Spark 
throws an exception upon star

[jira] [Updated] (SPARK-5953) NoSuchMethodException with a Kafka input stream and custom decoder in Scala

2015-02-23 Thread Aleksandar Stojadinovic (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksandar Stojadinovic updated SPARK-5953:
---
Description: 
When using a Kafka input stream, and setting a custom Kafka Decoder, Spark 
throws an exception upon starting:
{noformat}
ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting 
receiver 0 - java.lang.NoSuchMethodException: 
UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
at java.lang.Class.getConstructor0(Class.java:2971)
at java.lang.Class.getConstructor(Class.java:1812)
at 
org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

15/02/23 21:37:31 ERROR ReceiverSupervisorImpl: Stopped executor with error: 
java.lang.NoSuchMethodException: 
UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
15/02/23 21:37:31 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NoSuchMethodException: 
UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
at java.lang.Class.getConstructor0(Class.java:2971)
at java.lang.Class.getConstructor(Class.java:1812)
at 
org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}

The stream is initialized with:

{code:title=Main.scala|borderStyle=solid}
 val locationsAndKeys = KafkaUtils.createStream[String, Array[Byte], 
kafka.serializer.StringDecoder, UserLocationEventDecoder] (ssc, kafkaParams, 
topicMap, StorageLevel.MEMORY_AND_DISK);
{code}

The decoder:
{code:title=UserLocationEventDecoder.scala|borderStyle=solid}

import kafka.serializer.Decoder
class UserLocationEventDecoder extends Decoder[UserLocationEvent] {

  val kryo = new Kryo()

  override def fromBytes(bytes: Array[Byte]): UserLocationEvent = {
val input: Input = new Input(new ByteArrayInputStream(bytes))
val userLocationEvent: UserLocationEvent = 
kryo.readClassAndObject(input).asInstanceOf[UserLocationEvent]
input.close()
return userLocationEvent
  }
}
{code}

build.sbt:
{code:borderStyle=solid}
scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.2.1"

libraryDependencies += "com.spatial4j" % "spatial4j" % "0.4.1"

libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.2.1"

libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.10" % 
"1.2.1"

libraryDependencies += "com.twitter" % "chill_2.10" % "0.5.2"
{code}
The input stream (and my code overall) works fine if initialized with the 
kafka.serializer.DefaultDecoder, and content is manually deserialized. 

  was:
When using a Kafka input stream, and setting a custom Kafka Decoder, Spark 
throws an exception upon star

[jira] [Commented] (SPARK-5953) NoSuchMethodException with a Kafka input stream and custom decoder in Scala

2015-02-23 Thread Aleksandar Stojadinovic (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333808#comment-14333808
 ] 

Aleksandar Stojadinovic commented on SPARK-5953:


The decoder? Yes, it's a part of the application.

> NoSuchMethodException with a Kafka input stream and custom decoder in Scala
> ---
>
> Key: SPARK-5953
> URL: https://issues.apache.org/jira/browse/SPARK-5953
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 1.2.0, 1.2.1
> Environment: Xubuntu 14.04, Kafka 0.8.2, Scala 2.10.4, Scala 2.11.5
>Reporter: Aleksandar Stojadinovic
>
> When using a Kafka input stream, and setting a custom Kafka Decoder, Spark 
> throws an exception upon starting:
> {noformat}
> ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting 
> receiver 0 - java.lang.NoSuchMethodException: 
> UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
>   at java.lang.Class.getConstructor0(Class.java:2971)
>   at java.lang.Class.getConstructor(Class.java:1812)
>   at 
> org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 15/02/23 21:37:31 ERROR ReceiverSupervisorImpl: Stopped executor with error: 
> java.lang.NoSuchMethodException: 
> UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
> 15/02/23 21:37:31 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NoSuchMethodException: 
> UserLocationEventDecoder.(kafka.utils.VerifiableProperties)
>   at java.lang.Class.getConstructor0(Class.java:2971)
>   at java.lang.Class.getConstructor(Class.java:1812)
>   at 
> org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The stream is initialized with:
> {code:title=Main.scala|borderStyle=solid}
>  val locationsAndKeys = KafkaUtils.createStream[String, Array[Byte], 
> kafka.serializer.StringDecoder, UserLocationEventDecoder] (ssc, kafkaParams, 
> topicMap, StorageLevel.MEMORY_AND_DISK);
> {code}
> The decoder:
> {code:title=UserLocationEventDecoder.scala|borderStyle=solid}
> import kafka.serializer.Decoder
> class UserLocationEventDecoder extends Decoder[UserLocationEvent] {
>   val kryo = new Kryo()
>   override def fromBytes(bytes: Array[Byte]): UserLocationEvent = {
> val input: Input = new Input(new ByteArrayInputStream(bytes))
> val userLocationEvent: UserLocationEvent = 
> kryo.readClassAndObject(input).asInstanceOf[UserLocationEvent]
> input.close()
> return userLocation

[jira] [Commented] (SPARK-5953) NoSuchMethodException with a Kafka input stream and custom decoder in Scala

2015-02-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333818#comment-14333818
 ] 

Sean Owen commented on SPARK-5953:
--

YARN or standalone?
Did you look into spark.yarn.user.classpath.first and/or 
spark.files.userClassPathFirst ? If it's a classloader visibility thing these 
could be the right way to use your decoder.
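
For reference, whichever classpath route is taken, the stack trace shows 
KafkaReceiver looking up a constructor that takes 
kafka.utils.VerifiableProperties, so the decoder class itself also needs such a 
constructor. A minimal sketch, assuming Kryo serialization as in the ticket; 
UserLocationEvent is the reporter's own class and the placeholder definition 
below is purely illustrative:

{code}
import java.io.ByteArrayInputStream

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Input
import kafka.serializer.Decoder
import kafka.utils.VerifiableProperties

// Placeholder for the reporter's event class; the real fields are unknown here.
case class UserLocationEvent(userId: String, lat: Double, lon: Double)

// Kafka instantiates decoders reflectively through a (VerifiableProperties)
// constructor, which the reported class lacks; adding one avoids the
// NoSuchMethodException.
class UserLocationEventDecoder(props: VerifiableProperties = null)
    extends Decoder[UserLocationEvent] {
  val kryo = new Kryo()
  override def fromBytes(bytes: Array[Byte]): UserLocationEvent = {
    val input = new Input(new ByteArrayInputStream(bytes))
    val event = kryo.readClassAndObject(input).asInstanceOf[UserLocationEvent]
    input.close()
    event
  }
}
{code}

With such a constructor in place, KafkaUtils.createStream can build the decoder 
reflectively regardless of which classloader loads it.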

> NoSuchMethodException with a Kafka input stream and custom decoder in Scala
> ---
>
> Key: SPARK-5953
> URL: https://issues.apache.org/jira/browse/SPARK-5953
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 1.2.0, 1.2.1
> Environment: Xubuntu 14.04, Kafka 0.8.2, Scala 2.10.4, Scala 2.11.5
>Reporter: Aleksandar Stojadinovic
>
> When using a Kafka input stream, and setting a custom Kafka Decoder, Spark 
> throws an exception upon starting:
> {noformat}
> ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting 
> receiver 0 - java.lang.NoSuchMethodException: 
> UserLocationEventDecoder.<init>(kafka.utils.VerifiableProperties)
>   at java.lang.Class.getConstructor0(Class.java:2971)
>   at java.lang.Class.getConstructor(Class.java:1812)
>   at 
> org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 15/02/23 21:37:31 ERROR ReceiverSupervisorImpl: Stopped executor with error: 
> java.lang.NoSuchMethodException: 
> UserLocationEventDecoder.<init>(kafka.utils.VerifiableProperties)
> 15/02/23 21:37:31 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NoSuchMethodException: 
> UserLocationEventDecoder.<init>(kafka.utils.VerifiableProperties)
>   at java.lang.Class.getConstructor0(Class.java:2971)
>   at java.lang.Class.getConstructor(Class.java:1812)
>   at 
> org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The stream is initialized with:
> {code:title=Main.scala|borderStyle=solid}
>  val locationsAndKeys = KafkaUtils.createStream[String, Array[Byte], 
> kafka.serializer.StringDecoder, UserLocationEventDecoder] (ssc, kafkaParams, 
> topicMap, StorageLevel.MEMORY_AND_DISK);
> {code}
> The decoder:
> {code:title=UserLocationEventDecoder.scala|borderStyle=solid}
> import kafka.serializer.Decoder
> class UserLocationEventDecoder extends Decoder[UserLocationEvent] {
>   val kryo = new Kryo()
>   override def fromBytes(bytes: Array[Byte]): UserLocationEvent = {
> val input: Input = new Input(new ByteArrayInputStream(bytes))
> val userLocationEvent: UserLoca

[jira] [Commented] (SPARK-5953) NoSuchMethodException with a Kafka input stream and custom decoder in Scala

2015-02-23 Thread Aleksandar Stojadinovic (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333828#comment-14333828
 ] 

Aleksandar Stojadinovic commented on SPARK-5953:


Standalone, running locally from my IDE (IntelliJ IDEA 14.0.3 with updated 
plugins). This is just a test, and one of my first real Spark apps, so there is 
always a possibility that I'm doing something wrong, but I'm certainly not using 
YARN :-).

> NoSuchMethodException with a Kafka input stream and custom decoder in Scala
> ---
>
> Key: SPARK-5953
> URL: https://issues.apache.org/jira/browse/SPARK-5953
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 1.2.0, 1.2.1
> Environment: Xubuntu 14.04, Kafka 0.8.2, Scala 2.10.4, Scala 2.11.5
>Reporter: Aleksandar Stojadinovic
>
> When using a Kafka input stream, and setting a custom Kafka Decoder, Spark 
> throws an exception upon starting:
> {noformat}
> ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting 
> receiver 0 - java.lang.NoSuchMethodException: 
> UserLocationEventDecoder.<init>(kafka.utils.VerifiableProperties)
>   at java.lang.Class.getConstructor0(Class.java:2971)
>   at java.lang.Class.getConstructor(Class.java:1812)
>   at 
> org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 15/02/23 21:37:31 ERROR ReceiverSupervisorImpl: Stopped executor with error: 
> java.lang.NoSuchMethodException: 
> UserLocationEventDecoder.<init>(kafka.utils.VerifiableProperties)
> 15/02/23 21:37:31 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NoSuchMethodException: 
> UserLocationEventDecoder.<init>(kafka.utils.VerifiableProperties)
>   at java.lang.Class.getConstructor0(Class.java:2971)
>   at java.lang.Class.getConstructor(Class.java:1812)
>   at 
> org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:277)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$8.apply(ReceiverTracker.scala:269)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The stream is initialized with:
> {code:title=Main.scala|borderStyle=solid}
>  val locationsAndKeys = KafkaUtils.createStream[String, Array[Byte], 
> kafka.serializer.StringDecoder, UserLocationEventDecoder] (ssc, kafkaParams, 
> topicMap, StorageLevel.MEMORY_AND_DISK);
> {code}
> The decoder:
> {code:title=UserLocationEventDecoder.scala|borderStyle=solid}
> import kafka.serializer.Decoder
> class UserLocationEventDecoder extends Decoder[UserLocationEvent] {
>   val kryo = new Kryo()
>   override def fromBytes(bytes: Array[Byte]): UserLocationEvent = {
> val input: Input = new Input(new

[jira] [Commented] (SPARK-4144) Support incremental model training of Naive Bayes classifier

2015-02-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333830#comment-14333830
 ] 

Xiangrui Meng commented on SPARK-4144:
--

[~freeman-lab] I've assigned this ticket to you. I believe [~liquanpei] is 
quite busy now.

> Support incremental model training of Naive Bayes classifier
> 
>
> Key: SPARK-4144
> URL: https://issues.apache.org/jira/browse/SPARK-4144
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Streaming
>Reporter: Chris Fregly
>Assignee: Jeremy Freeman
>
> Per Xiangrui Meng from the following user list discussion:  
> http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E
>
> "For Naive Bayes, we need to update the priors and conditional
> probabilities, which means we should also remember the number of
> observations for the updates."
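
For reference, a minimal sketch of the sufficient statistics such an incremental 
update would maintain (per-class observation counts and per-class feature sums); 
the names and representation are illustrative only, not the eventual MLlib API:

{code}
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Sufficient statistics for multinomial Naive Bayes: per-class observation
// counts and per-class feature sums. Merging a new batch into the running
// totals is all an incremental update needs.
case class NBStats(counts: Map[Double, Long], featureSums: Map[Double, Array[Double]])

def batchStats(batch: RDD[LabeledPoint]): NBStats = {
  val agg = batch
    .map(lp => (lp.label, (1L, lp.features.toArray)))
    .reduceByKey { case ((n1, s1), (n2, s2)) =>
      (n1 + n2, s1.zip(s2).map { case (a, b) => a + b })
    }
    .collect()
  NBStats(agg.map { case (l, (n, _)) => l -> n }.toMap,
          agg.map { case (l, (_, s)) => l -> s }.toMap)
}

def merge(a: NBStats, b: NBStats): NBStats = NBStats(
  a.counts ++ b.counts.map { case (l, n) => l -> (n + a.counts.getOrElse(l, 0L)) },
  a.featureSums ++ b.featureSums.map { case (l, s) =>
    l -> a.featureSums.get(l).map(_.zip(s).map { case (p, q) => p + q }).getOrElse(s)
  }
)
{code}

Priors and conditional probabilities can then be recomputed from the merged 
statistics after every batch.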



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4144) Support incremental model training of Naive Bayes classifier

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4144:
-
Assignee: Jeremy Freeman  (was: Liquan Pei)

> Support incremental model training of Naive Bayes classifier
> 
>
> Key: SPARK-4144
> URL: https://issues.apache.org/jira/browse/SPARK-4144
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Streaming
>Reporter: Chris Fregly
>Assignee: Jeremy Freeman
>
> Per Xiangrui Meng from the following user list discussion:  
> http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E
>
> "For Naive Bayes, we need to update the priors and conditional
> probabilities, which means we should also remember the number of
> observations for the updates."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4144) Support incremental model training of Naive Bayes classifier

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4144:
-
Target Version/s: 1.4.0

> Support incremental model training of Naive Bayes classifier
> 
>
> Key: SPARK-4144
> URL: https://issues.apache.org/jira/browse/SPARK-4144
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Streaming
>Reporter: Chris Fregly
>Assignee: Jeremy Freeman
>
> Per Xiangrui Meng from the following user list discussion:  
> http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E
>
> "For Naive Bayes, we need to update the priors and conditional
> probabilities, which means we should also remember the number of
> observations for the updates."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5912) Programming guide for feature selection

2015-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333832#comment-14333832
 ] 

Apache Spark commented on SPARK-5912:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/4732

> Programming guide for feature selection
> ---
>
> Key: SPARK-5912
> URL: https://issues.apache.org/jira/browse/SPARK-5912
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Alexander Ulanov
> Fix For: 1.3.0
>
>
> The new ChiSqSelector for feature selection should have a section in the 
> Programming Guide.  It should probably be under the feature extraction and 
> transformation section as a new subsection for feature selection.
> If we get more feature selection methods later on, we could expand it to a 
> larger section of the guide.
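
For reference, a usage sketch of the kind the new subsection would cover, 
written against the 1.3 mllib.feature.ChiSqSelector API:

{code}
import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Select the 50 features with the highest chi-squared statistic; ChiSqSelector
// expects categorical (e.g. binned) feature values.
def selectTopFeatures(data: RDD[LabeledPoint]): RDD[LabeledPoint] = {
  val selector = new ChiSqSelector(50)
  val model = selector.fit(data)
  data.map(lp => LabeledPoint(lp.label, model.transform(lp.features)))
}
{code}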



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5954) Add topByKey to pair RDDs

2015-02-23 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5954:


 Summary: Add topByKey to pair RDDs
 Key: SPARK-5954
 URL: https://issues.apache.org/jira/browse/SPARK-5954
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Xiangrui Meng


`topByKey(num: Int): RDD[(K, V)]` finds the top-k values for each key in a pair 
RDD. This is used, e.g., in computing top recommendations. We can use the Guava 
implementation of finding top-k from an iterator. See also 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Utils.scala.
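
A rough sketch of the proposed semantics, using aggregateByKey with a plain 
sort-and-take in place of the Guava bounded priority queue; the name, bounds and 
return type here are assumptions, not the final API:

{code}
import scala.reflect.ClassTag

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Keep only the `num` largest values per key. A real implementation would use a
// bounded priority queue instead of sort-and-take, but the aggregation shape is
// the same.
def topByKey[K, V](rdd: RDD[(K, V)], num: Int)
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[V]): RDD[(K, Array[V])] =
  rdd.aggregateByKey(List.empty[V])(
    (acc, v) => (v :: acc).sorted(ord.reverse).take(num),
    (a, b)   => (a ++ b).sorted(ord.reverse).take(num)
  ).mapValues(_.toArray)
{code}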



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5922) Add diff(other: RDD[VertexId, VD]) in VertexRDD

2015-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333875#comment-14333875
 ] 

Apache Spark commented on SPARK-5922:
-

User 'brennonyork' has created a pull request for this issue:
https://github.com/apache/spark/pull/4733

> Add diff(other: RDD[VertexId, VD]) in VertexRDD
> ---
>
> Key: SPARK-5922
> URL: https://issues.apache.org/jira/browse/SPARK-5922
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Takeshi Yamamuro
>Priority: Trivial
>
> Add diff(other: RDD[VertexId, VD]) in VertexRDD and this api is the same with
> VertexRDD#leftJoin and VertexRDD#innerJoin.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3355) Allow running maven tests in run-tests

2015-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333887#comment-14333887
 ] 

Apache Spark commented on SPARK-3355:
-

User 'brennonyork' has created a pull request for this issue:
https://github.com/apache/spark/pull/4734

> Allow running maven tests in run-tests
> --
>
> Key: SPARK-3355
> URL: https://issues.apache.org/jira/browse/SPARK-3355
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Patrick Wendell
>
> We should have a variable called AMPLAB_JENKINS_BUILD_TOOL that decides 
> whether to run sbt or maven.
> This would allow us to simplify our build matrix in Jenkins... currently the 
> maven builds run a totally different thing than the normal run-tests builds.
> The maven build currently does something like this:
> {code}
> mvn -DskipTests -Pprofile1 -Pprofile2 ... clean package
> mvn test -Pprofile1 -Pprofile2 ... --fail-at-end
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5955) Add checkpointInterval to ALS

2015-02-23 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5955:


 Summary: Add checkpointInterval to ALS
 Key: SPARK-5955
 URL: https://issues.apache.org/jira/browse/SPARK-5955
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Xiangrui Meng


We should add checkpoint interval to ALS to prevent the following:

1. storing large shuffle files
2. stack overflow (SPARK-1106)
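
For reference, a sketch of the pattern a checkpointInterval enables in an 
iterative job; `update` stands in for one ALS iteration and nothing here is 
taken from the actual ALS implementation:

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Periodically checkpointing the intermediate RDD truncates its lineage, which
// bounds both the accumulated shuffle files and the recursion depth at the driver.
def iterate[T](sc: SparkContext, initial: RDD[T], numIterations: Int,
               checkpointInterval: Int)(update: RDD[T] => RDD[T]): RDD[T] = {
  sc.setCheckpointDir("/tmp/als-checkpoints")   // assumed location
  var current = initial
  for (iter <- 1 to numIterations) {
    current = update(current).persist()
    if (iter % checkpointInterval == 0) {
      current.checkpoint()   // cut the lineage chain here
      current.count()        // materialize so the checkpoint is actually written
    }
  }
  current
}
{code}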



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-1006) MLlib ALS gets stack overflow with too many iterations

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-1006.

Resolution: Duplicate

> MLlib ALS gets stack overflow with too many iterations
> --
>
> Key: SPARK-1006
> URL: https://issues.apache.org/jira/browse/SPARK-1006
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Matei Zaharia
>
> The tipping point seems to be around 50. We should fix this by checkpointing 
> the RDDs every 10-20 iterations to break the lineage chain, but checkpointing 
> currently requires HDFS installed, which not all users will have.
> We might also be able to fix DAGScheduler to not be recursive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-3080.

Resolution: Fixed

I'm closing this issue since the only way that I can reproduce this bug is to 
have some non-deterministic factor in the input RDD, which violates the 
determinism assumption of RDDs. Feel free to re-open it if anyone has a way to 
reproduce this bug deterministically.
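
For reference, a hedged illustration of the kind of non-determinism that can 
trigger this: a seedless sample is re-evaluated when a partition is recomputed 
after a failure, so different stages of ALS can see different rows. The helper 
names and the `ratings` input are placeholders:

{code}
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

// Non-deterministic lineage: each recomputation of a partition may pick
// different rows, violating the assumption that an RDD is deterministic.
def unstableInput(ratings: RDD[Rating]): RDD[Rating] =
  ratings.sample(withReplacement = false, fraction = 0.5)

// Deterministic alternatives: fix the seed and/or materialize before training.
def stableInput(ratings: RDD[Rating]): RDD[Rating] =
  ratings.sample(withReplacement = false, fraction = 0.5, seed = 17L).cache()
{code}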

> ArrayIndexOutOfBoundsException in ALS for Large datasets
> 
>
> Key: SPARK-3080
> URL: https://issues.apache.org/jira/browse/SPARK-3080
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Burak Yavuz
>Assignee: Xiangrui Meng
>
> The stack trace is below:
> {quote}
> java.lang.ArrayIndexOutOfBoundsException: 2716
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> 
> org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> {quote}
> This happened after the dataset was sub-sampled. 
> Dataset properties: ~12B ratings
> Setup: 55 r3.8xlarge ec2 instances



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-3403.

Resolution: Not a Problem

[~avulanov] Did you try OpenBLAS 0.2.12, as suggested by xianyi on 
https://github.com/xianyi/OpenBLAS/issues/452? I'm closing this JIRA since this 
is an upstream bug.

> NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
> -
>
> Key: SPARK-3403
> URL: https://issues.apache.org/jira/browse/SPARK-3403
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
> Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
> described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
> MinGW64 precompiled dlls.
>Reporter: Alexander Ulanov
> Attachments: NativeNN.scala
>
>
> Code:
> val model = NaiveBayes.train(train)
> val predictionAndLabels = test.map { point =>
>   val score = model.predict(point.features)
>   (score, point.label)
> }
> predictionAndLabels.foreach(println)
> Result: 
> program crashes with: "Process finished with exit code -1073741819 
> (0xC0000005)" after displaying the first prediction



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reopened SPARK-3080:
--

> ArrayIndexOutOfBoundsException in ALS for Large datasets
> 
>
> Key: SPARK-3080
> URL: https://issues.apache.org/jira/browse/SPARK-3080
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Burak Yavuz
>Assignee: Xiangrui Meng
>
> The stack trace is below:
> {quote}
> java.lang.ArrayIndexOutOfBoundsException: 2716
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> 
> org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> {quote}
> This happened after the dataset was sub-sampled. 
> Dataset properties: ~12B ratings
> Setup: 55 r3.8xlarge ec2 instances



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-3080.

Resolution: Cannot Reproduce

> ArrayIndexOutOfBoundsException in ALS for Large datasets
> 
>
> Key: SPARK-3080
> URL: https://issues.apache.org/jira/browse/SPARK-3080
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Burak Yavuz
>Assignee: Xiangrui Meng
>
> The stack trace is below:
> {quote}
> java.lang.ArrayIndexOutOfBoundsException: 2716
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> 
> org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> {quote}
> This happened after the dataset was sub-sampled. 
> Dataset properties: ~12B ratings
> Setup: 55 r3.8xlarge ec2 instances



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3436) Streaming SVM

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3436:
-
Assignee: (was: Liquan Pei)

> Streaming SVM 
> --
>
> Key: SPARK-3436
> URL: https://issues.apache.org/jira/browse/SPARK-3436
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Liquan Pei
>
> Implement online learning with kernels according to 
> http://users.cecs.anu.edu.au/~williams/papers/P172.pdf
> The algorithms proposed in the above paper are implemented in R 
> (http://users.cecs.anu.edu.au/~williams/papers/P172.pdf) and MADlib 
> (http://doc.madlib.net/latest/group__grp__kernmach.html)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5541) Allow running Maven or SBT in run-tests

2015-02-23 Thread Brennon York (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333991#comment-14333991
 ] 

Brennon York commented on SPARK-5541:
-

Just pushed up a PR for 
[SPARK-3355|https://issues.apache.org/jira/browse/SPARK-3355], but could easily 
change the env. var. to SPARK_BUILD_TOOL (rather than 
AMPLAB_JENKINS_BUILD_TOOL) if that would be the now-preferred route.

> Allow running Maven or SBT in run-tests
> ---
>
> Key: SPARK-5541
> URL: https://issues.apache.org/jira/browse/SPARK-5541
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Patrick Wendell
>Assignee: Nicholas Chammas
>
> It would be nice if we had a hook for the spark test scripts to run with 
> Maven in addition to running with SBT. Right now it is difficult for us to 
> test pull requests in maven and we get master build breaks because of it. A 
> simple first step is to modify run-tests to allow building with maven. Then 
> we can add a second PRB that invokes this maven build. I would just add an 
> env var called SPARK_BUILD_TOOL that can be set to "sbt" or "mvn". And make 
> sure the associated logic works in either case. If we don't want to have the 
> fancy "SQL" only stuff in Maven, that's fine too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3435) Distributed matrix multiplication

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-3435.

Resolution: Duplicate

I'm closing this JIRA because it is hard to control data locality. We 
implemented a basic block matrix multiplication in SPARK-3975.

> Distributed matrix multiplication
> -
>
> Key: SPARK-3435
> URL: https://issues.apache.org/jira/browse/SPARK-3435
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>
> This JIRA is for discussing how to implement distributed matrix 
> multiplication efficiently. It would be nice if we can utilize 
> communication-avoiding algorithms (http://www.cs.berkeley.edu/~odedsc/CS294/).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3436) Streaming SVM

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-3436.

Resolution: Duplicate

> Streaming SVM 
> --
>
> Key: SPARK-3436
> URL: https://issues.apache.org/jira/browse/SPARK-3436
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Liquan Pei
>
> Implement online learning with kernels according to 
> http://users.cecs.anu.edu.au/~williams/papers/P172.pdf
> The algorithms proposed in the above paper are implemented in R 
> (http://users.cecs.anu.edu.au/~williams/papers/P172.pdf) and MADlib 
> (http://doc.madlib.net/latest/group__grp__kernmach.html)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5673) Implement Streaming wrapper for all linear methods

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5673:
-
Assignee: Kirill A. Korinskiy

> Implement Streaming wrapper for all linear methods
> -
>
> Key: SPARK-5673
> URL: https://issues.apache.org/jira/browse/SPARK-5673
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Kirill A. Korinskiy
>Assignee: Kirill A. Korinskiy
>
> Spark currently has streaming wrappers for logistic and linear regression only.
> Implementing wrappers for SVM, Lasso and ridge regression would make the 
> streaming API more broadly useful.
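
For reference, the existing streaming linear regression wrapper shows the 
pattern the new wrappers would follow; a sketch against the 1.2-era API:

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.dstream.DStream

// Train on each incoming batch and score a second stream; a streaming SVM,
// Lasso or ridge wrapper would expose the same trainOn/predictOn surface.
def streamingRegression(train: DStream[LabeledPoint],
                        test: DStream[Vector],
                        numFeatures: Int): DStream[Double] = {
  val model = new StreamingLinearRegressionWithSGD()
    .setInitialWeights(Vectors.zeros(numFeatures))
    .setStepSize(0.1)
  model.trainOn(train)
  model.predictOn(test)
}
{code}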



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4039) KMeans support sparse cluster centers

2015-02-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333997#comment-14333997
 ] 

Xiangrui Meng commented on SPARK-4039:
--

I changed the JIRA title to be more descriptive of this issue.

> KMeans support sparse cluster centers
> -
>
> Key: SPARK-4039
> URL: https://issues.apache.org/jira/browse/SPARK-4039
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Antoine Amend
>
> When the number of features is not known, it might be quite helpful to create 
> sparse vectors using HashingTF.transform. KMeans transforms centers vectors 
> to dense vectors 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L307),
>  therefore leading to OutOfMemory (even with small k).
> Any way to keep vectors sparse ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4039) KMeans support sparse cluster centers

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4039:
-
Summary: KMeans support sparse cluster centers  (was: KMeans support 
HashingTF vectors)

> KMeans support sparse cluster centers
> -
>
> Key: SPARK-4039
> URL: https://issues.apache.org/jira/browse/SPARK-4039
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Antoine Amend
>
> When the number of features is not known, it might be quite helpful to create 
> sparse vectors using HashingTF.transform. KMeans transforms centers vectors 
> to dense vectors 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L307),
>  therefore leading to OutOfMemory (even with small k).
> Any way to keep vectors sparse ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4284) BinaryClassificationMetrics precision-recall method names should correspond to return types

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-4284.

Resolution: Won't Fix

I'm closing this JIRA per discussion on the Github PR page.

> BinaryClassificationMetrics precision-recall method names should correspond 
> to return types
> ---
>
> Key: SPARK-4284
> URL: https://issues.apache.org/jira/browse/SPARK-4284
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>Assignee: Kai Sasaki
>Priority: Minor
>
> BinaryClassificationMetrics has several methods which work with (recall, 
> precision) pairs, but the method names all use the wrong order ("pr").  This 
> order should be fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4510) Add k-medoids Partitioning Around Medoids (PAM) algorithm

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4510:
-
Labels: clustering features  (was: features)

> Add k-medoids Partitioning Around Medoids (PAM) algorithm
> -
>
> Key: SPARK-4510
> URL: https://issues.apache.org/jira/browse/SPARK-4510
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Fan Jiang
>Assignee: Fan Jiang
>  Labels: clustering, features
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> PAM (k-medoids) is more robust to noise and outliers than k-means because it 
> minimizes a sum of pairwise dissimilarities instead of a sum of squared 
> Euclidean distances. A medoid is the object of a cluster whose average 
> dissimilarity to all the other objects in the cluster is minimal, i.e. the 
> most centrally located point in the cluster.
> The most common realisation of k-medoids clustering is the Partitioning Around 
> Medoids (PAM) algorithm:
> 1. Initialize: randomly select (without replacement) k of the n data points as 
> the medoids.
> 2. Associate each data point with the closest medoid ("closest" is defined by 
> any valid distance metric, most commonly Euclidean, Manhattan or Minkowski 
> distance).
> 3. For each medoid m and each non-medoid data point o, swap m and o and 
> compute the total cost of the configuration.
> 4. Select the configuration with the lowest cost.
> 5. Repeat steps 2 to 4 until the medoids no longer change.
> The new feature for MLlib will contain 5 new files
> /main/scala/org/apache/spark/mllib/clustering/PAM.scala
> /main/scala/org/apache/spark/mllib/clustering/PAMModel.scala
> /main/scala/org/apache/spark/mllib/clustering/LocalPAM.scala
> /test/scala/org/apache/spark/mllib/clustering/PAMSuite.scala
> /main/scala/org/apache/spark/examples/mllib/KMedoids.scala
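
For reference, the core primitive in the quoted algorithm is the total cost of a 
medoid configuration; a hedged sketch with Euclidean distance over dense 
vectors, illustrative only and not taken from the proposed PAM.scala:

{code}
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Dissimilarity between two points (Euclidean here; PAM accepts any metric).
def euclidean(a: Vector, b: Vector): Double = {
  val (x, y) = (a.toArray, b.toArray)
  math.sqrt(x.zip(y).map { case (xi, yi) => (xi - yi) * (xi - yi) }.sum)
}

// Cost of a configuration: the sum, over all points, of the dissimilarity to
// the nearest medoid. Step 3 of the algorithm evaluates this after each swap.
def configurationCost(points: RDD[Vector], medoids: Array[Vector]): Double =
  points.map(p => medoids.map(m => euclidean(p, m)).min).sum()
{code}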



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4956) Vector Initialization error when initialize a Sparse Vector by calling Vectors.sparse(size, indices, values)

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-4956.

Resolution: Won't Fix

I'm closing this JIRA per discussion on the PR page.

> Vector Initialization error when initialize a Sparse Vector by calling 
> Vectors.sparse(size, indices, values)
> 
>
> Key: SPARK-4956
> URL: https://issues.apache.org/jira/browse/SPARK-4956
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: liaoyuxi
>Priority: Minor
>  Labels: patch
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> When I initialize a sparse vector by calling Vectors.sparse(size, indices, 
> values), the vector will be all zeros if the indices are not sorted, without 
> any error or warning. 
> A simple statement that sorts the indices together with their values would fix 
> this bug
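
For reference, a hedged illustration of the reported pitfall and a caller-side 
workaround (sort the index/value pairs before constructing the vector):

{code}
import org.apache.spark.mllib.linalg.Vectors

// Indices given out of order: the ticket reports that this silently yields a
// vector of zeros instead of failing.
val indices = Array(2, 0)
val values  = Array(5.0, 1.0)

// Workaround: sort the (index, value) pairs by index first.
val sorted = indices.zip(values).sortBy(_._1)
val v = Vectors.sparse(3, sorted.map(_._1), sorted.map(_._2))
{code}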



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5010) native openblas library doesn't work: undefined symbol: cblas_dscal

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-5010.

Resolution: Not a Problem

I'm closing this JIRA because it is an upstream issue with the native BLAS library.

> native openblas library doesn't work: undefined symbol: cblas_dscal
> ---
>
> Key: SPARK-5010
> URL: https://issues.apache.org/jira/browse/SPARK-5010
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
> Environment: standalone
>Reporter: Tomas Hudik
>Priority: Minor
>  Labels: mllib, openblas
>
> 1. compiled and installed open blas library
> 2. ln -s libopenblas_sandybridgep-r0.2.13.so /usr/lib/libblas.so.3
> 3. compiled and built spark:
> mvn -Pnetlib-lgpl -DskipTests clean compile package
> 4. run: bin/run-example  mllib.LinearRegression 
> data/mllib/sample_libsvm_data.txt
> 14/12/30 18:39:57 INFO BlockManagerMaster: Trying to register BlockManager
> 14/12/30 18:39:57 INFO BlockManagerMasterActor: Registering block manager 
> localhost:34297 with 265.1 MB RAM, BlockManagerId(, localhost, 34297)
> 14/12/30 18:39:57 INFO BlockManagerMaster: Registered BlockManager
> 14/12/30 18:39:58 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 14/12/30 18:39:58 WARN LoadSnappy: Snappy native library not loaded
> Training: 80, test: 20.
> /usr/local/lib/jdk1.8.0//bin/java: symbol lookup error: 
> /tmp/jniloader1826801168744171087netlib-native_system-linux-x86_64.so: 
> undefined symbol: cblas_dscal
> I followed guide: https://spark.apache.org/docs/latest/mllib-guide.html 
> section dependencies.
> Am I missing something?
> How to force Spark to use openblas library?
> Thanks, Tomas



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5940) Graph Loader: refactor + add more formats

2015-02-23 Thread Magellanea (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334013#comment-14334013
 ] 

Magellanea commented on SPARK-5940:
---

[~lukovnikov] Thanks a lot for the reply. Do you know whom I can mention/tag in 
this discussion from the Spark/GraphX team?

Thanks

> Graph Loader: refactor + add more formats
> -
>
> Key: SPARK-5940
> URL: https://issues.apache.org/jira/browse/SPARK-5940
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: lukovnikov
>Priority: Minor
>
> Currently, the only graph loader is GraphLoader.edgeListFile. [SPARK-5280] 
> adds an RDF graph loader.
> However, as Takeshi Yamamuro suggested on the GitHub PR for [SPARK-5280] 
> (https://github.com/apache/spark/pull/4650), it might be interesting to make 
> GraphLoader an interface with several implementations for different formats. 
> It may also be good to add a façade graph loader that provides a unified 
> interface to all loaders.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5226:
-
Labels: DBSCAN clustering  (was: DBSCAN)

> Add DBSCAN Clustering Algorithm to MLlib
> 
>
> Key: SPARK-5226
> URL: https://issues.apache.org/jira/browse/SPARK-5226
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Muhammad-Ali A'rabi
>Priority: Minor
>  Labels: DBSCAN, clustering
>
> MLlib is all k-means now, and I think we should add some new clustering 
> algorithms to it. The first candidate, I think, is DBSCAN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-794) Remove sleep() in ClusterScheduler.stop

2015-02-23 Thread Brennon York (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334020#comment-14334020
 ] 

Brennon York commented on SPARK-794:


[~srowen] [~joshrosen] bump on this. Would assume things are stable with the 
removal of the sleep method, but want to double check. Thinking we can close 
this ticket out.

> Remove sleep() in ClusterScheduler.stop
> ---
>
> Key: SPARK-794
> URL: https://issues.apache.org/jira/browse/SPARK-794
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 0.9.0
>Reporter: Matei Zaharia
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> This temporary change made a while back slows down the unit tests quite a bit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5261) In some cases, the value of a word's vector representation is too big

2015-02-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334021#comment-14334021
 ] 

Xiangrui Meng commented on SPARK-5261:
--

Could you try a larger minCount to reduce the vocabulary size?
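
For reference, if the Word2Vec version in use does not expose a minCount setter, 
pruning rare tokens before fitting has the same vocabulary-shrinking effect; a 
sketch where `corpus` is a placeholder RDD[Seq[String]]:

{code}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Drop words that occur fewer than minCount times, then drop empty sentences.
def pruneRareWords(corpus: RDD[Seq[String]], minCount: Int): RDD[Seq[String]] = {
  val keep = corpus.flatMap(identity)
    .map((_, 1L)).reduceByKey(_ + _)
    .filter(_._2 >= minCount)
    .keys.collect().toSet
  val bcKeep = corpus.sparkContext.broadcast(keep)
  corpus.map(_.filter(bcKeep.value.contains)).filter(_.nonEmpty)
}
{code}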

> In some cases, the value of a word's vector representation is too big
> ---
>
> Key: SPARK-5261
> URL: https://issues.apache.org/jira/browse/SPARK-5261
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Guoqiang Li
>
> {code}
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(36)
> {code}
> The average absolute value of the word's vector representation is 60731.8
> {code}
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(1)
> {code}
> The average  absolute value of the word's vector representation is 0.13889



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5405) Spark clusterer should support high dimensional data

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5405:
-
Labels: clustering  (was: )

> Spark clusterer should support high dimensional data
> 
>
> Key: SPARK-5405
> URL: https://issues.apache.org/jira/browse/SPARK-5405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Derrick Burns
>  Labels: clustering
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> The MLLIB clusterer works well for low  (<200) dimensional data.  However, 
> performance is linear with the number of dimensions.  So, for practical 
> purposes, it is not very useful for high dimensional data.  
> Depending on the data type, one can embed the high dimensional data into 
> lower dimensional spaces in a distance-preserving way.  The Spark clusterer 
> should support such embedding.
> An example implementation that supports high dimensional data is here:
> https://github.com/derrickburns/generalized-kmeans-clustering



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4039) KMeans support sparse cluster centers

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-4039.

Resolution: Duplicate

> KMeans support sparse cluster centers
> -
>
> Key: SPARK-4039
> URL: https://issues.apache.org/jira/browse/SPARK-4039
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Antoine Amend
>
> When the number of features is not known, it might be quite helpful to create 
> sparse vectors using HashingTF.transform. KMeans transforms centers vectors 
> to dense vectors 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L307),
>  therefore leading to OutOfMemory (even with small k).
> Any way to keep vectors sparse ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-4039) KMeans support sparse cluster centers

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reopened SPARK-4039:
--

> KMeans support sparse cluster centers
> -
>
> Key: SPARK-4039
> URL: https://issues.apache.org/jira/browse/SPARK-4039
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Antoine Amend
>
> When the number of features is not known, it might be quite helpful to create 
> sparse vectors using HashingTF.transform. KMeans transforms centers vectors 
> to dense vectors 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L307),
>  therefore leading to OutOfMemory (even with small k).
> Any way to keep vectors sparse ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1182) Sort the configuration parameters in configuration.md

2015-02-23 Thread Brennon York (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334025#comment-14334025
 ] 

Brennon York commented on SPARK-1182:
-

Given [~joshrosen]'s comments on the PR making merge-conflict hell, would it be 
better just to scratch this as an issue and close everything out? It's either 
that or deal with all the merge conflicts for any/all backports moving forward. 
Thoughts?

> Sort the configuration parameters in configuration.md
> -
>
> Key: SPARK-1182
> URL: https://issues.apache.org/jira/browse/SPARK-1182
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Reporter: Reynold Xin
>Assignee: prashant
>Priority: Minor
>
> It is a little bit confusing right now since the config options are all over 
> the place in some arbitrarily sorted order.
> https://github.com/apache/spark/blob/master/docs/configuration.md



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5405) Spark clusterer should support high dimensional data

2015-02-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334027#comment-14334027
 ] 

Xiangrui Meng commented on SPARK-5405:
--

Dimension reduction should be separated from the k-means implementation. We can 
add distance-preserving methods as feature transformers. [~derrickburns] Could 
you update the JIRA title?
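
For reference, one distance-preserving transformer that could be added is a 
Gaussian random projection (Johnson-Lindenstrauss style); a hedged sketch over 
plain arrays, not an existing MLlib API:

{code}
import scala.util.Random

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Project d-dimensional vectors down to k dimensions with a random Gaussian
// matrix scaled by 1/sqrt(k), which approximately preserves pairwise distances.
def randomProject(data: RDD[Vector], d: Int, k: Int, seed: Long = 42L): RDD[Vector] = {
  val rng = new Random(seed)
  val proj = Array.fill(k, d)(rng.nextGaussian() / math.sqrt(k))
  val bcProj = data.sparkContext.broadcast(proj)
  data.map { v =>
    val x = v.toArray
    Vectors.dense(bcProj.value.map(row => row.zip(x).map { case (r, xi) => r * xi }.sum))
  }
}
{code}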

> Spark clusterer should support high dimensional data
> 
>
> Key: SPARK-5405
> URL: https://issues.apache.org/jira/browse/SPARK-5405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Derrick Burns
>  Labels: clustering
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> The MLLIB clusterer works well for low  (<200) dimensional data.  However, 
> performance is linear with the number of dimensions.  So, for practical 
> purposes, it is not very useful for high dimensional data.  
> Depending on the data type, one can embed the high dimensional data into 
> lower dimensional spaces in a distance-preserving way.  The Spark clusterer 
> should support such embedding.
> An example implementation that supports high dimensional data is here:
> https://github.com/derrickburns/generalized-kmeans-clustering



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5490) KMeans costs can be incorrect if tasks need to be rerun

2015-02-23 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334030#comment-14334030
 ] 

Xiangrui Meng commented on SPARK-5490:
--

[~sandyr] This is a bug in core. Could you link the JIRA? 

> KMeans costs can be incorrect if tasks need to be rerun
> ---
>
> Key: SPARK-5490
> URL: https://issues.apache.org/jira/browse/SPARK-5490
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> KMeans uses accumulators inside ShuffleMapTasks to compute the cost of a 
> clustering at each iteration. Each time a ShuffleMapTask completes, it 
> increments the accumulators at the driver. If a task runs twice because of 
> failures, the accumulators are incremented twice, so a task's cost can end 
> up being double-counted.
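
For readers unfamiliar with the failure mode, here is a minimal sketch, not the MLlib code; `sc`, `points`, and `closestCenter` are assumed to exist for illustration.

{code:scala}
// Hypothetical sketch of the failure mode; `points` and `closestCenter` are made up.
val cost = sc.accumulator(0.0)
val assigned = points.map { p =>
  val (center, dist) = closestCenter(p) // returns (centerIndex, distance)
  cost += dist                          // side effect: re-applied if the task attempt is rerun
  (center, p)
}
assigned.count()
// cost.value can now exceed the true cost if any task was re-executed.

// A retry-safe alternative: derive the cost from an action instead of a side effect.
val safeCost = points.map(p => closestCenter(p)._2).sum()
{code}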



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5490) KMeans costs can be incorrect if tasks need to be rerun

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5490:
-
Target Version/s: 1.4.0

> KMeans costs can be incorrect if tasks need to be rerun
> ---
>
> Key: SPARK-5490
> URL: https://issues.apache.org/jira/browse/SPARK-5490
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> KMeans uses accumulators inside ShuffleMapTasks to compute the cost of a 
> clustering at each iteration. Each time a ShuffleMapTask completes, it 
> increments the accumulators at the driver. If a task runs twice because of 
> failures, the accumulators are incremented twice, so a task's cost can end 
> up being double-counted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5832) Add Affinity Propagation clustering algorithm

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5832:
-
Labels: clustering  (was: )

> Add Affinity Propagation clustering algorithm
> -
>
> Key: SPARK-5832
> URL: https://issues.apache.org/jira/browse/SPARK-5832
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>  Labels: clustering
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5927) Modify FPGrowth's partition strategy to reduce transactions in partitions

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-5927.

Resolution: Won't Fix

I'm closing this JIRA per discussion on the PR page.

> Modify FPGrowth's partition strategy to reduce transactions in partitions
> -
>
> Key: SPARK-5927
> URL: https://issues.apache.org/jira/browse/SPARK-5927
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5405) Spark clusterer should support high dimensional data

2015-02-23 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334037#comment-14334037
 ] 

Derrick Burns commented on SPARK-5405:
--

Agreed.




> Spark clusterer should support high dimensional data
> 
>
> Key: SPARK-5405
> URL: https://issues.apache.org/jira/browse/SPARK-5405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Derrick Burns
>  Labels: clustering
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> The MLlib clusterer works well for low-dimensional (<200 features) data. However, 
> performance is linear in the number of dimensions, so for practical purposes it 
> is not very useful for high-dimensional data.
> Depending on the data type, one can embed high-dimensional data into 
> lower-dimensional spaces in a distance-preserving way. The Spark clusterer 
> should support such embeddings.
> An example implementation that supports high-dimensional data is here:
> https://github.com/derrickburns/generalized-kmeans-clustering



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5016:
-
Labels: clustering  (was: )

> GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
> ---
>
> Key: SPARK-5016
> URL: https://issues.apache.org/jira/browse/SPARK-5016
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>  Labels: clustering
>
> If numFeatures or k are large, GMM EM should distribute the matrix inverse 
> computation for Gaussian initialization.
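
A minimal sketch of what distributing the inversions could look like. This is purely illustrative and not the GaussianMixtureEM code; it assumes the k covariance matrices are available on the driver as Breeze matrices.

{code:scala}
// Hypothetical sketch: invert the k covariance matrices on executors instead of the driver.
import breeze.linalg.{inv, DenseMatrix}
import org.apache.spark.SparkContext

def distributedInverses(sc: SparkContext,
                        covs: Seq[DenseMatrix[Double]]): Array[DenseMatrix[Double]] = {
  // One matrix per task; each inversion is O(numFeatures^3), so spreading the work
  // across the cluster helps when numFeatures and k are both large.
  sc.parallelize(covs, covs.size).map(c => inv(c)).collect()
}
{code}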



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5490) KMeans costs can be incorrect if tasks need to be rerun

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5490:
-
Labels: clustering  (was: )

> KMeans costs can be incorrect if tasks need to be rerun
> ---
>
> Key: SPARK-5490
> URL: https://issues.apache.org/jira/browse/SPARK-5490
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>  Labels: clustering
>
> KMeans uses accumulators inside ShuffleMapTasks to compute the cost of a 
> clustering at each iteration. Each time a ShuffleMapTask completes, it 
> increments the accumulators at the driver. If a task runs twice because of 
> failures, the accumulators are incremented twice, so a task's cost can end 
> up being double-counted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5272) Refactor NaiveBayes to support discrete and continuous labels,features

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5272:
-
Labels: clustering  (was: )

> Refactor NaiveBayes to support discrete and continuous labels,features
> --
>
> Key: SPARK-5272
> URL: https://issues.apache.org/jira/browse/SPARK-5272
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>  Labels: clustering
>
> This JIRA is to discuss refactoring NaiveBayes in order to support both 
> discrete and continuous labels and features.
> Currently, NaiveBayes supports only discrete labels and features.
> Proposal: Generalize it to support continuous values as well.
> Some items to discuss are:
> * How commonly are continuous labels/features used in practice?  (Is this 
> necessary?)
> * What should the API look like?
> ** E.g., should NB have multiple classes for each type of label/feature, or 
> should it take a general Factor type parameter?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4039) KMeans support sparse cluster centers

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4039:
-
Labels: clustering  (was: )

> KMeans support sparse cluster centers
> -
>
> Key: SPARK-4039
> URL: https://issues.apache.org/jira/browse/SPARK-4039
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Antoine Amend
>  Labels: clustering
>
> When the number of features is not known in advance, it is quite helpful to create 
> sparse vectors using HashingTF.transform. However, KMeans converts the center 
> vectors to dense vectors 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L307),
>  which leads to OutOfMemory errors (even with a small k).
> Is there any way to keep the vectors sparse?
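
For context, a minimal sketch of the workflow that hits this; `docs` is an assumed RDD of tokenized documents and the dimensions are illustrative only.

{code:scala}
// Hypothetical sketch: hashed sparse features fed to KMeans.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.HashingTF

// docs: RDD[Seq[String]] of tokenized documents (assumed to exist).
val hashingTF = new HashingTF(1 << 20)                          // 2^20-dimensional sparse vectors
val features = docs.map(tokens => hashingTF.transform(tokens))  // sparse feature vectors
// Each cluster center is currently stored densely, i.e. ~2^20 doubles (~8 MB) per
// center, so even a modest k can exhaust memory.
val model = KMeans.train(features, 100, 10)
{code}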



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2429) Hierarchical Implementation of KMeans

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2429:
-
Labels: clustering  (was: )

> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: clustering
> Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib. They are useful for determining relationships between 
> clusters as well as offering faster assignment.
> Discussion on the dev list suggested the following possible approaches:
> * Top-down, recursive application of KMeans
> * Reusing the DecisionTree implementation with a different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean, 
> such as negative dot product or cosine distance, is necessary.
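
A minimal sketch of the first approach listed above (top-down, recursive bisection with the existing KMeans). This is illustrative only: it ignores the tree structure, caching strategy, and split-selection policy a real implementation would need.

{code:scala}
// Hypothetical sketch of divisive clustering: repeatedly split the largest cluster
// into two with KMeans until the target number of leaves is reached.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def bisect(data: RDD[Vector], leaves: Int, maxIter: Int = 20): Seq[RDD[Vector]] = {
  var clusters = Seq(data.cache())
  while (clusters.size < leaves) {
    val largest = clusters.maxBy(_.count())          // naive split policy: largest cluster
    val model = KMeans.train(largest, 2, maxIter)    // bisect with k = 2
    val left  = largest.filter(v => model.predict(v) == 0).cache()
    val right = largest.filter(v => model.predict(v) == 1).cache()
    clusters = clusters.filterNot(_ eq largest) :+ left :+ right
  }
  clusters
}
{code}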



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3439) Add Canopy Clustering Algorithm

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3439:
-
Labels: clustering  (was: )

> Add Canopy Clustering Algorithm
> ---
>
> Key: SPARK-3439
> URL: https://issues.apache.org/jira/browse/SPARK-3439
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Yu Ishikawa
>Assignee: Muhammad-Ali A'rabi
>Priority: Minor
>  Labels: clustering
>
> The canopy clustering algorithm is an unsupervised pre-clustering algorithm. 
> It is often used as a preprocessing step for the K-means algorithm or the 
> Hierarchical clustering algorithm. It is intended to speed up clustering 
> operations on large data sets, where using another algorithm directly may be 
> impractical due to the size of the data set.
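
For reference, a minimal single-machine sketch of the canopy step with loose threshold T1 and tight threshold T2; the distance function, data layout, and names are assumptions made for illustration, not a proposed API.

{code:scala}
// Hypothetical local sketch of canopy clustering (T1 = loose threshold, T2 = tight threshold).
def canopies(points: Seq[Array[Double]], t1: Double, t2: Double): Seq[(Array[Double], Seq[Array[Double]])] = {
  require(t1 > t2, "the loose threshold t1 must exceed the tight threshold t2")
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  var remaining = points
  var result = Seq.empty[(Array[Double], Seq[Array[Double]])]
  while (remaining.nonEmpty) {
    val center = remaining.head                                   // arbitrary canopy center
    val members = points.filter(p => dist(center, p) <= t1)       // loose membership (canopies may overlap)
    remaining = remaining.filterNot(p => dist(center, p) <= t2)   // tightly covered points leave the pool
    result = result :+ ((center, members))
  }
  result
}
{code}

Each canopy's members can then be clustered with K-means (or the canopy centers used as seeds), which is the speed-up the description refers to.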



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3218) K-Means clusterer can fail on degenerate data

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3218:
-
Labels: clustering  (was: )

> K-Means clusterer can fail on degenerate data
> -
>
> Key: SPARK-3218
> URL: https://issues.apache.org/jira/browse/SPARK-3218
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>Assignee: Derrick Burns
>  Labels: clustering
>
> The KMeans parallel implementation selects points to be cluster centers with 
> probability weighted by their distance to cluster centers.  However, if there 
> are fewer than k DISTINCT points in the data set, this approach will fail.  
> Further, the recent checkin to work around this problem results in selection 
> of the same point repeatedly as a cluster center. 
> The fix is to allow fewer than k cluster centers to be selected.  This 
> requires several changes to the code, as the number of cluster centers is 
> woven into the implementation.
> I have a version of the code that addresses this problem, AND generalizes the 
> distance metric.  However, I see that there are literally hundreds of 
> outstanding pull requests.  If someone will commit to working with me to 
> sponsor the pull request, I will create it.
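
One simple guard that illustrates the idea of allowing fewer than k centers: cap k at the number of distinct points before training. This is only a sketch of the symptom-side workaround, not the proposed fix inside the initialization code.

{code:scala}
// Hypothetical guard: never request more centers than there are distinct points.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def trainSafe(data: RDD[Vector], k: Int, maxIter: Int) = {
  // Compare points by value (Seq equality) to count distinct points.
  val distinct = data.map(_.toArray.toSeq).distinct().count()
  val effectiveK = math.min(k.toLong, distinct).toInt
  KMeans.train(data, effectiveK, maxIter)
}
{code}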



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3220) K-Means clusterer should perform K-Means initialization in parallel

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3220:
-
Labels: clustering  (was: )

> K-Means clusterer should perform K-Means initialization in parallel
> ---
>
> Key: SPARK-3220
> URL: https://issues.apache.org/jira/browse/SPARK-3220
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Derrick Burns
>  Labels: clustering
>
> The LocalKMeans method should be replaced with a parallel implementation.  As 
> it stands now, it becomes a bottleneck for large data sets. 
> I have implemented this functionality in my version of the clusterer.  
> However, I see that there are hundreds of outstanding pull requests.  If 
> someone on the team wants to sponsor the pull request, I will create one.  
> Otherwise, I will just maintain my own private fork of the clusterer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3850) Scala style: disallow trailing spaces

2015-02-23 Thread Brennon York (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334038#comment-14334038
 ] 

Brennon York commented on SPARK-3850:
-

This made it into the [master 
branch|https://github.com/apache/spark/blob/master/scalastyle-config.xml#L54] 
at some point already. We can close this issue.

/cc [~srowen] [~pwendell]

> Scala style: disallow trailing spaces
> -
>
> Key: SPARK-3850
> URL: https://issues.apache.org/jira/browse/SPARK-3850
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Nicholas Chammas
>
> [Ted Yu on the dev 
> list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E]
>  suggested using {{WhitespaceEndOfLineChecker}} here: 
> http://www.scalastyle.org/rules-0.1.0.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3219:
-
Labels: clustering  (was: )

> K-Means clusterer should support Bregman distance functions
> ---
>
> Key: SPARK-3219
> URL: https://issues.apache.org/jira/browse/SPARK-3219
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Derrick Burns
>Assignee: Derrick Burns
>  Labels: clustering
>
> The K-Means clusterer supports the Euclidean distance metric.  However, it is 
> rather straightforward to support Bregman 
> (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
> distance functions which would increase the utility of the clusterer 
> tremendously.
> I have modified the clusterer to support pluggable distance functions.  
> However, I notice that there are hundreds of outstanding pull requests.  If 
> someone is willing to work with me to sponsor the work through the process, I 
> will create a pull request.  Otherwise, I will just keep my own fork.
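
One possible shape for pluggable divergences, purely a sketch of the idea (the names are hypothetical, not the API in the fork mentioned above):

{code:scala}
// Hypothetical trait for pluggable Bregman divergences; squared Euclidean distance and
// (smoothed) KL divergence are both members of the family.
trait BregmanDivergence extends Serializable {
  def divergence(x: Array[Double], y: Array[Double]): Double
}

object SquaredEuclidean extends BregmanDivergence {
  def divergence(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
}

object KullbackLeibler extends BregmanDivergence {
  // Assumes x and y are probability vectors; eps avoids log(0).
  private val eps = 1e-12
  def divergence(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (p, q) => p * math.log((p + eps) / (q + eps)) }.sum
}

// A generalized clusterer would take a BregmanDivergence parameter and assign each
// point to the center minimizing divergence(point, center).
{code}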



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3504) KMeans optimization: track distances and unmoved cluster centers across iterations

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3504:
-
Labels: clustering  (was: )

> KMeans optimization: track distances and unmoved cluster centers across 
> iterations
> --
>
> Key: SPARK-3504
> URL: https://issues.apache.org/jira/browse/SPARK-3504
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>  Labels: clustering
>
> The 1.0.2 implementation of the KMeans clusterer is VERY inefficient because 
> it recomputes all distances to all cluster centers on each iteration. In later 
> iterations of Lloyd's algorithm, points don't change clusters and clusters 
> don't move.
> By 1) tracking which clusters move and 2) tracking, for each point, which 
> cluster it belongs to and the distance to that cluster, one can avoid 
> recomputing distances in many cases with very little increase in memory 
> requirements.
> I implemented this new algorithm and the results were fantastic. Using 16 
> c3.8xlarge machines on EC2, the clusterer converged in 13 iterations on 
> 1,714,654 (182-dimensional) points and 20,000 clusters in 24 minutes. Here 
> are the running times for the first 7 rounds:
> 6 minutes and 42 seconds
> 7 minutes and 7 seconds
> 7 minutes and 13 seconds
> 1 minute and 18 seconds
> 30 seconds
> 18 seconds
> 12 seconds
> Without this improvement, every round would have taken roughly 7 minutes, so 
> the Lloyd's iterations would have taken 7 * 13 = 91 minutes. In other words, 
> this improvement resulted in a reduction of roughly 75% in running time with 
> no loss of accuracy.
> My implementation is a rewrite of the existing 1.0.2 implementation, not a 
> simple modification of it. Please let me know if you are interested in this 
> new implementation.
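
A sketch of the per-point bookkeeping described above, with hypothetical names. It is deliberately conservative: a point is fully re-evaluated whenever its own center moved, and only checked against moved centers otherwise (distances to unmoved centers cannot have changed, so they cannot beat the cached assignment).

{code:scala}
// Hypothetical per-iteration update: the caller caches each point's (centerIndex, distance)
// and the set of center indices that moved since the last iteration.
case class Assignment(center: Int, dist: Double)

def update(point: Array[Double],
           cached: Assignment,
           centers: Array[Array[Double]],
           moved: Set[Int],
           dist: (Array[Double], Array[Double]) => Double): Assignment = {
  if (moved.contains(cached.center)) {
    // Our own center moved: recompute against every center.
    val (c, d) = centers.zipWithIndex.map { case (ctr, i) => (i, dist(point, ctr)) }.minBy(_._2)
    Assignment(c, d)
  } else {
    // Our center is unchanged: only centers that moved could have become closer.
    moved.foldLeft(cached) { (best, i) =>
      val d = dist(point, centers(i))
      if (d < best.dist) Assignment(i, d) else best
    }
  }
}
{code}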



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2336) Approximate k-NN Models for MLLib

2015-02-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2336:
-
Labels: clustering features newbie  (was: features newbie)

> Approximate k-NN Models for MLLib
> -
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per 
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
> also offer approximate k-Nearest Neighbor. A promising approach would involve 
> building a kd-tree variant within each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
> This could offer a simple non-linear ML model that can label new data with 
> much lower latency than the plain-vanilla kNN versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


