[jira] [Created] (SPARK-5506) java.lang.ClassCastException using lambda expressions in combination of spark and Servlet

2015-01-30 Thread Milad Khajavi (JIRA)
Milad Khajavi created SPARK-5506:


 Summary: java.lang.ClassCastException using lambda expressions in 
combination of spark and Servlet
 Key: SPARK-5506
 URL: https://issues.apache.org/jira/browse/SPARK-5506
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 1.2.0
 Environment: spark server: Ubuntu 14.04 amd64

$ java -version
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)


Reporter: Milad Khajavi
Priority: Blocker


I'm trying to build a web API for my Apache Spark jobs using the sparkjava.com 
framework. My code is:

@Override
public void init() {
    get("/hello", (req, res) -> {
        String sourcePath = "hdfs://spark:54310/input/*";

        SparkConf conf = new SparkConf().setAppName("LineCount");
        conf.setJars(new String[] { "/home/sam/resin-4.0.42/webapps/test.war" });
        File configFile = new File("config.properties");

        String sparkURI = "spark://hamrah:7077";

        conf.setMaster(sparkURI);
        conf.set("spark.driver.allowMultipleContexts", "true");
        JavaSparkContext sc = new JavaSparkContext(conf);

        @SuppressWarnings("resource")
        JavaRDD<String> log = sc.textFile(sourcePath);

        JavaRDD<String> lines = log.filter(x -> {
            return true;
        });

        return lines.count();
    });
}
If I remove the lambda expression, or put it inside a simple jar rather than a 
web service (essentially a Servlet), it will run without any error. But using a 
lambda expression inside a Servlet will result in this exception:

15/01/28 10:36:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
hamrah): java.lang.ClassCastException: cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.api.java.JavaRDD$$anonfun$filter$1.f$1 of type 
org.apache.spark.api.java.function.Function in instance of 
org.apache.spark.api.java.JavaRDD$$anonfun$filter$1
at 
java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2089)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1999)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
P.S.: I tried combinations of Jersey and sparkjava with Jetty, Tomcat, and Resin, 
and all of them led me to the same result.

Here is the same issue: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-YARN-java-lang-ClassCastException-SerializedLambda-to-org-apache-spark-api-java-function-Fu1-tt21261.html

This is my colleague's question on Stack Overflow: 
http://stackoverflow.com/questions/28186607/java-lang-classcastexception-using-lambda-expressions-in-spark-job-on-remote-ser
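
For anyone hitting this, a possible workaround sketch (the class and its name below are hypothetical, not a confirmed fix) is to replace the lambda with a named class implementing org.apache.spark.api.java.function.Function, since the report above notes the job runs fine once the lambda is removed:

{code}
// Hypothetical workaround sketch: a named, serializable Function instead of a
// Java 8 lambda, so the closure is shipped as a plain serialized object rather
// than a java.lang.invoke.SerializedLambda.
import java.io.Serializable;
import org.apache.spark.api.java.function.Function;

public class KeepAllFilter implements Function<String, Boolean>, Serializable {
    @Override
    public Boolean call(String line) {
        return Boolean.TRUE; // same behaviour as the lambda in the code above
    }
}

// inside init():
// JavaRDD<String> lines = log.filter(new KeepAllFilter());
{code}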






[jira] [Resolved] (SPARK-5307) Add utility to help with NotSerializableException debugging

2015-01-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-5307.

   Resolution: Fixed
Fix Version/s: 1.3.0

> Add utility to help with NotSerializableException debugging
> ---
>
> Key: SPARK-5307
> URL: https://issues.apache.org/jira/browse/SPARK-5307
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.3.0
>
>
> Scala closures can easily capture objects unintentionally, especially with 
> implicit arguments. I think we can do more than just relying on the users 
> being smart about using sun.io.serialization.extendedDebugInfo to find more 
> debug information.
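
For reference, a minimal sketch (an assumption about usage, not part of this ticket's fix) of how a user can already surface the extra serialization debug path by passing the JDK flag through Spark's extra-Java-options settings:

{code}
// Hedged sketch: enable the JDK's extended serialization debug info on both the
// driver and the executors. The two extraJavaOptions keys are standard Spark
// settings; the app name is a placeholder. In client mode the driver JVM is
// already running, so the driver flag may need to go on the launch command instead.
import org.apache.spark.SparkConf;

public final class SerializationDebugConf {
    public static SparkConf build() {
        return new SparkConf()
            .setAppName("serialization-debug")
            .set("spark.driver.extraJavaOptions",
                 "-Dsun.io.serialization.extendedDebugInfo=true")
            .set("spark.executor.extraJavaOptions",
                 "-Dsun.io.serialization.extendedDebugInfo=true");
    }
}
{code}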






[jira] [Updated] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag

2015-01-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3694:
---
Fix Version/s: 1.3.0

> Allow printing object graph of tasks/RDD's with a debug flag
> 
>
> Key: SPARK-3694
> URL: https://issues.apache.org/jira/browse/SPARK-3694
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Ilya Ganelin
>  Labels: starter
> Fix For: 1.3.0
>
>
> This would be useful for debugging extra references inside of RDD's
> Here is an example for inspiration:
> http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html
> We'd want to print this trace for both the RDD serialization inside of the 
> DAGScheduler and the task serialization in the TaskSetManager.
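
As an illustration only (not the utility proposed here; the class name is hypothetical and arrays are skipped for brevity), a tiny reflective object-graph walk of the kind the linked ObjectGraphWalker performs could look like this:

{code}
// Toy sketch: follow instance reference fields depth-first and print every
// distinct object reachable from a task or RDD before it is serialized.
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

public final class ObjectGraphDump {
    public static void walk(Object root) {
        Set<Object> seen = Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
        visit(root, "root", seen);
    }

    private static void visit(Object node, String path, Set<Object> seen) {
        if (node == null || !seen.add(node)) {
            return; // null or already visited (handles cycles)
        }
        System.out.println(path + " -> " + node.getClass().getName());
        for (Class<?> c = node.getClass(); c != null; c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                if (Modifier.isStatic(f.getModifiers()) || f.getType().isPrimitive()) {
                    continue; // only follow instance reference fields
                }
                f.setAccessible(true);
                try {
                    visit(f.get(node), path + "." + f.getName(), seen);
                } catch (IllegalAccessException ignored) {
                    // best-effort debugging aid only
                }
            }
        }
    }
}
{code}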






[jira] [Updated] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag

2015-01-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3694:
---
Target Version/s: 1.3.0  (was: 1.2.0)

> Allow printing object graph of tasks/RDD's with a debug flag
> 
>
> Key: SPARK-3694
> URL: https://issues.apache.org/jira/browse/SPARK-3694
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Ilya Ganelin
>  Labels: starter
> Fix For: 1.3.0
>
>
> This would be useful for debugging extra references inside of RDD's
> Here is an example for inspiration:
> http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html
> We'd want to print this trace for both the RDD serialization inside of the 
> DAGScheduler and the task serialization in the TaskSetManager.






[jira] [Commented] (SPARK-5307) Add utility to help with NotSerializableException debugging

2015-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299696#comment-14299696
 ] 

Apache Spark commented on SPARK-5307:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/4297

> Add utility to help with NotSerializableException debugging
> ---
>
> Key: SPARK-5307
> URL: https://issues.apache.org/jira/browse/SPARK-5307
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Scala closures can easily capture objects unintentionally, especially with 
> implicit arguments. I think we can do more than just relying on the users 
> being smart about using sun.io.serialization.extendedDebugInfo to find more 
> debug information.






[jira] [Commented] (SPARK-1010) Update all unit tests to use SparkConf instead of system properties

2015-01-30 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299683#comment-14299683
 ] 

Josh Rosen commented on SPARK-1010:
---

The pull request for SPARK-5425 fixes a bug in the ResetSystemProperties trait 
added here.  My original implementation of that trait didn't properly reset the 
system properties because it didn't perform a proper clone: 
https://github.com/apache/spark/pull/4220#issuecomment-71992373.
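
For illustration, a minimal Java sketch of the save/restore pattern the trait is meant to implement (not the actual Spark code; the class name is invented). The point is that a snapshot must copy the entries rather than hold a reference to, or a shallow copy of, the live Properties object:

{code}
import java.util.Properties;

public final class SystemPropertySnapshot {
    private final Properties saved = new Properties();

    public void capture() {
        // copy entry by entry so later mutations of the live Properties object
        // (returned by System.getProperties()) cannot leak into the snapshot
        saved.clear();
        for (String name : System.getProperties().stringPropertyNames()) {
            saved.setProperty(name, System.getProperty(name));
        }
    }

    public void restore() {
        Properties restored = new Properties();
        restored.putAll(saved);
        System.setProperties(restored);
    }
}
{code}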

> Update all unit tests to use SparkConf instead of system properties
> ---
>
> Key: SPARK-1010
> URL: https://issues.apache.org/jira/browse/SPARK-1010
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 0.9.0
>Reporter: Patrick Wendell
>Assignee: Josh Rosen
>Priority: Minor
> Fix For: 1.0.3, 1.3.0, 1.1.2, 1.2.1
>
>







[jira] [Commented] (SPARK-4981) Add a streaming singular value decomposition

2015-01-30 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299682#comment-14299682
 ] 

Tathagata Das commented on SPARK-4981:
--

+1 This will be awesome :P

> Add a streaming singular value decomposition
> 
>
> Key: SPARK-4981
> URL: https://issues.apache.org/jira/browse/SPARK-4981
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, Streaming
>Reporter: Jeremy Freeman
>
> This is for tracking WIP on a streaming singular value decomposition 
> implementation. This will likely be more complex than the existing streaming 
> algorithms (k-means, regression), but should be possible using the family of 
> sequential update rules outlined in this paper:
> "Fast low-rank modifications of the thin singular value decomposition"
> by Matthew Brand
> http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf






[jira] [Commented] (SPARK-2614) Add the spark-examples-xxx-.jar to the Debian packages created with mvn ... -Pdeb (using assembly/pom.xml)

2015-01-30 Thread Christian Tzolov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299668#comment-14299668
 ] 

Christian Tzolov commented on SPARK-2614:
-

Not sure if this issue is still alive. In case someone is interested, I've 
rebased and realigned the #1611 pull request against upstream/master (e643de42a7).

> Add the spark-examples-xxx-.jar to the Debian packages created with mvn ... 
> -Pdeb (using assembly/pom.xml)
> --
>
> Key: SPARK-2614
> URL: https://issues.apache.org/jira/browse/SPARK-2614
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Deploy
>Reporter: Christian Tzolov
>
> The tar.gz distribution already includes the spark-examples.jar in the 
> bundle. It is a common practice for installers to run SparkPi as a smoke test 
> to verify that the installation is OK:
> /usr/share/spark/bin/spark-submit \
>   --num-executors 10  --master yarn-cluster \
>   --class org.apache.spark.examples.SparkPi \
>   /usr/share/spark/jars/spark-examples-1.0.1-hadoop2.2.0.jar 10






[jira] [Commented] (SPARK-4981) Add a streaming singular value decomposition

2015-01-30 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299666#comment-14299666
 ] 

Reza Zadeh commented on SPARK-4981:
---

To be model parallel, we can simply warm-start the current ALS implementation 
in org.apache.spark.mllib.recommendation.

The work involved would be to expose a warm-start option in ALS, and then redo 
training with, say, 2 iterations instead of 10, with each batch of RDDs.

The stream would be over batches of Ratings.

This should be the simplest option.


> Add a streaming singular value decomposition
> 
>
> Key: SPARK-4981
> URL: https://issues.apache.org/jira/browse/SPARK-4981
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, Streaming
>Reporter: Jeremy Freeman
>
> This is for tracking WIP on a streaming singular value decomposition 
> implementation. This will likely be more complex than the existing streaming 
> algorithms (k-means, regression), but should be possible using the family of 
> sequential update rules outlined in this paper:
> "Fast low-rank modifications of the thin singular value decomposition"
> by Matthew Brand
> http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf






[jira] [Updated] (SPARK-5472) Add support for reading from and writing to a JDBC database

2015-01-30 Thread Tor Myklebust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tor Myklebust updated SPARK-5472:
-
Description: 
It would be nice to be able to make a table in a JDBC database appear as a 
table in Spark SQL.  This would let users, for instance, perform a JOIN between 
a DataFrame in Spark SQL and a table in a Postgres database.

It might also be nice to be able to go the other direction -- save a DataFrame 
to a database -- for instance in an ETL job.

Edited to clarify:  Both of these tasks are certainly possible to accomplish at 
the moment with a little bit of ad-hoc glue code.  However, there is no 
fundamental reason why the user should need to supply the table schema and some 
code for pulling data out of a ResultSet row into a Catalyst Row structure when 
this information can be derived from the schema of the database table itself.

  was:
It would be nice to be able to make a table in a JDBC database appear as a 
table in Spark SQL.  This would let users, for instance, perform a JOIN between 
a DataFrame in Spark SQL and a table in a Postgres database.

It might also be nice to be able to go the other direction -- save a DataFrame 
to a database -- for instance in an ETL job.


> Add support for reading from and writing to a JDBC database
> ---
>
> Key: SPARK-5472
> URL: https://issues.apache.org/jira/browse/SPARK-5472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tor Myklebust
>Assignee: Tor Myklebust
>Priority: Blocker
>
> It would be nice to be able to make a table in a JDBC database appear as a 
> table in Spark SQL.  This would let users, for instance, perform a JOIN 
> between a DataFrame in Spark SQL and a table in a Postgres database.
> It might also be nice to be able to go the other direction -- save a 
> DataFrame to a database -- for instance in an ETL job.
> Edited to clarify:  Both of these tasks are certainly possible to accomplish 
> at the moment with a little bit of ad-hoc glue code.  However, there is no 
> fundamental reason why the user should need to supply the table schema and 
> some code for pulling data out of a ResultSet row into a Catalyst Row 
> structure when this information can be derived from the schema of the 
> database table itself.






[jira] [Commented] (SPARK-5472) Add support for reading from and writing to a JDBC database

2015-01-30 Thread Tor Myklebust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299657#comment-14299657
 ] 

Tor Myklebust commented on SPARK-5472:
--

You appear to understand the issue perfectly.  You have to write the case class 
mapper, work out the schema, and register the thing as a temporary table.  Once 
you've done all that for one table, you have to do something rather similar for 
the next table you want to load.  And all this work requires Scala coding 
rather than a short SQL query.  This is more complexity for the user than the 
problem really deserves.
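
To make the point concrete, here is a hypothetical example of the kind of per-table glue code being described (plain JDBC, with an invented table and columns), which has to be rewritten for every table before the data can be handed to Spark SQL:

{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public final class UserTableLoader {
    // hand-maintained mirror of one table's schema
    public static final class User {
        public final int id;
        public final String name;
        public User(int id, String name) { this.id = id; this.name = name; }
    }

    public static List<User> load(String jdbcUrl) throws Exception {
        List<User> rows = new ArrayList<User>();
        Connection conn = DriverManager.getConnection(jdbcUrl);
        try {
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT id, name FROM users");
            while (rs.next()) {
                // column-by-column mapping that must be repeated for each new table
                rows.add(new User(rs.getInt("id"), rs.getString("name")));
            }
        } finally {
            conn.close();
        }
        return rows;
    }
}
{code}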

> Add support for reading from and writing to a JDBC database
> ---
>
> Key: SPARK-5472
> URL: https://issues.apache.org/jira/browse/SPARK-5472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tor Myklebust
>Assignee: Tor Myklebust
>Priority: Blocker
>
> It would be nice to be able to make a table in a JDBC database appear as a 
> table in Spark SQL.  This would let users, for instance, perform a JOIN 
> between a DataFrame in Spark SQL and a table in a Postgres database.
> It might also be nice to be able to go the other direction -- save a 
> DataFrame to a database -- for instance in an ETL job.






[jira] [Comment Edited] (SPARK-5472) Add support for reading from and writing to a JDBC database

2015-01-30 Thread Tor Myklebust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299657#comment-14299657
 ] 

Tor Myklebust edited comment on SPARK-5472 at 1/31/15 4:12 AM:
---

You appear to understand the issue perfectly.  You have to write the case class 
mapper, work out the schema, and register the thing as a temporary table.  Once 
you've done all that for one table, you have to do something rather similar for 
the next table you want to load.  And all this work requires Scala coding 
rather than a short SQL query.  This is more complexity for the user than the 
problem really deserves, and it appears to be easy to automate in a reasonably 
transparent way.


was (Author: tmyklebu):
You appear to understand the issue perfectly.  You have to write the case class 
mapper, work out the schema, and register the thing as a temporary table.  Once 
you've done all that for one table, you have to do something rather similar for 
the next table you want to load.  And all this work requires Scala coding 
rather than a short SQL query.  This is more complexity for the user than the 
problem really deserves.

> Add support for reading from and writing to a JDBC database
> ---
>
> Key: SPARK-5472
> URL: https://issues.apache.org/jira/browse/SPARK-5472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tor Myklebust
>Assignee: Tor Myklebust
>Priority: Blocker
>
> It would be nice to be able to make a table in a JDBC database appear as a 
> table in Spark SQL.  This would let users, for instance, perform a JOIN 
> between a DataFrame in Spark SQL and a table in a Postgres database.
> It might also be nice to be able to go the other direction -- save a 
> DataFrame to a database -- for instance in an ETL job.






[jira] [Commented] (SPARK-5472) Add support for reading from and writing to a JDBC database

2015-01-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299642#comment-14299642
 ] 

Reynold Xin commented on SPARK-5472:


[~chinnitv], thanks for commenting. The existing JdbcRDD doesn't really solve 
the use case of loading data in. In particular, it cannot:
1. Be used in pure SQL
2. Be used without tons of glue code (converting a ResultSet to case classes is 
still more code than necessary to write)
3. Support filter pushdown, unless users manually write the pushdown.



> Add support for reading from and writing to a JDBC database
> ---
>
> Key: SPARK-5472
> URL: https://issues.apache.org/jira/browse/SPARK-5472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tor Myklebust
>Assignee: Tor Myklebust
>Priority: Blocker
>
> It would be nice to be able to make a table in a JDBC database appear as a 
> table in Spark SQL.  This would let users, for instance, perform a JOIN 
> between a DataFrame in Spark SQL and a table in a Postgres database.
> It might also be nice to be able to go the other direction -- save a 
> DataFrame to a database -- for instance in an ETL job.






[jira] [Commented] (SPARK-5472) Add support for reading from and writing to a JDBC database

2015-01-30 Thread Anand Mohan Tumuluri (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299633#comment-14299633
 ] 

Anand Mohan Tumuluri commented on SPARK-5472:
-

Pardon my ignorance, but I think:
JdbcRDD can be given a ResultSet-to-case-class mapper, which will yield an 
RDD[case class].
Any RDD[case class] (RDD[Product]) can be converted into a SchemaRDD by using the 
createSchemaRDD method of SQLContext/HiveContext. This SchemaRDD can then be 
registered as a temp table within Spark SQL through registerTempTable and then 
can be joined to other Spark SQL tables.

This solves the use case of loading data from a JDBC data source, doesn't it? Am 
I missing something? Of course this requires Scala and the Spark shell, meaning it 
can't be done from spark-sql or thriftserver2.

However, there currently is no easy way of saving an RDD into a JDBC data sink 
(DBOutputFormat is way too rigid).
This PR, providing a generic mechanism for saving a SchemaRDD into an RDBMS table, 
will be very valuable for us.

> Add support for reading from and writing to a JDBC database
> ---
>
> Key: SPARK-5472
> URL: https://issues.apache.org/jira/browse/SPARK-5472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tor Myklebust
>Assignee: Tor Myklebust
>Priority: Blocker
>
> It would be nice to be able to make a table in a JDBC database appear as a 
> table in Spark SQL.  This would let users, for instance, perform a JOIN 
> between a DataFrame in Spark SQL and a table in a Postgres database.
> It might also be nice to be able to go the other direction -- save a 
> DataFrame to a database -- for instance in an ETL job.






[jira] [Created] (SPARK-5505) ConsumerRebalanceFailedException from Kafka consumer

2015-01-30 Thread Greg Temchenko (JIRA)
Greg Temchenko created SPARK-5505:
-

 Summary: ConsumerRebalanceFailedException from Kafka consumer
 Key: SPARK-5505
 URL: https://issues.apache.org/jira/browse/SPARK-5505
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0
 Environment: CentOS6 / Linux 2.6.32-358.2.1.el6.x86_64
java version "1.7.0_21"
Scala compiler version 2.9.3
2 cores Intel(R) Xeon(R) CPU E5620  @ 2.40GHz / 16G RAM
VMWare VM.
Reporter: Greg Temchenko
Priority: Critical


From time to time Spark Streaming produces a ConsumerRebalanceFailedException 
and stops receiving messages. After that, all subsequent RDDs are empty.

{code}
15/01/30 18:18:36 ERROR consumer.ZookeeperConsumerConnector: 
[terran_vmname-1422670149779-243b4e10], error during syncedRebalance
kafka.common.ConsumerRebalanceFailedException: 
terran_vmname-1422670149779-243b4e10 can't rebalance after 4 retries
at 
kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:432)
at 
kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anon$1.run(ZookeeperConsumerConnector.scala:355)
{code}

The problem is also described in the mailing list: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-when-Spark-streaming-consumes-from-Kafka-td19570.html

As I understand it, this is a critical blocker for using Kafka with Spark 
Streaming in production.
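
A possible mitigation sketch (hedged: this only gives the high-level Kafka consumer more room to rebalance, it does not fix the underlying race; the ZooKeeper address, group id, and topic below are placeholders) is to pass larger retry/backoff settings through kafkaParams:

{code}
import java.util.HashMap;
import java.util.Map;

import kafka.serializer.StringDecoder;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public final class KafkaStreamSetup {
    public static JavaPairReceiverInputDStream<String, String> create(JavaStreamingContext jssc) {
        Map<String, String> kafkaParams = new HashMap<String, String>();
        kafkaParams.put("zookeeper.connect", "zkhost:2181");   // placeholder
        kafkaParams.put("group.id", "terran");                 // placeholder
        kafkaParams.put("rebalance.max.retries", "10");        // Kafka's default is 4, as in the log above
        kafkaParams.put("rebalance.backoff.ms", "5000");
        kafkaParams.put("zookeeper.session.timeout.ms", "10000");

        Map<String, Integer> topics = new HashMap<String, Integer>();
        topics.put("events", 1);                                // placeholder topic

        return KafkaUtils.createStream(jssc, String.class, String.class,
            StringDecoder.class, StringDecoder.class, kafkaParams, topics,
            StorageLevel.MEMORY_AND_DISK_SER_2());
    }
}
{code}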








[jira] [Commented] (SPARK-5483) java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;

2015-01-30 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299571#comment-14299571
 ] 

DeepakVohra commented on SPARK-5483:


To test:

1. Create a Maven project in Eclipse IDE.
2. Add the Spark MLLib dependency 2.10.


org.apache.spark
spark-mllib_2.10
1.2.0


3. Add a Java class to the Maven project.

public class Test{}

4. Add the following import statements to the Java class Test.

import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
> ---
>
> Key: SPARK-5483
> URL: https://issues.apache.org/jira/browse/SPARK-5483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Maven
> Spark 1.2
>Reporter: DeepakVohra
>
> Naive Bayes classifier generates following error.
> ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.<init>(DenseVectorOps.scala:38)
> at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
> at breeze.linalg.DenseVector$.<init>(DenseVector.scala:225)
> at breeze.linalg.DenseVector$.<clinit>(DenseVector.scala)
> at breeze.linalg.DenseVector.<init>(DenseVector.scala:63)
> at breeze.linalg.DenseVector$mcD$sp.<init>(DenseVector.scala:50)
> at breeze.linalg.DenseVector$mcD$sp.<init>(DenseVector.scala:55)
>   at org.apache.spark.mllib.linalg.DenseVector.toBreeze(Vectors.scala:329)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:112)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:110)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:200)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:199)
>   at 
> org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:142)
>   at 
> org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
>   at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:205)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/01/28 21:50:06 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-0,5,main]
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.<init>(DenseVectorOps.scala:38)
> at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
> at breeze.linalg.DenseVector$.<init>(DenseVector.scala:225)
> at breeze.linalg.DenseVector$.<clinit>(DenseVector.scala)
> at breeze.linalg.DenseVector.<init>(DenseVector.scala:63)
> at breeze.linalg.DenseVector$mcD$sp.<init>(DenseVector.scala:50)
> at breeze.linalg.DenseVector$mcD$sp.<init>(DenseVector.scala:55)
>   at org.apache.spark.mllib.linalg.DenseVector.toBreeze(Vectors.scala:329)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:11

[jira] [Closed] (SPARK-5299) Is http://www.apache.org/dist/spark/KEYS out of date?

2015-01-30 Thread David Shaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Shaw closed SPARK-5299.
-

Verified that the key used to sign the release is now present.  Thanks.

> Is http://www.apache.org/dist/spark/KEYS out of date?
> -
>
> Key: SPARK-5299
> URL: https://issues.apache.org/jira/browse/SPARK-5299
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Reporter: David Shaw
>Assignee: Patrick Wendell
>
> The keys contained in http://www.apache.org/dist/spark/KEYS do not appear to 
> match the keys used to sign the releases.






[jira] [Resolved] (SPARK-5504) ScalaReflection.convertToCatalyst should support nested arrays

2015-01-30 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5504.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4295
[https://github.com/apache/spark/pull/4295]

> ScalaReflection.convertToCatalyst should support nested arrays
> --
>
> Key: SPARK-5504
> URL: https://issues.apache.org/jira/browse/SPARK-5504
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
> Fix For: 1.3.0
>
>
> After the recent refactoring, convertToCatalyst in ScalaReflection does not 
> recurse on Arrays.  It should.






[jira] [Resolved] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture

2015-01-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5400.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4290
[https://github.com/apache/spark/pull/4290]

> Rename GaussianMixtureEM to GaussianMixture
> ---
>
> Key: SPARK-5400
> URL: https://issues.apache.org/jira/browse/SPARK-5400
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Travis Galoppo
>Priority: Minor
> Fix For: 1.3.0
>
>
> GaussianMixtureEM is following the old naming convention of including the 
> optimization algorithm name in the class title.  We should probably rename it 
> to GaussianMixture so that it can use other optimization algorithms in the 
> future.






[jira] [Commented] (SPARK-5504) ScalaReflection.convertToCatalyst should support nested arrays

2015-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299291#comment-14299291
 ] 

Apache Spark commented on SPARK-5504:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/4295

> ScalaReflection.convertToCatalyst should support nested arrays
> --
>
> Key: SPARK-5504
> URL: https://issues.apache.org/jira/browse/SPARK-5504
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> After the recent refactoring, convertToCatalyst in ScalaReflection does not 
> recurse on Arrays.  It should.






[jira] [Resolved] (SPARK-4259) Add Power Iteration Clustering Algorithm with Gaussian Similarity Function

2015-01-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-4259.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4254
[https://github.com/apache/spark/pull/4254]

> Add Power Iteration Clustering Algorithm with Gaussian Similarity Function
> --
>
> Key: SPARK-4259
> URL: https://issues.apache.org/jira/browse/SPARK-4259
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Fan Jiang
>Assignee: Fan Jiang
>  Labels: features
> Fix For: 1.3.0
>
>
> In recent years, power iteration clustering has become one of the most 
> popular modern clustering algorithms. It is simple to implement, can be 
> solved efficiently by standard linear algebra software, and very often 
> outperforms traditional clustering algorithms such as the k-means algorithm.
> Power iteration clustering is a scalable and efficient algorithm for 
> clustering points given pointwise mutual affinity values. Internally the 
> algorithm:
> - computes the Gaussian distance between all pairs of points and represents 
> these distances in an Affinity Matrix
> - calculates a Normalized Affinity Matrix
> - calculates the principal eigenvalue and eigenvector
> - clusters each of the input points according to their principal eigenvector 
> component value
> Details of this algorithm are found in [Power Iteration Clustering, Lin 
> and Cohen|http://www.icml2010.org/papers/387.pdf]
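
For readers unfamiliar with the method, a toy sketch of the core power-iteration step on a small dense, row-normalized affinity matrix (illustration only; the MLlib implementation is distributed and graph-based, and the class name is invented):

{code}
import java.util.Arrays;

public final class PowerIterationSketch {
    public static double[] run(double[][] w, int iterations) {
        int n = w.length;
        double[] v = new double[n];
        Arrays.fill(v, 1.0 / n);                 // uniform starting vector
        for (int iter = 0; iter < iterations; iter++) {
            double[] next = new double[n];
            double norm = 0.0;
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    next[i] += w[i][j] * v[j];   // one matrix-vector multiply
                }
                norm += Math.abs(next[i]);
            }
            for (int i = 0; i < n; i++) {
                next[i] /= norm;                 // L1 re-normalization
            }
            v = next;
        }
        // components of v approximate the dominant pseudo-eigenvector; they are
        // then clustered (e.g. with 1-D k-means) to assign points to clusters
        return v;
    }
}
{code}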






[jira] [Updated] (SPARK-5503) Example code for Power Iteration Clustering

2015-01-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5503:
-
Description: 
There are two places we need to put examples:

1. In the user guide, we should have a small example (as in the unit test).
2. Under examples/, we can have something fancy but still need to keep it 
minimal.
3. The user guide contains some out-of-date links, which need to be updated as 
well.

  was:
There are two places we need to put examples:

1. In the user guide, we should be a small example (as in the unit test).
2. Under examples/, we can have something fancy.


> Example code for Power Iteration Clustering
> ---
>
> Key: SPARK-5503
> URL: https://issues.apache.org/jira/browse/SPARK-5503
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Examples, MLlib
>Reporter: Xiangrui Meng
>Assignee: Stephen Boesch
>
> There are two places we need to put examples:
> 1. In the user guide, we should have a small example (as in the unit test).
> 2. Under examples/, we can have something fancy but still need to keep it 
> minimal.
> 3. The user guide contains some out-of-date links, which need to be updated 
> as well.






[jira] [Resolved] (SPARK-5486) Add validate function for BlockMatrix

2015-01-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5486.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4279
[https://github.com/apache/spark/pull/4279]

> Add validate function for BlockMatrix
> -
>
> Key: SPARK-5486
> URL: https://issues.apache.org/jira/browse/SPARK-5486
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Burak Yavuz
> Fix For: 1.3.0
>
>
> BlockMatrix needs a validate method to make debugging easy for users. 
> It will be an expensive method to perform, but it would be useful for users 
> to know why `multiply` or `add` didn't work properly.
> Things to validate:
> - MatrixBlocks that are not on the edges should have the dimensions 
> `rowsPerBlock` and `colsPerBlock`.
> - There should be at most one block for each index
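
Purely as an illustration of the two checks listed above (the real BlockMatrix stores its blocks in an RDD, so this local sketch with invented types is not the MLlib code):

{code}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public final class BlockMatrixChecks {
    // hypothetical stand-in for a (blockRow, blockCol) -> local matrix entry
    public static final class Block {
        final int blockRow;
        final int blockCol;
        final double[][] values;
        Block(int blockRow, int blockCol, double[][] values) {
            this.blockRow = blockRow;
            this.blockCol = blockCol;
            this.values = values;
        }
    }

    public static void validate(List<Block> blocks, int numRowBlocks, int numColBlocks,
                                int rowsPerBlock, int colsPerBlock) {
        Set<String> seen = new HashSet<String>();
        for (Block b : blocks) {
            // at most one block per (blockRow, blockCol) index
            if (!seen.add(b.blockRow + "," + b.blockCol)) {
                throw new IllegalStateException(
                    "Duplicate block at (" + b.blockRow + ", " + b.blockCol + ")");
            }
            boolean lastRow = b.blockRow == numRowBlocks - 1;
            boolean lastCol = b.blockCol == numColBlocks - 1;
            // blocks not on the right/bottom edge must be rowsPerBlock x colsPerBlock
            if ((!lastRow && b.values.length != rowsPerBlock)
                    || (!lastCol && b.values[0].length != colsPerBlock)) {
                throw new IllegalStateException(
                    "Block (" + b.blockRow + ", " + b.blockCol + ") has wrong dimensions");
            }
        }
    }
}
{code}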






[jira] [Updated] (SPARK-5486) Add validate function for BlockMatrix

2015-01-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5486:
-
Assignee: Burak Yavuz

> Add validate function for BlockMatrix
> -
>
> Key: SPARK-5486
> URL: https://issues.apache.org/jira/browse/SPARK-5486
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 1.3.0
>
>
> BlockMatrix needs a validate method to make debugging easy for users. 
> It will be an expensive method to perform, but it would be useful for users 
> to know why `multiply` or `add` didn't work properly.
> Things to validate:
> - MatrixBlocks that are not on the edges should have the dimensions 
> `rowsPerBlock` and `colsPerBlock`.
> - There should be at most one block for each index






[jira] [Created] (SPARK-5504) ScalaReflection.convertToCatalyst should support nested arrays

2015-01-30 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-5504:


 Summary: ScalaReflection.convertToCatalyst should support nested 
arrays
 Key: SPARK-5504
 URL: https://issues.apache.org/jira/browse/SPARK-5504
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Minor


After the recent refactoring, convertToCatalyst in ScalaReflection does not 
recurse on Arrays.  It should.






[jira] [Created] (SPARK-5503) Example code for Power Iteration Clustering

2015-01-30 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5503:


 Summary: Example code for Power Iteration Clustering
 Key: SPARK-5503
 URL: https://issues.apache.org/jira/browse/SPARK-5503
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, Examples, MLlib
Reporter: Xiangrui Meng
Assignee: Stephen Boesch


There are two places we need to put examples:

1. In the user guide, we should have a small example (as in the unit test).
2. Under examples/, we can have something fancy.






[jira] [Created] (SPARK-5502) User guide for isotonic regression

2015-01-30 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5502:


 Summary: User guide for isotonic regression
 Key: SPARK-5502
 URL: https://issues.apache.org/jira/browse/SPARK-5502
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Assignee: Martin Zapletal


Add user guide to docs/mllib-regression.md with code examples.






[jira] [Commented] (SPARK-5501) Write support for the data source API

2015-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299207#comment-14299207
 ] 

Apache Spark commented on SPARK-5501:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/4294

> Write support for the data source API
> -
>
> Key: SPARK-5501
> URL: https://issues.apache.org/jira/browse/SPARK-5501
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>







[jira] [Updated] (SPARK-5501) Write support for the data source API

2015-01-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-5501:

Summary: Write support for the data source API  (was: Initial Write support)

> Write support for the data source API
> -
>
> Key: SPARK-5501
> URL: https://issues.apache.org/jira/browse/SPARK-5501
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>







[jira] [Commented] (SPARK-5500) Document that feeding hadoopFile into a shuffle operation will cause problems

2015-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299199#comment-14299199
 ] 

Apache Spark commented on SPARK-5500:
-

User 'sryza' has created a pull request for this issue:
https://github.com/apache/spark/pull/4293

> Document that feeding hadoopFile into a shuffle operation will cause problems
> -
>
> Key: SPARK-5500
> URL: https://issues.apache.org/jira/browse/SPARK-5500
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.3.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>







[jira] [Created] (SPARK-5501) Initial Write support

2015-01-30 Thread Yin Huai (JIRA)
Yin Huai created SPARK-5501:
---

 Summary: Initial Write support
 Key: SPARK-5501
 URL: https://issues.apache.org/jira/browse/SPARK-5501
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Priority: Blocker









[jira] [Created] (SPARK-5500) Document that feeding hadoopFile into a shuffle operation will cause problems

2015-01-30 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-5500:
-

 Summary: Document that feeding hadoopFile into a shuffle operation 
will cause problems
 Key: SPARK-5500
 URL: https://issues.apache.org/jira/browse/SPARK-5500
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.3.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza









[jira] [Updated] (SPARK-5197) Support external shuffle service in fine-grained mode on mesos cluster

2015-01-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5197:
---
Fix Version/s: (was: 1.3.0)

> Support external shuffle service in fine-grained mode on mesos cluster
> --
>
> Key: SPARK-5197
> URL: https://issues.apache.org/jira/browse/SPARK-5197
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Mesos, Shuffle
>Reporter: Jongyoul Lee
>
> I think dynamic allocation is almost satisfied by Mesos' fine-grained mode, 
> which already offers resources dynamically and returns them automatically when a 
> task is finished. It, however, doesn't have a mechanism to support an external 
> shuffle service the way YARN does, which is via AuxiliaryService. Because Mesos 
> doesn't support AuxiliaryService, we need to think of a different way to do this.
> - Launching a shuffle service like a Spark job on the same cluster
> -- Pros
> --- Supports a multi-tenant environment
> --- Almost the same approach as YARN
> -- Cons
> --- Must control a long-running 'background' job (the service) while Mesos runs
> --- Requires every slave (host) to have one shuffle service running all the time
> - Launching jobs within the shuffle service
> -- Pros
> --- Easy to implement
> --- No need to consider whether a shuffle service exists or not
> -- Cons
> --- Multiple shuffle services exist in a multi-tenant environment
> --- Must control the shuffle service port dynamically in a multi-user environment
> In my opinion, the first one is the better idea to support an external shuffle 
> service. Please leave comments.






[jira] [Commented] (SPARK-2335) k-Nearest Neighbor classification and regression for MLLib

2015-01-30 Thread Brian Gawalt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299093#comment-14299093
 ] 

Brian Gawalt commented on SPARK-2335:
-

I'm inclined to wonder the same thing; generalizing beyond integers would
be nice.




> k-Nearest Neighbor classification and regression for MLLib
> --
>
> Key: SPARK-2335
> URL: https://issues.apache.org/jira/browse/SPARK-2335
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: features
>
> The k-Nearest Neighbor model for classification and regression problems is a 
> simple and intuitive approach, offering a straightforward path to creating 
> non-linear decision/estimation contours. Its downsides -- high variance 
> (sensitivity to the known training data set) and computational intensity for 
> estimating new point labels -- both play to Spark's big data strengths: lots 
> of data mitigates data concerns; lots of workers mitigate computational 
> latency. 
> We should include kNN models as options in MLLib.






[jira] [Commented] (SPARK-4259) Add Power Iteration Clustering Algorithm with Gaussian Similarity Function

2015-01-30 Thread Stephen Boesch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299092#comment-14299092
 ] 

Stephen Boesch commented on SPARK-4259:
---

Yes, the PR has a working version. However, Xiangrui has additional significant 
changes that will affect the API, so the recommendation here would be to wait 
until early next week for the dust to settle.

> Add Power Iteration Clustering Algorithm with Gaussian Similarity Function
> --
>
> Key: SPARK-4259
> URL: https://issues.apache.org/jira/browse/SPARK-4259
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Fan Jiang
>Assignee: Fan Jiang
>  Labels: features
>
> In recent years, power iteration clustering has become one of the most 
> popular modern clustering algorithms. It is simple to implement, can be 
> solved efficiently by standard linear algebra software, and very often 
> outperforms traditional clustering algorithms such as the k-means algorithm.
> Power iteration clustering is a scalable and efficient algorithm for 
> clustering points given pointwise mutual affinity values. Internally the 
> algorithm:
> - computes the Gaussian distance between all pairs of points and represents 
> these distances in an Affinity Matrix
> - calculates a Normalized Affinity Matrix
> - calculates the principal eigenvalue and eigenvector
> - clusters each of the input points according to their principal eigenvector 
> component value
> Details of this algorithm are found in [Power Iteration Clustering, Lin 
> and Cohen|http://www.icml2010.org/papers/387.pdf]






[jira] [Comment Edited] (SPARK-5483) java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;

2015-01-30 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299009#comment-14299009
 ] 

DeepakVohra edited comment on SPARK-5483 at 1/30/15 7:09 PM:
-

Thanks for the clarification about why the master URL is not set.

Is the Maven dependency spark-mllib_2.10 not a Spark issue?

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.2.0</version>
</dependency>


There must be some issue in how you are adding the classes to the classpath.

Modifying "2.10" to "2.11" fixes the issue and the org.apache.spark.mllib.* 
packages get found, but introduces the Scala version issue.



was (Author: dvohra):
Is the Maven dependency spark-mllib_2.10 not a Spark issue?

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.2.0</version>
</dependency>


There must be some issue in how you are adding the classes to the classpath.

Modifying "2.10" to "2.11" fixes the issue and the org.apache.spark.mllib.* 
packages get found, but introduces the Scala version issue.


> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
> ---
>
> Key: SPARK-5483
> URL: https://issues.apache.org/jira/browse/SPARK-5483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Maven
> Spark 1.2
>Reporter: DeepakVohra
>
> Naive Bayes classifier generates following error.
> ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.<init>(DenseVectorOps.scala:38)
> at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
> at breeze.linalg.DenseVector$.<init>(DenseVector.scala:225)
> at breeze.linalg.DenseVector$.<clinit>(DenseVector.scala)
> at breeze.linalg.DenseVector.<init>(DenseVector.scala:63)
> at breeze.linalg.DenseVector$mcD$sp.<init>(DenseVector.scala:50)
> at breeze.linalg.DenseVector$mcD$sp.<init>(DenseVector.scala:55)
>   at org.apache.spark.mllib.linalg.DenseVector.toBreeze(Vectors.scala:329)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:112)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:110)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:200)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:199)
>   at 
> org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:142)
>   at 
> org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
>   at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:205)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/01/28 21:50:06 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-0,5,main]
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.<init>(DenseVectorOps.scala:38)
>   at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
>   at breeze.

[jira] [Commented] (SPARK-4259) Add Power Iteration Clustering Algorithm with Gaussian Similarity Function

2015-01-30 Thread Andrew Musselman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299065#comment-14299065
 ] 

Andrew Musselman commented on SPARK-4259:
-

Makes sense; does that pull request contain a working version?

> Add Power Iteration Clustering Algorithm with Gaussian Similarity Function
> --
>
> Key: SPARK-4259
> URL: https://issues.apache.org/jira/browse/SPARK-4259
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Fan Jiang
>Assignee: Fan Jiang
>  Labels: features
>
> In recent years, power iteration clustering has become one of the most 
> popular modern clustering algorithms. It is simple to implement, can be 
> solved efficiently by standard linear algebra software, and very often 
> outperforms traditional clustering algorithms such as the k-means algorithm.
> Power iteration clustering is a scalable and efficient algorithm for 
> clustering points given pointwise mutual affinity values. Internally the 
> algorithm:
> - computes the Gaussian distance between all pairs of points and represents 
>   these distances in an affinity matrix
> - calculates a normalized affinity matrix
> - calculates the principal eigenvalue and eigenvector
> - clusters each of the input points according to its principal eigenvector 
>   component value
> Details of this algorithm are found in [Power Iteration Clustering, Lin 
> and Cohen|www.icml2010.org/papers/387.pdf]
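For readers who want to see the core of the method, here is a minimal
single-machine sketch of the power iteration step described above. It is purely
illustrative: the class and method names are made up, the affinity matrix is
assumed to fit in memory, and this is not the distributed MLlib implementation
proposed in the pull request.

// Power iteration on a row-normalized affinity matrix W: repeatedly apply
// v <- W v and re-normalize. The resulting pseudo-eigenvector's component
// values are what the points are clustered on (e.g. with k-means in 1D).
public final class PowerIterationSketch {
    static double[] powerIterate(double[][] w, int maxIterations) {
        int n = w.length;
        double[] v = new double[n];
        java.util.Arrays.fill(v, 1.0 / n);          // start from a uniform vector
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] next = new double[n];
            double norm = 0.0;
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    next[i] += w[i][j] * v[j];      // one matrix-vector product
                }
                norm += Math.abs(next[i]);
            }
            for (int i = 0; i < n; i++) {
                next[i] /= norm;                    // L1-normalize to keep values bounded
            }
            v = next;
        }
        return v;
    }
}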



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds

2015-01-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1517:
---
Target Version/s: 1.4.0

> Publish nightly snapshots of documentation, maven artifacts, and binary builds
> --
>
> Key: SPARK-1517
> URL: https://issues.apache.org/jira/browse/SPARK-1517
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Reporter: Patrick Wendell
>Priority: Blocker
>
> Should be pretty easy to do with Jenkins. The only thing I can think of that 
> would be tricky is to set up credentials so that jenkins can publish this 
> stuff somewhere on apache infra.
> Ideally we don't want to have to put a private key on every jenkins box 
> (since they are otherwise pretty stateless). One idea is to encrypt these 
> credentials with a passphrase and post them somewhere publicly visible. Then 
> the jenkins build can download the credentials provided we set a passphrase 
> in an environment variable in jenkins. There may be simpler solutions as well.
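To make the idea concrete, here is a rough sketch of that scheme in Java. It
only illustrates the approach (passphrase-derived key, publicly posted
ciphertext, passphrase supplied to the Jenkins job via an environment variable);
the class name, environment variable, key-derivation parameters and credential
format are all assumptions, not part of any actual build setup.

import javax.crypto.Cipher;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

public final class CredentialCrypto {
    // Derive an AES key from a passphrase so only the passphrase needs to be secret.
    private static SecretKeySpec deriveKey(char[] passphrase, byte[] salt) throws Exception {
        SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1");
        byte[] key = factory.generateSecret(new PBEKeySpec(passphrase, salt, 65536, 128)).getEncoded();
        return new SecretKeySpec(key, "AES");
    }

    static byte[] encrypt(byte[] plaintext, char[] passphrase, byte[] salt, byte[] iv) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, deriveKey(passphrase, salt), new IvParameterSpec(iv));
        return cipher.doFinal(plaintext);
    }

    static byte[] decrypt(byte[] ciphertext, char[] passphrase, byte[] salt, byte[] iv) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE, deriveKey(passphrase, salt), new IvParameterSpec(iv));
        return cipher.doFinal(ciphertext);
    }

    public static void main(String[] args) throws Exception {
        // The Jenkins job would read the passphrase from an environment variable
        // (name assumed here), download the public ciphertext, and decrypt it.
        char[] passphrase = System.getenv("NIGHTLY_PUBLISH_PASSPHRASE").toCharArray();
        byte[] salt = new byte[16];
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(salt);
        new SecureRandom().nextBytes(iv);
        byte[] secret = "publish-username:publish-password".getBytes(StandardCharsets.UTF_8);
        byte[] ciphertext = encrypt(secret, passphrase, salt, iv);
        System.out.println(new String(decrypt(ciphertext, passphrase, salt, iv), StandardCharsets.UTF_8));
    }
}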



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds

2015-01-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1517:
---
Target Version/s:   (was: 1.3.0)

> Publish nightly snapshots of documentation, maven artifacts, and binary builds
> --
>
> Key: SPARK-1517
> URL: https://issues.apache.org/jira/browse/SPARK-1517
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Reporter: Patrick Wendell
>Priority: Blocker
>
> Should be pretty easy to do with Jenkins. The only thing I can think of that 
> would be tricky is to set up credentials so that jenkins can publish this 
> stuff somewhere on apache infra.
> Ideally we don't want to have to put a private key on every jenkins box 
> (since they are otherwise pretty stateless). One idea is to encrypt these 
> credentials with a passphrase and post them somewhere publicly visible. Then 
> the jenkins build can download the credentials provided we set a passphrase 
> in an environment variable in jenkins. There may be simpler solutions as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5483) java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;

2015-01-30 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299009#comment-14299009
 ] 

DeepakVohra edited comment on SPARK-5483 at 1/30/15 6:31 PM:
-

Is the Maven dependency spark-mllib_2.10 not a Spark issue?

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.2.0</version>
</dependency>

There must be some issue in how you are adding the classes to the classpath.

Modifying "2.10" to "2.11" fixes the issue and the org.apache.spark.mllib.* 
packages get found, but it introduces the Scala version issue.


was (Author: dvohra):
Is the Maven dependency spark-mllib_2.10 not a Spark issue?

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.2.0</version>
</dependency>

There must be some issue in how you are adding the classes to the classpath.

Modifying "2.10" to "2.11" fixes the issue and the org.apache.spark.mllib.* 
packages get found.


> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
> ---
>
> Key: SPARK-5483
> URL: https://issues.apache.org/jira/browse/SPARK-5483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Maven
> Spark 1.2
>Reporter: DeepakVohra
>
> The Naive Bayes classifier generates the following error.
> ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.(DenseVectorOps.scala:38)
>   at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
>   at breeze.linalg.DenseVector$.(DenseVector.scala:225)
>   at breeze.linalg.DenseVector$.(DenseVector.scala)
>   at breeze.linalg.DenseVector.(DenseVector.scala:63)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:50)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:55)
>   at org.apache.spark.mllib.linalg.DenseVector.toBreeze(Vectors.scala:329)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:112)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:110)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:200)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:199)
>   at 
> org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:142)
>   at 
> org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
>   at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:205)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/01/28 21:50:06 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-0,5,main]
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.(DenseVectorOps.scala:38)
>   at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
>   at breeze.linalg.DenseVector$.(DenseVector.scala:225)
>   at breeze.linalg.DenseVector$.(DenseVector.scala)
>   

[jira] [Comment Edited] (SPARK-5483) java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;

2015-01-30 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299009#comment-14299009
 ] 

DeepakVohra edited comment on SPARK-5483 at 1/30/15 6:30 PM:
-

Is the Maven dependency spark-mllib_2.10 not a Spark issue?

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.2.0</version>
</dependency>

There must be some issue in how you are adding the classes to the classpath.

Modifying "2.10" to "2.11" fixes the issue and the org.apache.spark.mllib.* 
packages get found.


was (Author: dvohra):
Is the Maven dependency spark-mllib_2.10 not a Spark issue?

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.2.0</version>
</dependency>

There must be some issue in how you are adding the classes to the classpath.

Modifying "2.10" to "2.11" fixes the issue.


> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
> ---
>
> Key: SPARK-5483
> URL: https://issues.apache.org/jira/browse/SPARK-5483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Maven
> Spark 1.2
>Reporter: DeepakVohra
>
> The Naive Bayes classifier generates the following error.
> ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.(DenseVectorOps.scala:38)
>   at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
>   at breeze.linalg.DenseVector$.(DenseVector.scala:225)
>   at breeze.linalg.DenseVector$.(DenseVector.scala)
>   at breeze.linalg.DenseVector.(DenseVector.scala:63)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:50)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:55)
>   at org.apache.spark.mllib.linalg.DenseVector.toBreeze(Vectors.scala:329)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:112)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:110)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:200)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:199)
>   at 
> org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:142)
>   at 
> org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
>   at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:205)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/01/28 21:50:06 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-0,5,main]
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.(DenseVectorOps.scala:38)
>   at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
>   at breeze.linalg.DenseVector$.(DenseVector.scala:225)
>   at breeze.linalg.DenseVector$.(DenseVector.scala)
>   at breeze.linalg.DenseVector.(DenseVector.scala:63)
>   at breeze.linalg.DenseVector$

[jira] [Commented] (SPARK-5483) java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;

2015-01-30 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299009#comment-14299009
 ] 

DeepakVohra commented on SPARK-5483:


Is the Maven dependency spark-mllib_2.10 not a Spark issue?

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.2.0</version>
</dependency>

There must be some issue in how you are adding the classes to the classpath.

Modifying "2.10" to "2.11" fixes the issue.


> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
> ---
>
> Key: SPARK-5483
> URL: https://issues.apache.org/jira/browse/SPARK-5483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Maven
> Spark 1.2
>Reporter: DeepakVohra
>
> The Naive Bayes classifier generates the following error.
> ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.(DenseVectorOps.scala:38)
>   at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
>   at breeze.linalg.DenseVector$.(DenseVector.scala:225)
>   at breeze.linalg.DenseVector$.(DenseVector.scala)
>   at breeze.linalg.DenseVector.(DenseVector.scala:63)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:50)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:55)
>   at org.apache.spark.mllib.linalg.DenseVector.toBreeze(Vectors.scala:329)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:112)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:110)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:200)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:199)
>   at 
> org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:142)
>   at 
> org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
>   at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:205)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/01/28 21:50:06 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-0,5,main]
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.(DenseVectorOps.scala:38)
>   at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
>   at breeze.linalg.DenseVector$.(DenseVector.scala:225)
>   at breeze.linalg.DenseVector$.(DenseVector.scala)
>   at breeze.linalg.DenseVector.(DenseVector.scala:63)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:50)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:55)
>   at org.apache.spark.mllib.linalg.DenseVector.toBreeze(Vectors.scala:329)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:112)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:110)
>   at 
> org.apache.spark.util.collection.ExternalSorter

[jira] [Commented] (SPARK-5489) KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create (I)Lscala/runtime/IntRef;

2015-01-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299006#comment-14299006
 ] 

Sean Owen commented on SPARK-5489:
--

But that *is* the artifact, which you see has the class. It must be an issue in 
your classpath, right?

> KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create  
> (I)Lscala/runtime/IntRef;
> -
>
> Key: SPARK-5489
> URL: https://issues.apache.org/jira/browse/SPARK-5489
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Spark 1.2 
> Maven
>Reporter: DeepakVohra
>
> The KMeans clustering generates the following error, which also seems to be 
> due to a version mismatch between the Scala used for compiling Spark and the 
> Scala in the Spark 1.2 Maven dependency. 
> Exception in thread "main" java.lang.NoSuchMethodError: 
> scala.runtime.IntRef.create
> (I)Lscala/runtime/IntRef;
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:282)
>   at 
> org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:155)
>   at 
> org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:132)
>   at 
> org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:352)
>   at 
> org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:362)
>   at 
> org.apache.spark.mllib.clustering.KMeans.train(KMeans.scala)
>   at 
> clusterer.kmeans.KMeansClusterer.main(KMeansClusterer.java:35)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5489) KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create (I)Lscala/runtime/IntRef;

2015-01-30 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299001#comment-14299001
 ] 

DeepakVohra commented on SPARK-5489:


I already did that before posting the previous message, and the jar does have the 
classes, but they are reported as not found when using the Maven dependency. It 
gets fixed with MLlib 2.11, so the MLlib 2.10 Maven dependency has some issue.

> KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create  
> (I)Lscala/runtime/IntRef;
> -
>
> Key: SPARK-5489
> URL: https://issues.apache.org/jira/browse/SPARK-5489
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Spark 1.2 
> Maven
>Reporter: DeepakVohra
>
> The KMeans clustering generates the following error, which also seems to be 
> due to a version mismatch between the Scala used for compiling Spark and the 
> Scala in the Spark 1.2 Maven dependency. 
> Exception in thread "main" java.lang.NoSuchMethodError: 
> scala.runtime.IntRef.create
> (I)Lscala/runtime/IntRef;
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:282)
>   at 
> org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:155)
>   at 
> org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:132)
>   at 
> org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:352)
>   at 
> org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:362)
>   at 
> org.apache.spark.mllib.clustering.KMeans.train(KMeans.scala)
>   at 
> clusterer.kmeans.KMeansClusterer.main(KMeansClusterer.java:35)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError: Requested array size exceeds VM limit"

2015-01-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-4846.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4247
[https://github.com/apache/spark/pull/4247]

> When the vocabulary size is large, Word2Vec may yield "OutOfMemoryError: 
> Requested array size exceeds VM limit"
> ---
>
> Key: SPARK-4846
> URL: https://issues.apache.org/jira/browse/SPARK-4846
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.1, 1.2.0
> Environment: Use Word2Vec to process a corpus (sized 3.5 GB) with one 
> partition.
> The corpus contains about 300 million words and its vocabulary size is about 
> 10 million.
>Reporter: Joseph Tang
>Assignee: Joseph Tang
>Priority: Minor
> Fix For: 1.3.0
>
>
> Exception in thread "Driver" java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
> Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
> at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
> at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
> at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
> at 
> org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)
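A rough back-of-the-envelope calculation shows why a vocabulary of this size can
blow past the JVM's single-array limit when the model state is serialized into
one byte array, which is one plausible reading of the stack trace above. The
embedding dimension of 100 is an assumed default, not a value taken from the
report.

public final class Word2VecSizeEstimate {
    public static void main(String[] args) {
        long vocabSize = 10_000_000L;      // ~10 million words, as in the issue environment
        long vectorSize = 100L;            // assumed embedding dimension
        long bytesPerFloat = 4L;
        long modelBytes = vocabSize * vectorSize * bytesPerFloat;    // ~4 GB of float weights
        System.out.println("approx model size in bytes: " + modelBytes);        // 4000000000
        System.out.println("max Java array length:      " + Integer.MAX_VALUE); // 2147483647
        // 4,000,000,000 > 2,147,483,647, so one serialized byte array cannot hold it.
    }
}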



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5496) Allow both "classification" and "Classification" in Algo for trees

2015-01-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5496.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4287
[https://github.com/apache/spark/pull/4287]

> Allow both "classification" and "Classification" in Algo for trees
> --
>
> Key: SPARK-5496
> URL: https://issues.apache.org/jira/browse/SPARK-5496
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.3.0
>
>
> We use "classification" in tree but "Classification" in boosting. We switched 
> to "classification" in both cases, but still need to accept "Classification" 
> to be backward compatible.
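A minimal illustration of the backward-compatible parsing described above; the
enum and method names here are hypothetical and not MLlib's actual API.

public enum AlgoSketch {
    CLASSIFICATION, REGRESSION;

    // Accept "classification", "Classification", or any other casing.
    public static AlgoSketch fromString(String name) {
        switch (name.toLowerCase(java.util.Locale.ROOT)) {
            case "classification":
                return CLASSIFICATION;
            case "regression":
                return REGRESSION;
            default:
                throw new IllegalArgumentException("Unknown algo: " + name);
        }
    }

    public static void main(String[] args) {
        // Both spellings resolve to the same value.
        System.out.println(fromString("classification") == fromString("Classification")); // true
    }
}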



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5393) Flood of util.RackResolver log messages after SPARK-1714

2015-01-30 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-5393.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

> Flood of util.RackResolver log messages after SPARK-1714
> 
>
> Key: SPARK-5393
> URL: https://issues.apache.org/jira/browse/SPARK-5393
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.3.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>Priority: Critical
> Fix For: 1.3.0
>
>
> I thought I fixed this while working on the patch, but [~laserson] seems to 
> have encountered it when running on master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3778) newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn

2015-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298872#comment-14298872
 ] 

Apache Spark commented on SPARK-3778:
-

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/4292

> newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn
> -
>
> Key: SPARK-3778
> URL: https://issues.apache.org/jira/browse/SPARK-3778
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Critical
>
> The newAPIHadoopRDD routine doesn't properly add the credentials to the conf 
> to be able to access secure HDFS.
> Note that newAPIHadoopFile does handle these because the 
> org.apache.hadoop.mapreduce.Job automatically adds them for you.
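Based on that note, one possible user-level workaround (illustrative only, and
not the fix from the pull request) is to route the Configuration through
org.apache.hadoop.mapreduce.Job before calling newAPIHadoopRDD, so that the
caller's credentials are attached the same way newAPIHadoopFile gets them. The
input path key and input format below are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public final class SecureNewApiHadoopRdd {
    public static JavaPairRDD<LongWritable, Text> read(JavaSparkContext sc, String path)
            throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.input.fileinputformat.inputdir", path);
        // Constructing a Job propagates the current user's credentials, per the note above.
        Job job = Job.getInstance(conf);
        return sc.newAPIHadoopRDD(job.getConfiguration(),
                TextInputFormat.class, LongWritable.class, Text.class);
    }
}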



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5485) typo in spark streaming configuration parameter

2015-01-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5485.
--
Resolution: Duplicate

> typo in spark streaming configuration parameter
> ---
>
> Key: SPARK-5485
> URL: https://issues.apache.org/jira/browse/SPARK-5485
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.2.0
>Reporter: Wing Yew Poon
>
> In 
> https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#deploying-applications,
>  under "Requirements", the bullet point on "Configuring write ahead logs" says
> "This can be enabled by setting the configuration parameter 
> spark.streaming.receiver.writeAheadLogs.enable to true."
> There is an unfortunate typo in the name of the parameter, which I 
> copied-and-pasted into my deployment where I was testing it out and seeing 
> data loss as a result. 
> The same typo occurs in 
> https://spark.apache.org/docs/1.2.0/configuration.html, which is even more 
> unfortunate.
> Documentation should not have typos like this for configuration parameters. I 
> later found the correct parameter on 
> http://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html.
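For reference, a minimal sketch of enabling the write ahead log from code,
assuming the corrected parameter name is spark.streaming.receiver.writeAheadLog.enable
(singular "writeAheadLog"); please verify the exact name against the
documentation for your Spark version.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public final class WalEnabledStreaming {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("WalEnabledStreaming")
                .setMaster("local[2]")   // local master purely for illustration
                // Assumed corrected (singular) form of the property discussed above.
                .set("spark.streaming.receiver.writeAheadLog.enable", "true");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));
        ssc.checkpoint("hdfs:///tmp/streaming-checkpoint");  // the WAL needs a reliable checkpoint dir
        // ... create receiver-based input streams, then ssc.start() and ssc.awaitTermination() ...
    }
}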



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5491) Chi-square feature selection

2015-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298855#comment-14298855
 ] 

Apache Spark commented on SPARK-5491:
-

User 'avulanov' has created a pull request for this issue:
https://github.com/apache/spark/pull/1484

> Chi-square feature selection
> 
>
> Key: SPARK-5491
> URL: https://issues.apache.org/jira/browse/SPARK-5491
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Alexander Ulanov
>
> Implement chi-square feature selection. PR: 
> https://github.com/apache/spark/pull/1484
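For readers unfamiliar with the method, here is a small, self-contained
illustration of the chi-square statistic that this kind of feature selection is
built on. It is purely illustrative and is not the implementation in the pull
request above.

public final class ChiSquareSketch {
    // counts[i][j] = number of samples with feature value i and class label j.
    static double chiSquare(long[][] counts) {
        int rows = counts.length;
        int cols = counts[0].length;
        long total = 0;
        long[] rowSum = new long[rows];
        long[] colSum = new long[cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowSum[i] += counts[i][j];
                colSum[j] += counts[i][j];
                total += counts[i][j];
            }
        }
        double stat = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                double expected = (double) rowSum[i] * colSum[j] / total;
                double diff = counts[i][j] - expected;
                stat += diff * diff / expected;   // larger => feature more dependent on the label
            }
        }
        return stat;
    }

    public static void main(String[] args) {
        // 2x2 contingency table: feature present/absent vs. class 0/1.
        System.out.println(chiSquare(new long[][] {{30, 10}, {10, 30}}));  // 20.0
    }
}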



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5483) java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;

2015-01-30 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298823#comment-14298823
 ] 

DeepakVohra edited comment on SPARK-5483 at 1/30/15 4:41 PM:
-

The issue is with the Maven dependency

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.2.0</version>
</dependency>

spark-mllib_2.10 does not include the org.apache.spark.mllib.* packages, which 
it should according to 
http://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.10/1.2.0

It generates errors at the import statements:
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;

"The import org.apache.spark.mllib cannot be resolved".

The 2.11 version, spark-mllib_2.11, fixes the error but seems to be referring to 
Scala 2.11.


was (Author: dvohra):
The issue is with the Maven dependency

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.2.0</version>
</dependency>

spark-mllib_2.10 does not include the org.apache.spark.mllib.clustering 
packages, which it should according to the Maven listing at 
http://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.10/1.2.0

It generates errors at the import statements:
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;

"The import org.apache.spark.mllib cannot be resolved".

The 2.11 version, spark-mllib_2.11, fixes the error but seems to be referring to 
Scala 2.11.

> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
> ---
>
> Key: SPARK-5483
> URL: https://issues.apache.org/jira/browse/SPARK-5483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Maven
> Spark 1.2
>Reporter: DeepakVohra
>
> The Naive Bayes classifier generates the following error.
> ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.(DenseVectorOps.scala:38)
>   at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
>   at breeze.linalg.DenseVector$.(DenseVector.scala:225)
>   at breeze.linalg.DenseVector$.(DenseVector.scala)
>   at breeze.linalg.DenseVector.(DenseVector.scala:63)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:50)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:55)
>   at org.apache.spark.mllib.linalg.DenseVector.toBreeze(Vectors.scala:329)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:112)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:110)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:200)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:199)
>   at 
> org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:142)
>   at 
> org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
>   at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:205)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/01/28 21:50:06 ERROR SparkUncaughtExce

[jira] [Resolved] (SPARK-5483) java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;

2015-01-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5483.
--
Resolution: Not a Problem

These too are clearly in the 2.10 artifact. Download it and grep for them. 
https://repo1.maven.org/maven2/org/apache/spark/spark-mllib_2.10/1.2.0/spark-mllib_2.10-1.2.0.jar

There must be some issue in how you are adding the classes to the classpath.

You definitely can't mix Scala versions, and the classes are where they should 
be, so this should be resolved. I think further questions should go to the 
mailing list on this one, until it's clear there is a Spark issue.
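A quick way to carry out that check without leaving the JVM is to open the
downloaded jar and list the MLlib entries; the local file name below is an
assumption, so point it at wherever the artifact was saved.

import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public final class CheckMllibJar {
    public static void main(String[] args) throws Exception {
        try (JarFile jar = new JarFile("spark-mllib_2.10-1.2.0.jar")) {
            jar.stream()
               .map(JarEntry::getName)
               .filter(name -> name.startsWith("org/apache/spark/mllib/"))
               .forEach(System.out::println);   // the classification and clustering classes should be listed
        }
    }
}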

> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
> ---
>
> Key: SPARK-5483
> URL: https://issues.apache.org/jira/browse/SPARK-5483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Maven
> Spark 1.2
>Reporter: DeepakVohra
>
> The Naive Bayes classifier generates the following error.
> ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.(DenseVectorOps.scala:38)
>   at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
>   at breeze.linalg.DenseVector$.(DenseVector.scala:225)
>   at breeze.linalg.DenseVector$.(DenseVector.scala)
>   at breeze.linalg.DenseVector.(DenseVector.scala:63)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:50)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:55)
>   at org.apache.spark.mllib.linalg.DenseVector.toBreeze(Vectors.scala:329)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:112)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:110)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:200)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:199)
>   at 
> org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:142)
>   at 
> org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
>   at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:205)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/01/28 21:50:06 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-0,5,main]
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.(DenseVectorOps.scala:38)
>   at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
>   at breeze.linalg.DenseVector$.(DenseVector.scala:225)
>   at breeze.linalg.DenseVector$.(DenseVector.scala)
>   at breeze.linalg.DenseVector.(DenseVector.scala:63)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:50)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:55)
>   at org.apache.spark.mllib.linalg.DenseVector.toBreeze(Vectors.scala:329)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:112)
>   at 
> org.apache.spark.

[jira] [Commented] (SPARK-5489) KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create (I)Lscala/runtime/IntRef;

2015-01-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298840#comment-14298840
 ] 

Sean Owen commented on SPARK-5489:
--

No, the 2.10 artifact plainly has these classes. Download it ( 
https://repo1.maven.org/maven2/org/apache/spark/spark-mllib_2.10/1.2.0/spark-mllib_2.10-1.2.0.jar
 ) and grep for them. 

> KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create  
> (I)Lscala/runtime/IntRef;
> -
>
> Key: SPARK-5489
> URL: https://issues.apache.org/jira/browse/SPARK-5489
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Spark 1.2 
> Maven
>Reporter: DeepakVohra
>
> The KMeans clustering generates the following error, which also seems to be 
> due to a version mismatch between the Scala used for compiling Spark and the 
> Scala in the Spark 1.2 Maven dependency. 
> Exception in thread "main" java.lang.NoSuchMethodError: 
> scala.runtime.IntRef.create
> (I)Lscala/runtime/IntRef;
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:282)
>   at 
> org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:155)
>   at 
> org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:132)
>   at 
> org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:352)
>   at 
> org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:362)
>   at 
> org.apache.spark.mllib.clustering.KMeans.train(KMeans.scala)
>   at 
> clusterer.kmeans.KMeansClusterer.main(KMeansClusterer.java:35)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5489) KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create (I)Lscala/runtime/IntRef;

2015-01-30 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298824#comment-14298824
 ] 

DeepakVohra edited comment on SPARK-5489 at 1/30/15 4:40 PM:
-

The issue is with the Maven dependency

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.2.0</version>
</dependency>

spark-mllib_2.10 does not include the org.apache.spark.mllib.* packages, which 
it should according to 
http://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.10/1.2.0

It generates errors at the import statements:
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

"The import org.apache.spark.mllib cannot be resolved".

The 2.11 version, spark-mllib_2.11, fixes the error but seems to be referring to 
Scala 2.11.


was (Author: dvohra):
The issue is with the Maven dependency

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.2.0</version>
</dependency>

spark-mllib_2.10 does not include the org.apache.spark.mllib.* packages, which 
it should according to the Maven listing at 
http://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.10/1.2.0

It generates errors at the import statements:
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

"The import org.apache.spark.mllib cannot be resolved".

The 2.11 version, spark-mllib_2.11, fixes the error but seems to be referring to 
Scala 2.11.

> KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create  
> (I)Lscala/runtime/IntRef;
> -
>
> Key: SPARK-5489
> URL: https://issues.apache.org/jira/browse/SPARK-5489
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Spark 1.2 
> Maven
>Reporter: DeepakVohra
>
> The KMeans clustering generates the following error, which also seems to be 
> due to a version mismatch between the Scala used for compiling Spark and the 
> Scala in the Spark 1.2 Maven dependency. 
> Exception in thread "main" java.lang.NoSuchMethodError: 
> scala.runtime.IntRef.create
> (I)Lscala/runtime/IntRef;
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:282)
>   at 
> org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:155)
>   at 
> org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:132)
>   at 
> org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:352)
>   at 
> org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:362)
>   at 
> org.apache.spark.mllib.clustering.KMeans.train(KMeans.scala)
>   at 
> clusterer.kmeans.KMeansClusterer.main(KMeansClusterer.java:35)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5489) KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create (I)Lscala/runtime/IntRef;

2015-01-30 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298824#comment-14298824
 ] 

DeepakVohra edited comment on SPARK-5489 at 1/30/15 4:40 PM:
-

The issue is with the Maven dependency

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.2.0</version>
</dependency>

spark-mllib_2.10 does not include the org.apache.spark.mllib.* packages, which 
it should according to the Maven listing at 
http://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.10/1.2.0

It generates errors at the import statements:
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

"The import org.apache.spark.mllib cannot be resolved".

The 2.11 version, spark-mllib_2.11, fixes the error but seems to be referring to 
Scala 2.11.


was (Author: dvohra):
The issue is with the Maven dependency

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.2.0</version>
</dependency>

spark-mllib_2.10 does not include the org.apache.spark.mllib.clustering 
packages, which it should according to the Maven listing at 
http://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.10/1.2.0

It generates errors at the import statements:
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

"The import org.apache.spark.mllib cannot be resolved".

The 2.11 version, spark-mllib_2.11, fixes the error but seems to be referring to 
Scala 2.11.

> KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create  
> (I)Lscala/runtime/IntRef;
> -
>
> Key: SPARK-5489
> URL: https://issues.apache.org/jira/browse/SPARK-5489
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Spark 1.2 
> Maven
>Reporter: DeepakVohra
>
> The KMeans clustering generates the following error, which also seems to be 
> due to a version mismatch between the Scala used for compiling Spark and the 
> Scala in the Spark 1.2 Maven dependency. 
> Exception in thread "main" java.lang.NoSuchMethodError: 
> scala.runtime.IntRef.create
> (I)Lscala/runtime/IntRef;
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:282)
>   at 
> org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:155)
>   at 
> org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:132)
>   at 
> org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:352)
>   at 
> org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:362)
>   at 
> org.apache.spark.mllib.clustering.KMeans.train(KMeans.scala)
>   at 
> clusterer.kmeans.KMeansClusterer.main(KMeansClusterer.java:35)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5483) java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;

2015-01-30 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298823#comment-14298823
 ] 

DeepakVohra edited comment on SPARK-5483 at 1/30/15 4:39 PM:
-

The issue is with the Maven dependency

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.2.0</version>
</dependency>

spark-mllib_2.10 does not include the org.apache.spark.mllib.clustering 
packages, which it should according to the Maven listing at 
http://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.10/1.2.0

It generates errors at the import statements:
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;

"The import org.apache.spark.mllib cannot be resolved".

The 2.11 version, spark-mllib_2.11, fixes the error but seems to be referring to 
Scala 2.11.


was (Author: dvohra):
The issue is with the Maven dependency

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.2.0</version>
</dependency>

spark-mllib_2.10 does not include the org.apache.spark.mllib.clustering 
packages, which it should according to the Maven listing at 
http://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.10/1.2.0

It generates errors at the import statements:
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

"The import org.apache.spark.mllib cannot be resolved".

The 2.11 version, spark-mllib_2.11, fixes the error but seems to be referring to 
Scala 2.11.

> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
> ---
>
> Key: SPARK-5483
> URL: https://issues.apache.org/jira/browse/SPARK-5483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Maven
> Spark 1.2
>Reporter: DeepakVohra
>
> The Naive Bayes classifier generates the following error.
> ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.(DenseVectorOps.scala:38)
>   at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
>   at breeze.linalg.DenseVector$.(DenseVector.scala:225)
>   at breeze.linalg.DenseVector$.(DenseVector.scala)
>   at breeze.linalg.DenseVector.(DenseVector.scala:63)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:50)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:55)
>   at org.apache.spark.mllib.linalg.DenseVector.toBreeze(Vectors.scala:329)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:112)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:110)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:200)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:199)
>   at 
> org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:142)
>   at 
> org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
>   at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:205)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/01/28 21:50:06 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread

[jira] [Commented] (SPARK-5489) KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create (I)Lscala/runtime/IntRef;

2015-01-30 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298824#comment-14298824
 ] 

DeepakVohra commented on SPARK-5489:


The issue is with the Maven dependency

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.2.0</version>
</dependency>

spark-mllib_2.10 does not include the org.apache.spark.mllib.clustering 
packages, which it should according to the Maven listing at 
http://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.10/1.2.0

It generates errors at the import statements:
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

"The import org.apache.spark.mllib cannot be resolved".

The 2.11 version, spark-mllib_2.11, fixes the error but seems to be referring to 
Scala 2.11.

> KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create  
> (I)Lscala/runtime/IntRef;
> -
>
> Key: SPARK-5489
> URL: https://issues.apache.org/jira/browse/SPARK-5489
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Spark 1.2 
> Maven
>Reporter: DeepakVohra
>
> The KMeans clustering generates the following error, which also seems to be 
> due to a version mismatch between the Scala used for compiling Spark and the 
> Scala in the Spark 1.2 Maven dependency. 
> Exception in thread "main" java.lang.NoSuchMethodError: 
> scala.runtime.IntRef.create
> (I)Lscala/runtime/IntRef;
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:282)
>   at 
> org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:155)
>   at 
> org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:132)
>   at 
> org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:352)
>   at 
> org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:362)
>   at 
> org.apache.spark.mllib.clustering.KMeans.train(KMeans.scala)
>   at 
> clusterer.kmeans.KMeansClusterer.main(KMeansClusterer.java:35)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5483) java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;

2015-01-30 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298823#comment-14298823
 ] 

DeepakVohra commented on SPARK-5483:


The issue is with the Maven dependency

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.2.0</version>
</dependency>

spark-mllib_2.10 does not include the org.apache.spark.mllib.clustering 
packages, which it should according to the Maven listing at 
http://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.10/1.2.0

It generates errors at the import statements:
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

"The import org.apache.spark.mllib cannot be resolved".

The 2.11 version, spark-mllib_2.11, fixes the error but seems to be referring to 
Scala 2.11.

> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
> ---
>
> Key: SPARK-5483
> URL: https://issues.apache.org/jira/browse/SPARK-5483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Maven
> Spark 1.2
>Reporter: DeepakVohra
>
> The Naive Bayes classifier generates the following error.
> ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.(DenseVectorOps.scala:38)
>   at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
>   at breeze.linalg.DenseVector$.(DenseVector.scala:225)
>   at breeze.linalg.DenseVector$.(DenseVector.scala)
>   at breeze.linalg.DenseVector.(DenseVector.scala:63)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:50)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:55)
>   at org.apache.spark.mllib.linalg.DenseVector.toBreeze(Vectors.scala:329)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:112)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:110)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:200)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:199)
>   at 
> org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:142)
>   at 
> org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
>   at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:205)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/01/28 21:50:06 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-0,5,main]
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.(DenseVectorOps.scala:38)
>   at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
>   at breeze.linalg.DenseVector$.(DenseVector.scala:225)
>   at breeze.linalg.DenseVector$.(DenseVector.scala)
>   at breeze.linalg.DenseVector.(DenseVector.scala:63)
>   at breeze.linalg.DenseVector$mcD$sp.(De

[jira] [Resolved] (SPARK-5267) Add a streaming module to ingest Apache Camel Messages from a configured endpoints

2015-01-30 Thread Steve Brewin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Brewin resolved SPARK-5267.
-
Resolution: Done

Code submitted to Spark Packages @ http://spark-packages.org/package/29,
homepage https://github.com/synsys/spark

> Add a streaming module to ingest Apache Camel Messages from a configured 
> endpoints
> --
>
> Key: SPARK-5267
> URL: https://issues.apache.org/jira/browse/SPARK-5267
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.2.0
>Reporter: Steve Brewin
>  Labels: features
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> The number of input stream protocols supported by Spark Streaming is quite 
> limited, which constrains the number of systems with which it can be 
> integrated.
> This proposal solves the problem by adding an optional module that integrates 
> Apache Camel, which supports many additional input protocols. Our tried and 
> tested implementation of this proposal is "spark-streaming-camel". 
> An Apache Camel service is run on a separate Thread, consuming each 
> http://camel.apache.org/maven/current/camel-core/apidocs/org/apache/camel/Message.html
>  and storing it into Spark's memory. The provider of the Message is specified 
> by any consuming component URI documented at 
> http://camel.apache.org/components.html, making all of these protocols 
> available to Spark Streaming.
> Thoughts?
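
For illustration, a minimal sketch of the architecture described above: a Camel 
consumer feeding Spark Streaming through a custom Receiver. This is not the actual 
spark-streaming-camel code (see the package linked in the resolution); the class 
name and the example endpoint URI are placeholders.

{code}
import org.apache.camel.{Exchange, Processor}
import org.apache.camel.builder.RouteBuilder
import org.apache.camel.impl.DefaultCamelContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Illustrative skeleton only -- not the spark-streaming-camel API.
class CamelReceiver(endpointUri: String)
    extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  @transient private var camel: DefaultCamelContext = _

  // onStart must return quickly: Camel consumes on its own threads.
  override def onStart(): Unit = {
    camel = new DefaultCamelContext()
    camel.addRoutes(new RouteBuilder {
      override def configure(): Unit = {
        // endpointUri can be any consuming component URI from
        // http://camel.apache.org/components.html
        from(endpointUri).process(new Processor {
          override def process(exchange: Exchange): Unit =
            // store() hands each Message body to Spark's memory
            store(exchange.getIn.getBody(classOf[String]))
        })
      }
    })
    camel.start()
  }

  override def onStop(): Unit = if (camel != null) camel.stop()
}

// Usage from a StreamingContext ssc:
//   val messages = ssc.receiverStream(new CamelReceiver("stream:in"))
{code}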



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture

2015-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298762#comment-14298762
 ] 

Apache Spark commented on SPARK-5400:
-

User 'tgaloppo' has created a pull request for this issue:
https://github.com/apache/spark/pull/4290

> Rename GaussianMixtureEM to GaussianMixture
> ---
>
> Key: SPARK-5400
> URL: https://issues.apache.org/jira/browse/SPARK-5400
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Travis Galoppo
>Priority: Minor
>
> GaussianMixtureEM is following the old naming convention of including the 
> optimization algorithm name in the class title.  We should probably rename it 
> to GaussianMixture so that it can use other optimization algorithms in the 
> future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3454) Expose JSON representation of data shown in WebUI

2015-01-30 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298747#comment-14298747
 ] 

Imran Rashid commented on SPARK-3454:
-

Design doc attached; would love any feedback.

> Expose JSON representation of data shown in WebUI
> -
>
> Key: SPARK-3454
> URL: https://issues.apache.org/jira/browse/SPARK-3454
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.1.0
>Reporter: Kousuke Saruta
> Attachments: sparkmonitoringjsondesign.pdf
>
>
> If the WebUI supported extracting its data in JSON format, it would be helpful for 
> users who want to analyse stage / task / executor information.
> Fortunately, WebUI has a renderJson method, so we can implement the method in 
> each subclass.
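
For illustration, a minimal json4s sketch of what such a renderJson implementation 
could emit. The StageSummary fields here are made up for the example, not Spark's 
actual UI model; json4s is already bundled with Spark.

{code}
import org.json4s.JsonAST.JValue
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods.{compact, render}

// Hypothetical stage summary -- illustrative fields only.
case class StageSummary(stageId: Int, name: String, numTasks: Int, numCompleteTasks: Int)

// What a renderJson-style override could return for a stages page:
// the same data the HTML table shows, but as JSON.
def stageToJson(s: StageSummary): JValue =
  ("stageId" -> s.stageId) ~
  ("name" -> s.name) ~
  ("numTasks" -> s.numTasks) ~
  ("numCompleteTasks" -> s.numCompleteTasks)

// Example:
// println(compact(render(stageToJson(StageSummary(3, "count at App.scala:42", 8, 8)))))
{code}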



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3454) Expose JSON representation of data shown in WebUI

2015-01-30 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-3454:

Attachment: sparkmonitoringjsondesign.pdf

> Expose JSON representation of data shown in WebUI
> -
>
> Key: SPARK-3454
> URL: https://issues.apache.org/jira/browse/SPARK-3454
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.1.0
>Reporter: Kousuke Saruta
> Attachments: sparkmonitoringjsondesign.pdf
>
>
> If the WebUI supported extracting its data in JSON format, it would be helpful for 
> users who want to analyse stage / task / executor information.
> Fortunately, WebUI has a renderJson method, so we can implement the method in 
> each subclass.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5498) [SPARK-SQL] when the partition schema does not match the table schema, it throws java.lang.ClassCastException and so on

2015-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298648#comment-14298648
 ] 

Apache Spark commented on SPARK-5498:
-

User 'jeanlyn' has created a pull request for this issue:
https://github.com/apache/spark/pull/4289

> [SPARK-SQL] when the partition schema does not match the table schema, it throws 
> java.lang.ClassCastException and so on
> -
>
> Key: SPARK-5498
> URL: https://issues.apache.org/jira/browse/SPARK-5498
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: jeanlyn
>
> When the partition schema does not match the table schema, an exception is thrown 
> while the task is running. For example, if we change the type of a column from int to 
> bigint with the SQL *ALTER TABLE table_with_partition CHANGE COLUMN key key 
> BIGINT* and then query partition data that was stored before the 
> change, we get the exception:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 27.0 failed 4 times, most recent failure: Lost task 0.3 in stage 27.0 
> (TID 30, BJHC-HADOOP-HERA-16950.jeanlyn.local): java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.MutableInt
> at 
> org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:241)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$13$$anonfun$apply$4.apply(TableReader.scala:286)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$13$$anonfun$apply$4.apply(TableReader.scala:286)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:322)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at 
> org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
> at 
> org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
> at 
> org.ap

[jira] [Comment Edited] (SPARK-5499) iterative computing with 1000 iterations causes stage failure

2015-01-30 Thread Tien-Dung LE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298622#comment-14298622
 ] 

Tien-Dung LE edited comment on SPARK-5499 at 1/30/15 1:47 PM:
--

I tried with checkpoint() but had the same error. Here is the code

{code}
for (i <- 1 to 1000) {
  newPair = pair.map(_.swap).persist()

  pair = newPair
  println("" + i + ": count = " + pair.count())

  if( i % 100 == 0) {
pair.checkpoint()
  }
}
{code}


was (Author: tien-dung.le):
I tried with checkpoint() but same had the same error. Here is the code

{code}
for (i <- 1 to 1000) {
  newPair = pair.map(_.swap).persist()

  pair = newPair
  println("" + i + ": count = " + pair.count())

  if( i % 100 == 0) {
pair.checkpoint()
  }
}
{code}

> iterative computing with 1000 iterations causes stage failure
> -
>
> Key: SPARK-5499
> URL: https://issues.apache.org/jira/browse/SPARK-5499
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Tien-Dung LE
>
> I got an error "org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task serialization failed: java.lang.StackOverflowError" when 
> executing an action with 1000 transformations.
> Here is a code snippet to re-produce the error:
> {code}
>   import org.apache.spark.rdd.RDD
>   var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
> var newPair: RDD[(Long,Long)] = null
> for (i <- 1 to 1000) {
>   newPair = pair.map(_.swap)
>   pair = newPair
> }
> println("Count = " + pair.count())
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5499) iterative computing with 1000 iterations causes stage failure

2015-01-30 Thread Tien-Dung LE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298622#comment-14298622
 ] 

Tien-Dung LE commented on SPARK-5499:
-

I tried with checkpoint() but had the same error. Here is the code

{code}
for (i <- 1 to 1000) {
  newPair = pair.map(_.swap).persist()

  pair = newPair
  println("" + i + ": count = " + pair.count())

  if( i % 100 == 0) {
pair.checkpoint()
  }
}
{code}

> iterative computing with 1000 iterations causes stage failure
> -
>
> Key: SPARK-5499
> URL: https://issues.apache.org/jira/browse/SPARK-5499
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Tien-Dung LE
>
> I got an error "org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task serialization failed: java.lang.StackOverflowError" when 
> executing an action with 1000 transformations.
> Here is a code snippet to re-produce the error:
> {code}
>   import org.apache.spark.rdd.RDD
>   var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
> var newPair: RDD[(Long,Long)] = null
> for (i <- 1 to 1000) {
>   newPair = pair.map(_.swap)
>   pair = newPair
> }
> println("Count = " + pair.count())
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5428) Declare the 'assembly' module at the bottom of the <modules> element in the parent POM

2015-01-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5428.
--
Resolution: Won't Fix

> Declare the 'assembly' module at the bottom of the <modules> element in the 
> parent POM
> --
>
> Key: SPARK-5428
> URL: https://issues.apache.org/jira/browse/SPARK-5428
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Deploy
>Reporter: Christian Tzolov
>Priority: Trivial
>  Labels: assembly, maven, pom
>
> For multi-module projects, Maven follows these execution order rules:
> http://maven.apache.org/guides/mini/guide-multiple-modules.html
> If no explicit dependencies are declared, Maven will follow the order declared 
> in the <modules> element.  
> Because the 'assembly' module is responsible for aggregating build artifacts 
> from the other modules/projects, it makes sense for it to run last in the execution 
> chain. 
> At the moment 'assembly' is listed before modules like 'examples', which makes 
> it impossible to generate a DEB package that contains the examples jar. 
> IMHO the 'assembly' module needs to be kept at the bottom of the <modules> list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5499) iterative computing with 1000 iterations causes stage failure

2015-01-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298612#comment-14298612
 ] 

Sean Owen commented on SPARK-5499:
--

Ah, that may be right. persist() should also break the lineage, but here you'd 
still be computing the whole lineage all at once from the start before anything 
can persist. Yes, how about checkpoint()?
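
For reference, a minimal sketch of how to make that lineage truncation take effect 
in the snippet above, assuming a writable checkpoint directory (the path is an 
assumption): checkpoint() is lazy, so an action has to run after it, and persisting 
first avoids recomputing the chain when the checkpoint is written.

{code}
import org.apache.spark.rdd.RDD

// Sketch (spark-shell style): cut the lineage every 100 iterations.
sc.setCheckpointDir("/tmp/spark-checkpoints")   // assumed path

var pair: RDD[(Long, Long)] = sc.parallelize(Array((1L, 2L)))
for (i <- 1 to 1000) {
  pair = pair.map(_.swap)
  if (i % 100 == 0) {
    pair.persist()      // keep the data so the checkpoint job does not recompute it
    pair.checkpoint()   // mark this RDD for checkpointing...
    pair.count()        // ...and materialize it now, so later tasks serialize a short lineage
  }
}
println("Count = " + pair.count())
{code}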

> iterative computing with 1000 iterations causes stage failure
> -
>
> Key: SPARK-5499
> URL: https://issues.apache.org/jira/browse/SPARK-5499
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Tien-Dung LE
>
> I got an error "org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task serialization failed: java.lang.StackOverflowError" when 
> executing an action with 1000 transformations.
> Here is a code snippet to re-produce the error:
> {code}
>   import org.apache.spark.rdd.RDD
>   var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
> var newPair: RDD[(Long,Long)] = null
> for (i <- 1 to 1000) {
>   newPair = pair.map(_.swap)
>   pair = newPair
> }
> println("Count = " + pair.count())
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5499) iterative computing with 1000 iterations causes stage failure

2015-01-30 Thread Tien-Dung LE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298608#comment-14298608
 ] 

Tien-Dung LE commented on SPARK-5499:
-

Thanks Sean Owen for your comment.

Calling persist() or cache() does not help. Did you mean to call checkpoint() ?

> iterative computing with 1000 iterations causes stage failure
> -
>
> Key: SPARK-5499
> URL: https://issues.apache.org/jira/browse/SPARK-5499
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Tien-Dung LE
>
> I got an error "org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task serialization failed: java.lang.StackOverflowError" when 
> executing an action with 1000 transformations.
> Here is a code snippet to re-produce the error:
> {code}
>   import org.apache.spark.rdd.RDD
>   var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
> var newPair: RDD[(Long,Long)] = null
> for (i <- 1 to 1000) {
>   newPair = pair.map(_.swap)
>   pair = newPair
> }
> println("Count = " + pair.count())
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5499) iterative computing with 1000 iterations causes stage failure

2015-01-30 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-5499:

Description: 
I got an error "org.apache.spark.SparkException: Job aborted due to stage 
failure: Task serialization failed: java.lang.StackOverflowError" when 
executing an action with 1000 transformations.

Here is a code snippet to re-produce the error:
{code}
  import org.apache.spark.rdd.RDD
  var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
var newPair: RDD[(Long,Long)] = null

for (i <- 1 to 1000) {
  newPair = pair.map(_.swap)
  pair = newPair
}

println("Count = " + pair.count())

{code}

  was:
I got an error "org.apache.spark.SparkException: Job aborted due to stage 
failure: Task serialization failed: java.lang.StackOverflowError" when 
executing an action with 1000 transformations.

Here is a code snippet to re-produce the error:

  import org.apache.spark.rdd.RDD
  var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
var newPair: RDD[(Long,Long)] = null

for (i <- 1 to 1000) {
  newPair = pair.map(_.swap)
  pair = newPair
}

println("Count = " + pair.count())



> iterative computing with 1000 iterations causes stage failure
> -
>
> Key: SPARK-5499
> URL: https://issues.apache.org/jira/browse/SPARK-5499
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Tien-Dung LE
>
> I got an error "org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task serialization failed: java.lang.StackOverflowError" when 
> executing an action with 1000 transformations.
> Here is a code snippet to re-produce the error:
> {code}
>   import org.apache.spark.rdd.RDD
>   var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
> var newPair: RDD[(Long,Long)] = null
> for (i <- 1 to 1000) {
>   newPair = pair.map(_.swap)
>   pair = newPair
> }
> println("Count = " + pair.count())
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5499) iterative computing with 1000 iterations causes stage failure

2015-01-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298602#comment-14298602
 ] 

Sean Owen commented on SPARK-5499:
--

I think it's expected behavior, in the sense that you have created a lineage of 
1000 RDDs. You would want to break the lineage at some point with a call to 
persist().

> iterative computing with 1000 iterations causes stage failure
> -
>
> Key: SPARK-5499
> URL: https://issues.apache.org/jira/browse/SPARK-5499
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Tien-Dung LE
>
> I got an error "org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task serialization failed: java.lang.StackOverflowError" when 
> executing an action with 1000 transformations.
> Here is a code snippet to re-produce the error:
>   import org.apache.spark.rdd.RDD
>   var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
> var newPair: RDD[(Long,Long)] = null
> for (i <- 1 to 1000) {
>   newPair = pair.map(_.swap)
>   pair = newPair
> }
> println("Count = " + pair.count())



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5499) iterative computing with 1000 iterations causes stage failure

2015-01-30 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-5499:

Description: 
I got an error "org.apache.spark.SparkException: Job aborted due to stage 
failure: Task serialization failed: java.lang.StackOverflowError" when 
executing an action with 1000 transformations.

Here is a code snippet to re-produce the error:

  import org.apache.spark.rdd.RDD
  var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
var newPair: RDD[(Long,Long)] = null

for (i <- 1 to 1000) {
  newPair = pair.map(_.swap)
  pair = newPair
}

println("Count = " + pair.count())


  was:
I got an error "org.apache.spark.SparkException: Job aborted due to stage 
failure: Task serialization failed: java.lang.StackOverflowError" when 
executing an action with 1000 transformations cause.

Here is a code snippet to re-produce the error:

  import org.apache.spark.rdd.RDD
  var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
var newPair: RDD[(Long,Long)] = null

for (i <- 1 to 1000) {
  newPair = pair.map(_.swap)
  pair = newPair
}

println("Count = " + pair.count())



> iterative computing with 1000 iterations causes stage failure
> -
>
> Key: SPARK-5499
> URL: https://issues.apache.org/jira/browse/SPARK-5499
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Tien-Dung LE
>
> I got an error "org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task serialization failed: java.lang.StackOverflowError" when 
> executing an action with 1000 transformations.
> Here is a code snippet to re-produce the error:
>   import org.apache.spark.rdd.RDD
>   var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
> var newPair: RDD[(Long,Long)] = null
> for (i <- 1 to 1000) {
>   newPair = pair.map(_.swap)
>   pair = newPair
> }
> println("Count = " + pair.count())



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5499) iterative computing with 1000 iterations causes stage failure

2015-01-30 Thread Tien-Dung LE (JIRA)
Tien-Dung LE created SPARK-5499:
---

 Summary: iterative computing with 1000 iterations causes stage 
failure
 Key: SPARK-5499
 URL: https://issues.apache.org/jira/browse/SPARK-5499
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Tien-Dung LE


I got an error "org.apache.spark.SparkException: Job aborted due to stage 
failure: Task serialization failed: java.lang.StackOverflowError" when 
executing an action with 1000 transformations.

Here is a code snippet to re-produce the error:

  import org.apache.spark.rdd.RDD
  var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
var newPair: RDD[(Long,Long)] = null

for (i <- 1 to 1000) {
  newPair = pair.map(_.swap)
  pair = newPair
}

println("Count = " + pair.count())




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5489) KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create (I)Lscala/runtime/IntRef;

2015-01-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5489.
--
Resolution: Duplicate

Although this clearly seems to be a problem with mixing artifacts for different 
versions of Scala, it is at least the same problem as in SPARK-5483.

> KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create  
> (I)Lscala/runtime/IntRef;
> -
>
> Key: SPARK-5489
> URL: https://issues.apache.org/jira/browse/SPARK-5489
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Spark 1.2 
> Maven
>Reporter: DeepakVohra
>
> The KMeans clustering generates the following error, which also seems to be due 
> to a version mismatch between the Scala used for compiling Spark and the Scala in 
> the Spark 1.2 Maven dependency. 
> Exception in thread "main" java.lang.NoSuchMethodError: 
> scala.runtime.IntRef.create
> (I)Lscala/runtime/IntRef;
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:282)
>   at 
> org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:155)
>   at 
> org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:132)
>   at 
> org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:352)
>   at 
> org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:362)
>   at 
> org.apache.spark.mllib.clustering.KMeans.train(KMeans.scala)
>   at 
> clusterer.kmeans.KMeansClusterer.main(KMeansClusterer.java:35)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5483) java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;

2015-01-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298600#comment-14298600
 ] 

Sean Owen commented on SPARK-5483:
--

The examples don't set a master on purpose, as I understand it. Like other Spark 
apps, they're supposed to be run with spark-submit, which sets the master. Your 
declaration should mark the Spark dependencies as "provided". However, more 
importantly, you're mixing MLlib built for Scala 2.11 with Scala 2.10 and Core built 
for 2.10. That has to be the problem, right?

> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
> ---
>
> Key: SPARK-5483
> URL: https://issues.apache.org/jira/browse/SPARK-5483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Maven
> Spark 1.2
>Reporter: DeepakVohra
>
> Naive Bayes classifier generates following error.
> ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.(DenseVectorOps.scala:38)
>   at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
>   at breeze.linalg.DenseVector$.(DenseVector.scala:225)
>   at breeze.linalg.DenseVector$.(DenseVector.scala)
>   at breeze.linalg.DenseVector.(DenseVector.scala:63)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:50)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:55)
>   at org.apache.spark.mllib.linalg.DenseVector.toBreeze(Vectors.scala:329)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:112)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:110)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:200)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:199)
>   at 
> org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:142)
>   at 
> org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
>   at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:205)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/01/28 21:50:06 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-0,5,main]
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.(DenseVectorOps.scala:38)
>   at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
>   at breeze.linalg.DenseVector$.(DenseVector.scala:225)
>   at breeze.linalg.DenseVector$.(DenseVector.scala)
>   at breeze.linalg.DenseVector.(DenseVector.scala:63)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:50)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:55)
>   at org.apache.spark.mllib.linalg.DenseVector.toBreeze(Vectors.scala:329)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:112)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:110)
>   at 
> org.apache.s

[jira] [Commented] (SPARK-5185) pyspark --jars does not add classes to driver class path

2015-01-30 Thread Cristian Opris (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298576#comment-14298576
 ] 

Cristian Opris commented on SPARK-5185:
---

I have a similar possible issue. I need to modify the pyspark/IPython notebook 
*driver* classpath at runtime.

While it's possible to modify the application classpath with addJars(), it's not 
possible to modify the driver's classpath from within pyspark or the notebook.

The main use case for this is to allow users to share an IPython server process 
and set the classpath dynamically from within running notebooks.

A possible solution is to load the py4j GatewayServer into a dynamic 
classloader whose classpath can be modified at runtime, as sketched below.

The Clojure shell uses a solution like this; see 
https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/DynamicClassLoader.java
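
For what it's worth, a minimal Scala sketch of that dynamic-classloader idea; the 
class and method names are made up for illustration, not an existing Spark or py4j 
API.

{code}
import java.net.{URL, URLClassLoader}

// Minimal sketch of the Clojure-style DynamicClassLoader idea: a classloader
// whose classpath can grow at runtime. Names here are illustrative only.
class GrowableClassLoader(urls: Array[URL], parent: ClassLoader)
    extends URLClassLoader(urls, parent) {
  // URLClassLoader.addURL is protected; expose it so jars can be added later.
  def addJar(url: URL): Unit = super.addURL(url)
}

// If the py4j GatewayServer (and hence the driver-side classes) were loaded
// through such a loader, a notebook could extend the driver classpath at runtime:
//   loader.addJar(new URL("file:/path/to/extra.jar"))
{code}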


> pyspark --jars does not add classes to driver class path
> 
>
> Key: SPARK-5185
> URL: https://issues.apache.org/jira/browse/SPARK-5185
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Uri Laserson
>
> I have some random class I want access to from an Spark shell, say 
> {{com.cloudera.science.throwaway.ThrowAway}}.  You can find the specific 
> example I used here:
> https://gist.github.com/laserson/e9e3bd265e1c7a896652
> I packaged it as {{throwaway.jar}}.
> If I then run {{bin/spark-shell}} like so:
> {code}
> bin/spark-shell --master local[1] --jars throwaway.jar
> {code}
> I can execute
> {code}
> val a = new com.cloudera.science.throwaway.ThrowAway()
> {code}
> Successfully.
> I now run PySpark like so:
> {code}
> PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars 
> throwaway.jar
> {code}
> which gives me an error when I try to instantiate the class through Py4J:
> {code}
> In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway()
> ---
> Py4JError Traceback (most recent call last)
>  in ()
> > 1 sc._jvm.com.cloudera.science.throwaway.ThrowAway()
> /Users/laserson/repos/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
>  in __getattr__(self, name)
> 724 def __getattr__(self, name):
> 725 if name == '__call__':
> --> 726 raise Py4JError('Trying to call a package.')
> 727 new_fqn = self._fqn + '.' + name
> 728 command = REFLECTION_COMMAND_NAME +\
> Py4JError: Trying to call a package.
> {code}
> However, if I explicitly add the {{--driver-class-path}} to add the same jar
> {code}
> PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars 
> throwaway.jar --driver-class-path throwaway.jar
> {code}
> it works
> {code}
> In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway()
> Out[1]: JavaObject id=o18
> {code}
> However, the docs state that {{--jars}} should also set the driver class path.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5498) [SPARK-SQL] when the partition schema does not match the table schema, it throws java.lang.ClassCastException and so on

2015-01-30 Thread jeanlyn (JIRA)
jeanlyn created SPARK-5498:
--

 Summary: [SPARK-SQL] when the partition schema does not match the table 
schema, it throws java.lang.ClassCastException and so on
 Key: SPARK-5498
 URL: https://issues.apache.org/jira/browse/SPARK-5498
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: jeanlyn


When the partition schema does not match the table schema, an exception is thrown 
while the task is running. For example, if we change the type of a column from int to 
bigint with the SQL *ALTER TABLE table_with_partition CHANGE COLUMN key key 
BIGINT* and then query partition data that was stored before the 
change, we get the exception:
{noformat}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 27.0 failed 4 times, most recent failure: Lost task 0.3 in stage 27.0 
(TID 30, BJHC-HADOOP-HERA-16950.jeanlyn.local): java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableInt
at 
org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:241)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$13$$anonfun$apply$4.apply(TableReader.scala:286)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$13$$anonfun$apply$4.apply(TableReader.scala:286)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:322)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
at 
org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
at a

[jira] [Commented] (SPARK-5495) Offer user the ability to kill application in master web UI for standalone mode

2015-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298415#comment-14298415
 ] 

Apache Spark commented on SPARK-5495:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/4288

> Offer user the ability to kill application in master web UI for standalone 
> mode
> ---
>
> Key: SPARK-5495
> URL: https://issues.apache.org/jira/browse/SPARK-5495
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Reporter: Saisai Shao
>
> Cluster admins, or users who manage the whole cluster, need the 
> ability to kill dangling or long-running applications in a simple 
> way. 
> For example, a user may have started a spark-shell a long time ago that is 
> actually idle without any job running. In this scenario, it is better for the 
> admins to kill that app to free the resources.
> Currently a Spark user can kill a stage in the driver UI, but not an application. So 
> here I'd propose adding a function to kill an application in the master web UI 
> for standalone mode.
> A snapshot of the function is shown below:
> !https://dl.dropboxusercontent.com/u/19230832/master_ui.png!
> This adds a kill action for each active application; the kill action simply 
> stops the specific application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5428) Declare the 'assembly' module at the bottom of the <modules> element in the parent POM

2015-01-30 Thread Christian Tzolov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298412#comment-14298412
 ] 

Christian Tzolov commented on SPARK-5428:
-

To verify this change I've tried to also bundle the spark-yarn-shuffle jar inside 
the DEB package. It didn't work! 
Although it is put at the end of the modules list, the assembly module is executed 
before the Spark Yarn Shuffle project and therefore fails to bundle the 
yarn-shuffle jar. 

The only reliable and clean solution is to declare the required dependency in 
the assembly's POM. 
Having the assembly module at the end of the list does not guarantee that it is 
executed last. 

Unless there are other suggestions, I think we should close this issue.



> Declare the 'assembly' module at the bottom of the <modules> element in the 
> parent POM
> --
>
> Key: SPARK-5428
> URL: https://issues.apache.org/jira/browse/SPARK-5428
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Deploy
>Reporter: Christian Tzolov
>Priority: Trivial
>  Labels: assembly, maven, pom
>
> For multi-module projects, Maven follows these execution order rules:
> http://maven.apache.org/guides/mini/guide-multiple-modules.html
> If no explicit dependencies are declared, Maven will follow the order declared 
> in the <modules> element.  
> Because the 'assembly' module is responsible for aggregating build artifacts 
> from the other modules/projects, it makes sense for it to run last in the execution 
> chain. 
> At the moment 'assembly' is listed before modules like 'examples', which makes 
> it impossible to generate a DEB package that contains the examples jar. 
> IMHO the 'assembly' module needs to be kept at the bottom of the <modules> list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5457) Add missing DSL for ApproxCountDistinct.

2015-01-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-5457.

   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Takuya Ueshin

> Add missing DSL for ApproxCountDistinct.
> 
>
> Key: SPARK-5457
> URL: https://issues.apache.org/jira/browse/SPARK-5457
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 1.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5479) PySpark on yarn mode need to support non-local python files

2015-01-30 Thread Vladimir Grigor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298400#comment-14298400
 ] 

Vladimir Grigor commented on SPARK-5479:


https://github.com/apache/spark/pull/3976 potentially closes this issue

> PySpark on yarn mode need to support non-local python files
> ---
>
> Key: SPARK-5479
> URL: https://issues.apache.org/jira/browse/SPARK-5479
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Lianhui Wang
>
>  In SPARK-5162 [~vgrigor] reports this:
> Currently the following command does not work:
> aws emr add-steps --cluster-id "j-XYWIXMD234" \
> --steps 
> Name=SparkPi,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--py-files,s3://mybucketat.amazonaws.com/tasks/main.py,main.py,param1],ActionOnFailure=CONTINUE
> So we need to support non-local Python files in both yarn-client and yarn-cluster mode.
> Before submitting the application to YARN, we need to download non-local files to a 
> local or HDFS path,
> or spark.yarn.dist.files needs to support other non-local files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5497) start-all script not working properly on Standalone HA cluster (with Zookeeper)

2015-01-30 Thread Roque Vassal'lo (JIRA)
Roque Vassal'lo created SPARK-5497:
--

 Summary: start-all script not working properly on Standalone HA 
cluster (with Zookeeper)
 Key: SPARK-5497
 URL: https://issues.apache.org/jira/browse/SPARK-5497
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.2.0
Reporter: Roque Vassal'lo


I have configured a Standalone HA cluster with Zookeeper with:
- 3 Zookeeper nodes
- 2 Spark master nodes (1 alive and 1 in standby mode)
- 2 Spark slave nodes

Executing start-all.sh on a master starts that master and a worker on each 
configured slave.
If the active master goes down, those workers are supposed to reconfigure themselves 
to use the new active master automatically.

I have noticed that the spark-env property SPARK_MASTER_IP is used in both 
called scripts, start-master and start-slaves.

The problem is that if you configure SPARK_MASTER_IP with the active master's IP, 
workers don't reassign themselves to the new active master when it goes down.
And if you configure SPARK_MASTER_IP with the master cluster's addresses (well, an 
approximation, because you have to append the master port to all but the last IP, 
that is "master1:7077,master2", in order to make it work), the slaves start 
properly but the master doesn't.

So the start-master script needs the SPARK_MASTER_IP property to contain its own IP in 
order to start the master properly, while the start-slaves script needs SPARK_MASTER_IP 
to contain the master cluster's IPs (that is, "master1:7077,master2").

To test that idea, I have modified the start-slaves and spark-env scripts on the 
master nodes.
In spark-env.sh, I have set the SPARK_MASTER_IP property to the master's own IP on each 
master node (that is, on master node 1, SPARK_MASTER_IP=master1; and on master 
node 2, SPARK_MASTER_IP=master2).
In spark-env.sh, I have also added a new property, SPARK_MASTER_CLUSTER_IP, with the 
pseudo-masters-cluster IPs (SPARK_MASTER_CLUSTER_IP=master1:7077,master2) on 
both masters.
In start-slaves.sh, I have changed all references to SPARK_MASTER_IP to 
SPARK_MASTER_CLUSTER_IP.
I have tried that and it works great! When the active master node goes down, all 
workers reassign themselves to the new active node.

Maybe there is a better fix for this issue.
Hope this quick-fix idea can help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5378) There is no return results with 'select' operating on View

2015-01-30 Thread Adrian Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Wang resolved SPARK-5378.

Resolution: Cannot Reproduce

As we discussed offline, this must have come from a bug elsewhere, and it now 
works fine.

> There is no return results with 'select' operating on View
> --
>
> Key: SPARK-5378
> URL: https://issues.apache.org/jira/browse/SPARK-5378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Yi Zhou
>
> There is a 'q04_spark_RUN_QUERY_0_temp_cart_abandon' view with part of the dataset 
> in the system. Nothing is returned when the SQL below is executed in Spark SQL.
> SELECT * FROM q04_spark_RUN_QUERY_0_temp_cart_abandon;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5496) Allow both "classification" and "Classification" in Algo for trees

2015-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298380#comment-14298380
 ] 

Apache Spark commented on SPARK-5496:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/4287

> Allow both "classification" and "Classification" in Algo for trees
> --
>
> Key: SPARK-5496
> URL: https://issues.apache.org/jira/browse/SPARK-5496
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We use "classification" in tree but "Classification" in boosting. We switched 
> to "classification" in both cases, but still need to accept "Classification" 
> to be backward compatible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5496) Allow both "classification" and "Classification" in Algo for trees

2015-01-30 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5496:


 Summary: Allow both "classification" and "Classification" in Algo 
for trees
 Key: SPARK-5496
 URL: https://issues.apache.org/jira/browse/SPARK-5496
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


We use "classification" in tree but "Classification" in boosting. We switched 
to "classification" in both cases, but still need to accept "Classification" to 
be backward compatible.
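
A minimal sketch of one way to accept both spellings (illustrative only; the actual 
change is in the pull request linked from this issue):

{code}
// Sketch only: parse the algo name case-insensitively so both
// "classification" and "Classification" (and "regression"/"Regression") work.
object AlgoNames {
  sealed trait Algo
  case object Classification extends Algo
  case object Regression extends Algo

  def fromString(name: String): Algo = name.toLowerCase match {
    case "classification" => Classification
    case "regression"     => Regression
    case other            => throw new IllegalArgumentException(s"Did not recognize Algo name: $other")
  }
}

// AlgoNames.fromString("Classification") and AlgoNames.fromString("classification")
// both return Classification.
{code}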



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5094) Python API for gradient-boosted trees

2015-01-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5094.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 3951
[https://github.com/apache/spark/pull/3951]

> Python API for gradient-boosted trees
> -
>
> Key: SPARK-5094
> URL: https://issues.apache.org/jira/browse/SPARK-5094
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Kazuki Taniguchi
>Priority: Critical
> Fix For: 1.4.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5094) Python API for gradient-boosted trees

2015-01-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5094:
-
Fix Version/s: (was: 1.4.0)
   1.3.0

> Python API for gradient-boosted trees
> -
>
> Key: SPARK-5094
> URL: https://issues.apache.org/jira/browse/SPARK-5094
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Kazuki Taniguchi
>Priority: Critical
> Fix For: 1.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2015-01-30 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298351#comment-14298351
 ] 

Guoqiang Li edited comment on SPARK-1405 at 1/30/15 8:34 AM:
-

Here is a faster-sampling branch (work in progress): 
https://github.com/witgo/spark/tree/lda_MH
[Its|https://github.com/witgo/spark/tree/lda_MH] computational complexity is 
O(log(K)), where K is the number of topics.
[#2388|https://github.com/apache/spark/pull/2388]'s computational complexity is 
O(log(K) + Ndk), where K is the number of topics and Ndk 
is the number of tokens in document d that are assigned to topic k.


was (Author: gq):
Here is a sampling faster branch(work in progress): 
https://github.com/witgo/spark/tree/lda_MH
[It's|https://github.com/witgo/spark/tree/lda_MH] computational complexity is 
O(log(K))  K is the number of topic 
[#2388|https://github.com/apache/spark/pull/2388]'s computational complexity is 
 O(log(K)) + Nkd,  K is the number of topic  and  Ndk
 is the number of tokens in document d that are assigned to topic k

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Joseph K. Bradley
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms 
> in MLlib, which use optimization algorithms such as gradient descent, 
> LDA uses expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (solved yet), word segmentation (imported from Lucene), 
> and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2015-01-30 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298351#comment-14298351
 ] 

Guoqiang Li edited comment on SPARK-1405 at 1/30/15 8:33 AM:
-

Here is a faster sampling branch (work in progress):
https://github.com/witgo/spark/tree/lda_MH
[Its|https://github.com/witgo/spark/tree/lda_MH] computational complexity is
O(log(K)), where K is the number of topics.
[#2388|https://github.com/apache/spark/pull/2388]'s computational complexity is
O(log(K)) + Nkd, where K is the number of topics and Ndk is the number of
tokens in document d that are assigned to topic k.


was (Author: gq):
Here is a faster sampling branch (work in progress):
https://github.com/witgo/spark/tree/lda_MH
[Its|https://github.com/witgo/spark/tree/lda_MH] computational complexity is
O(log(K)), where K is the number of topics.
[#2388|https://github.com/apache/spark/pull/2388]'s computational complexity is
O(log(K)) + Nkd, where Nkd is the number of topics and Ndk is the number of
tokens in document d that are assigned to topic k.

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Joseph K. Bradley
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model that extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> inference algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (solved yet), a word segmentation step (imported from 
> Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2015-01-30 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298351#comment-14298351
 ] 

Guoqiang Li commented on SPARK-1405:


Here is a faster sampling branch (work in progress):
https://github.com/witgo/spark/tree/lda_MH
[Its|https://github.com/witgo/spark/tree/lda_MH] computational complexity is
O(log(K)), where K is the number of topics.
[#2388|https://github.com/apache/spark/pull/2388]'s computational complexity is
O(log(K)) + Nkd, where Nkd is the number of topics and Ndk is the number of
tokens in document d that are assigned to topic k.

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Joseph K. Bradley
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model that extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> inference algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (solved yet), a word segmentation step (imported from 
> Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5452) We are migrating Tera Data SQL to Spark SQL. Query is taking long time. Please have a look on this issue

2015-01-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-5452.

Resolution: Not a Problem

> We are migrating Tera Data SQL to Spark SQL. Query is taking long time. 
> Please have a look on this issue
> 
>
> Key: SPARK-5452
> URL: https://issues.apache.org/jira/browse/SPARK-5452
> Project: Spark
>  Issue Type: Test
>  Components: Spark Shell
>Affects Versions: 1.2.0
>Reporter: irfan
>  Labels: SparkSql
>
> Hi Team,
> We are migrating Teradata SQL to Spark SQL. Because of its complexity, we 
> have split the query into the 4 sub-queries below, which we run through a 
> Hive context.
> 
> val HIVETMP1 = hc.sql("SELECT PARTY_ACCOUNT_ID AS 
> PARTY_ACCOUNT_ID,LMS_ACCOUNT_ID AS LMS_ACCOUNT_ID FROM VW_PARTY_ACCOUNT WHERE 
>  PARTY_ACCOUNT_TYPE_CODE IN('04') AND  LMS_ACCOUNT_ID  IS NOT NULL")
> HIVETMP1.registerTempTable("VW_HIVETMP1")
> val HIVETMP2 = hc.sql("SELECT PACCNT.LMS_ACCOUNT_ID AS  LMS_ACCOUNT_ID, 
> 'NULL' AS  RANDOM_PARTY_ACCOUNT_ID ,'NULL' AS  MOST_RECENT_SPEND_LA 
> ,STXN.PARTY_ACCOUNT_ID AS  MAX_SPEND_12WKS_LA ,STXN.MAX_SPEND_12WKS_LADATE  
> AS MAX_SPEND_12WKS_LADATE FROM VW_HIVETMP1 AS PACCNT  INNER JOIN (SELECT 
> STXTMP.PARTY_ACCOUNT_ID AS PARTY_ACCOUNT_ID, SUM(CASE WHEN 
> (CAST(STXTMP.TRANSACTION_DATE AS DATE ) > 
> DATE_SUB(CAST(CONCAT(SUBSTRING(SYSTMP.OPTION_VAL,1,4),'-',SUBSTRING(SYSTMP.OPTION_VAL,5,2),'-',SUBSTRING(SYSTMP.OPTION_VAL,7,2))
>  AS DATE),84)) THEN STXTMP.TRANSACTION_VALUE ELSE 0.00 END) AS 
> MAX_SPEND_12WKS_LADATE FROM VW_SHOPPING_TRANSACTION_TABLE AS STXTMP INNER 
> JOIN SYSTEM_OPTION_TABLE AS SYSTMP ON STXTMP.FLAG == SYSTMP.FLAG AND  
> SYSTMP.OPTION_NAME = 'RID' AND STXTMP.PARTY_ACCOUNT_TYPE_CODE IN('04') GROUP 
> BY STXTMP.PARTY_ACCOUNT_ID) AS STXN ON PACCNT.PARTY_ACCOUNT_ID = 
> STXN.PARTY_ACCOUNT_ID WHERE  STXN.MAX_SPEND_12WKS_LADATE IS NOT NULL")
> HIVETMP2.registerTempTable("VW_HIVETMP2")
> val HIVETMP3 = hc.sql("SELECT LMS_ACCOUNT_ID,MAX(MAX_SPEND_12WKS_LA) AS 
> MAX_SPEND_12WKS_LA, 1 AS RANK FROM VW_HIVETMP2 GROUP BY LMS_ACCOUNT_ID")
> HIVETMP3.registerTempTable("VW_HIVETMP3")
> val HIVETMP4 = hc.sql(" SELECT PACCNT.LMS_ACCOUNT_ID,'NULL' AS  
> RANDOM_PARTY_ACCOUNT_ID ,'NULL' AS  
> MOST_RECENT_SPEND_LA,STXN.MAX_SPEND_12WKS_LA AS MAX_SPEND_12WKS_LA,1 AS RANK1 
> FROM VW_HIVETMP2 AS PACCNT INNER JOIN VW_HIVETMP3 AS STXN ON 
> PACCNT.LMS_ACCOUNT_ID = STXN.LMS_ACCOUNT_ID AND PACCNT.MAX_SPEND_12WKS_LA = 
> STXN.MAX_SPEND_12WKS_LA")
> HIVETMP4.registerTempTable("WT03_ACCOUNT_BHVR3")
> HIVETMP4.saveAsTextFile("hdfs:/file/")
> ==
> This query has two GROUP BY clauses that run over huge files (19.5 GB), and 
> it took 40 minutes to produce the final result. Are there any changes to the 
> runtime environment or to Spark's configuration settings that could improve 
> the query performance?
> Below are our environment and configuration details:
> Environment details:
> Number of nodes: 4
> RAM: 62 GB on each node
> Storage capacity: 9 TB on each node
> Total cores: 48
> Spark Configuration:
>  
> .set("spark.default.parallelism","64")
> .set("spark.driver.maxResultSize","2G")
> .set("spark.driver.memory","10g")
> .set("spark.rdd.compress","true")
> .set("spark.shuffle.spill.compress","true")
> .set("spark.shuffle.compress","true")
> .set("spark.shuffle.consolidateFiles","true/false")
> .set("spark.shuffle.spill","true/false") 
>  
> Data file sizes:
> SHOPPING_TRANSACTION: 19.5 GB
> PARTY_ACCOUNT: 1.4 GB
> SYSTEM_OPTIONS: 11.6 KB
> Please help us to resolve the above issue.
> Thanks,
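Since the report above asks about configuration, here is a minimal sketch of knobs commonly tried for shuffle-heavy GROUP BY/JOIN workloads on Spark 1.2 with a Hive context, written in PySpark for brevity (the same settings apply from the Scala shell). The partition count, stand-in table, and query are illustrative assumptions, not values endorsed in this issue.

# Illustrative only -- not a resolution of this issue.
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext, Row

conf = (SparkConf()
        .setAppName("sql-tuning-sketch")
        # Parallelism for SQL shuffles (GROUP BY / JOIN); default is 200.
        # A common starting point is 2-3x the total cores (48 here -> ~128).
        .set("spark.sql.shuffle.partitions", "128")
        .set("spark.default.parallelism", "128")
        .set("spark.rdd.compress", "true")
        .set("spark.shuffle.compress", "true"))
sc = SparkContext(conf=conf)
hc = HiveContext(sc)

# Stand-in for the intermediate VW_HIVETMP2 result from the report, which is
# read twice downstream (once for the MAX aggregation, once for the join).
rows = sc.parallelize([Row(LMS_ACCOUNT_ID=1, MAX_SPEND_12WKS_LA=10.0),
                       Row(LMS_ACCOUNT_ID=2, MAX_SPEND_12WKS_LA=20.0)])
hc.inferSchema(rows).registerTempTable("VW_HIVETMP2")

# Caching the reused intermediate table avoids recomputing it for each query.
hc.cacheTable("VW_HIVETMP2")
result = hc.sql("SELECT LMS_ACCOUNT_ID, MAX(MAX_SPEND_12WKS_LA) "
                "FROM VW_HIVETMP2 GROUP BY LMS_ACCOUNT_ID")
print(result.collect())
sc.stop()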



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5495) Offer user the ability to kill application in master web UI for standalone mode

2015-01-30 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-5495:
--

 Summary: Offer user the ability to kill application in master web 
UI for standalone mode
 Key: SPARK-5495
 URL: https://issues.apache.org/jira/browse/SPARK-5495
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Reporter: Saisai Shao


Cluster admins, or users who manage the whole cluster, need a simple way to 
kill dangling or long-running applications.

For example, a user may have started a spark-shell long ago that is now idle 
with no job running. In this scenario, it is better for the admins to kill 
that app to free its resources.

Currently a Spark user can kill a stage in the driver UI, but not an 
application. So here I'd propose adding a function to kill an application in 
the master web UI for standalone mode.

A snapshot of the function is shown below:

!https://dl.dropboxusercontent.com/u/19230832/master_ui.png!

Add a kill action for each active application; the kill here simply stops the 
specific application.
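For context on what is already possible programmatically (not part of this proposal, and it cancels jobs rather than killing a whole application from the master): jobs can be tagged with a group id from the driver and cancelled as a group. A minimal PySpark sketch; the group id and app name are illustrative:

from pyspark import SparkContext

sc = SparkContext(appName="cancel-sketch")

# Tag subsequent jobs with a group id so they can be cancelled together,
# e.g. from another thread monitoring a stuck interactive session.
sc.setJobGroup("adhoc-session", "long-running ad-hoc jobs")

# ... jobs submitted here would carry the group id ...

# Cancel everything tagged with that group id; the application keeps running.
sc.cancelJobGroup("adhoc-session")

# Only stopping the context ends the application itself (from the driver side,
# not from the master UI as proposed above).
sc.stop()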




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5495) Offer user the ability to kill application in master web UI for standalone mode

2015-01-30 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-5495:
---
Description: 
Cluster admins, or users who manage the whole cluster, need a simple way to 
kill dangling or long-running applications.

For example, a user may have started a spark-shell long ago that is now idle 
with no job running. In this scenario, it is better for the admins to kill 
that app to free its resources.

Currently a Spark user can kill a stage in the driver UI, but not an 
application. So here I'd propose adding a function to kill an application in 
the master web UI for standalone mode.

A snapshot of the function is shown below:

!https://dl.dropboxusercontent.com/u/19230832/master_ui.png!

Add a kill action for each active application; the kill action here simply 
stops the specific application.


  was:
Cluster admins, or users who manage the whole cluster, need a simple way to 
kill dangling or long-running applications.

For example, a user may have started a spark-shell long ago that is now idle 
with no job running. In this scenario, it is better for the admins to kill 
that app to free its resources.

Currently a Spark user can kill a stage in the driver UI, but not an 
application. So here I'd propose adding a function to kill an application in 
the master web UI for standalone mode.

A snapshot of the function is shown below:

!https://dl.dropboxusercontent.com/u/19230832/master_ui.png!

Add a kill action for each active application; the kill here simply stops the 
specific application.



> Offer user the ability to kill application in master web UI for standalone 
> mode
> ---
>
> Key: SPARK-5495
> URL: https://issues.apache.org/jira/browse/SPARK-5495
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Reporter: Saisai Shao
>
> Cluster admins, or users who manage the whole cluster, need a simple way to 
> kill dangling or long-running applications.
> For example, a user may have started a spark-shell long ago that is now idle 
> with no job running. In this scenario, it is better for the admins to kill 
> that app to free its resources.
> Currently a Spark user can kill a stage in the driver UI, but not an 
> application. So here I'd propose adding a function to kill an application in 
> the master web UI for standalone mode.
> A snapshot of the function is shown below:
> !https://dl.dropboxusercontent.com/u/19230832/master_ui.png!
> Add a kill action for each active application; the kill action here simply 
> stops the specific application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org