[jira] [Resolved] (SPARK-3673) Move IndexedRDD from a pull request into a separate repository

2015-01-29 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave resolved SPARK-3673.
---
Resolution: Fixed

> Move IndexedRDD from a pull request into a separate repository
> --
>
> Key: SPARK-3673
> URL: https://issues.apache.org/jira/browse/SPARK-3673
> Project: Spark
>  Issue Type: Sub-task
>  Components: GraphX, Spark Core
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3673) Move IndexedRDD from a pull request into a separate repository

2015-01-29 Thread Alexander Bezzubov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298314#comment-14298314
 ] 

Alexander Bezzubov commented on SPARK-3673:
---

Looks like this was resolved by https://github.com/amplab/spark-indexedrdd

> Move IndexedRDD from a pull request into a separate repository
> --
>
> Key: SPARK-3673
> URL: https://issues.apache.org/jira/browse/SPARK-3673
> Project: Spark
>  Issue Type: Sub-task
>  Components: GraphX, Spark Core
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5494) SparkSqlSerializer Ignores KryoRegistrators

2015-01-29 Thread Hamel Ajay Kothari (JIRA)
Hamel Ajay Kothari created SPARK-5494:
-

 Summary: SparkSqlSerializer Ignores KryoRegistrators
 Key: SPARK-5494
 URL: https://issues.apache.org/jira/browse/SPARK-5494
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Hamel Ajay Kothari


We should make SparkSqlSerializer call {{super.newKryo}} before doing any of 
its custom setup, in order to make sure it picks up custom KryoRegistrators.
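
A minimal sketch of the intended pattern (illustrative class name, not the actual 
SparkSqlSerializer source):

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// Sketch only: obtain the Kryo instance from super.newKryo(), which applies any
// user-configured KryoRegistrator (spark.kryo.registrator), and then layer the
// SQL-specific registrations on top instead of building a fresh Kryo instance.
class SqlAwareKryoSerializer(conf: SparkConf) extends KryoSerializer(conf) {
  override def newKryo(): Kryo = {
    val kryo = super.newKryo()            // picks up custom registrators first
    kryo.register(classOf[BigDecimal])    // example of an extra registration
    kryo
  }
}
{code}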



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5322) Add transpose() to BlockMatrix

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5322.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4275
[https://github.com/apache/spark/pull/4275]

> Add transpose() to BlockMatrix
> --
>
> Key: SPARK-5322
> URL: https://issues.apache.org/jira/browse/SPARK-5322
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Burak Yavuz
> Fix For: 1.3.0
>
>
> Once Local matrices have the option to transpose, transposing a BlockMatrix 
> will be trivial. Again, this will be a flag, which will in the end affect 
> every SubMatrix in the RDD.
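
A minimal sketch of that idea (illustrative types only, not the MLlib BlockMatrix API):

{code}
// Sketch: transposing a block matrix amounts to swapping each block's
// (rowIndex, colIndex) and transposing the local block itself; a lazy flag
// would simply defer the per-block transpose until the blocks are used.
type Block = ((Int, Int), Array[Array[Double]])

def transposeLocal(m: Array[Array[Double]]): Array[Array[Double]] =
  Array.tabulate(m.head.length, m.length)((i, j) => m(j)(i))

def transposeBlocks(blocks: Seq[Block]): Seq[Block] =
  blocks.map { case ((i, j), local) => ((j, i), transposeLocal(local)) }
{code}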



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5322) Add transpose() to BlockMatrix

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5322:
-
Assignee: Burak Yavuz

> Add transpose() to BlockMatrix
> --
>
> Key: SPARK-5322
> URL: https://issues.apache.org/jira/browse/SPARK-5322
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 1.3.0
>
>
> Once Local matrices have the option to transpose, transposing a BlockMatrix 
> will be trivial. Again, this will be a flag, which will in the end affect 
> every SubMatrix in the RDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets

2015-01-29 Thread Michael Hynes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298192#comment-14298192
 ] 

Michael Hynes commented on SPARK-3080:
--

What is the status of this SimpleALS.scala rewrite? Are you planning to merge 
it into the master branch to replace the current implementation? 

> ArrayIndexOutOfBoundsException in ALS for Large datasets
> 
>
> Key: SPARK-3080
> URL: https://issues.apache.org/jira/browse/SPARK-3080
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Burak Yavuz
>Assignee: Xiangrui Meng
>
> The stack trace is below:
> {quote}
> java.lang.ArrayIndexOutOfBoundsException: 2716
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> 
> org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> {quote}
> This happened after the dataset was sub-sampled. 
> Dataset properties: ~12B ratings
> Setup: 55 r3.8xlarge ec2 instances



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5492) Thread statistics can break with older Hadoop versions

2015-01-29 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298190#comment-14298190
 ] 

Sandy Ryza commented on SPARK-5492:
---

Are you able to provide any more detail on the environment this occurred in?

I think all versions of Hadoop that don't expose StatisticsData are also 
missing a getThreadStatistics method, so they should run into a 
NoSuchMethodException at 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L160
 and not make it down to the ClassNotFoundException.

It's probably good to guard against the ClassNotFoundException anyway, but not 
sure how this would come up.
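
For reference, a simplified sketch of the kind of guard being discussed (not the 
actual SparkHadoopUtil code):

{code}
// Sketch: treat both a missing class and a missing method as "statistics not
// available on this Hadoop version" instead of letting the exception escape.
def threadStatsDataMethod(className: String): Option[java.lang.reflect.Method] =
  try {
    val cls = Class.forName(className) // e.g. FileSystem$Statistics$StatisticsData
    Some(cls.getDeclaredMethod("getBytesRead"))
  } catch {
    case _: ClassNotFoundException => None // class absent in older Hadoop
    case _: NoSuchMethodException  => None // method absent in older Hadoop
  }
{code}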

> Thread statistics can break with older Hadoop versions
> --
>
> Key: SPARK-5492
> URL: https://issues.apache.org/jira/browse/SPARK-5492
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Sandy Ryza
>Priority: Blocker
>
> {code}
>  java.lang.ClassNotFoundException: 
> org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:191)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatisticsMethod(SparkHadoopUtil.scala:180)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:139)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$2.apply(NewHadoopRDD.scala:120)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$2.apply(NewHadoopRDD.scala:118)
> at scala.Option.orElse(Option.scala:257)
> {code}
> I think the issue is we need to catch ClassNotFoundException here:
> https://github.com/apache/spark/blob/b1b35ca2e440df40b253bf967bb93705d355c1c0/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L144
> However, I'm really confused how this didn't fail our unit tests, since we 
> explicitly tried to test this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5493) Support proxy users under kerberos

2015-01-29 Thread Brock Noland (JIRA)
Brock Noland created SPARK-5493:
---

 Summary: Support proxy users under kerberos
 Key: SPARK-5493
 URL: https://issues.apache.org/jira/browse/SPARK-5493
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Brock Noland


When using Kerberos, services may want to use spark-submit to submit jobs as a 
separate user. For example, a service like Oozie might want to submit jobs as a 
client user.
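
For context, the standard Hadoop impersonation pattern such a feature would likely 
build on looks roughly like this (a sketch, not Spark's implementation):

{code}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// Sketch: the service's kerberos login acts on behalf of the client user
// ("proxy user"), provided the cluster's hadoop.proxyuser.* settings allow it.
val proxyUgi = UserGroupInformation.createProxyUser(
  "clientUser", UserGroupInformation.getLoginUser())

proxyUgi.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    // submit the job / create the SparkContext as "clientUser" here
  }
})
{code}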



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5492) Thread statistics can break with older Hadoop versions

2015-01-29 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298138#comment-14298138
 ] 

Sandy Ryza commented on SPARK-5492:
---

Very weird.  I'll look into it.  Did that come up during a test?

> Thread statistics can break with older Hadoop versions
> --
>
> Key: SPARK-5492
> URL: https://issues.apache.org/jira/browse/SPARK-5492
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Sandy Ryza
>Priority: Blocker
>
> {code}
>  java.lang.ClassNotFoundException: 
> org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:191)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatisticsMethod(SparkHadoopUtil.scala:180)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:139)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$2.apply(NewHadoopRDD.scala:120)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$2.apply(NewHadoopRDD.scala:118)
> at scala.Option.orElse(Option.scala:257)
> {code}
> I think the issue is we need to catch ClassNotFoundException here:
> https://github.com/apache/spark/blob/b1b35ca2e440df40b253bf967bb93705d355c1c0/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L144
> However, I'm really confused how this didn't fail our unit tests, since we 
> explicitly tried to test this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5492) Thread statistics can break with older Hadoop versions

2015-01-29 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned SPARK-5492:
-

Assignee: Sandy Ryza

> Thread statistics can break with older Hadoop versions
> --
>
> Key: SPARK-5492
> URL: https://issues.apache.org/jira/browse/SPARK-5492
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Sandy Ryza
>Priority: Blocker
>
> {code}
>  java.lang.ClassNotFoundException: 
> org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:191)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatisticsMethod(SparkHadoopUtil.scala:180)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:139)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$2.apply(NewHadoopRDD.scala:120)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$2.apply(NewHadoopRDD.scala:118)
> at scala.Option.orElse(Option.scala:257)
> {code}
> I think the issue is we need to catch ClassNotFoundException here:
> https://github.com/apache/spark/blob/b1b35ca2e440df40b253bf967bb93705d355c1c0/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L144
> However, I'm really confused how this didn't fail our unit tests, since we 
> explicitly tried to test this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3976) Detect block matrix partitioning schemes

2015-01-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298132#comment-14298132
 ] 

Apache Spark commented on SPARK-3976:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/4286

> Detect block matrix partitioning schemes
> 
>
> Key: SPARK-3976
> URL: https://issues.apache.org/jira/browse/SPARK-3976
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Reza Zadeh
>
> Provide repartitioning methods for block matrices to repartition matrix for 
> add/multiply of non-identically partitioned matrices



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3996) Shade Jetty in Spark deliverables

2015-01-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298128#comment-14298128
 ] 

Apache Spark commented on SPARK-3996:
-

User 'pwendell' has created a pull request for this issue:
https://github.com/apache/spark/pull/4285

> Shade Jetty in Spark deliverables
> -
>
> Key: SPARK-3996
> URL: https://issues.apache.org/jira/browse/SPARK-3996
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Mingyu Kim
>Assignee: Patrick Wendell
> Fix For: 1.3.0
>
>
> We'd like to use Spark in a Jetty 9 server, and it's causing a version 
> conflict. Given that Spark's dependency on Jetty is light, it'd be a good 
> idea to shade this dependency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5462) Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in DataFrames returned from sqlCtx.sql()

2015-01-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-5462.

   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Josh Rosen

> Catalyst UnresolvedException "Invalid call to qualifiers on unresolved 
> object" error when accessing fields in DataFrames returned from sqlCtx.sql()
> ---
>
> Key: SPARK-5462
> URL: https://issues.apache.org/jira/browse/SPARK-5462
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
> Fix For: 1.3.0
>
>
> When trying to access fields on a Python DataFrame created via inferSchema, I 
> ran into a confusing Catalyst Py4J error.  Here's a reproduction:
> {code}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext, Row
> sc = SparkContext("local", "test")
> sqlContext = SQLContext(sc)
> # Load a text file and convert each line to a Row.
> lines = sc.textFile("examples/src/main/resources/people.txt")
> parts = lines.map(lambda l: l.split(","))
> people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
> # Infer the schema, and register the SchemaRDD as a table.
> schemaPeople = sqlContext.inferSchema(people)
> schemaPeople.registerTempTable("people")
> # SQL can be run over SchemaRDDs that have been registered as a table.
> teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age 
> <= 19")
> print teenagers.name
> {code}
> This fails with the following error:
> {code}
> Traceback (most recent call last):
>   File "/Users/joshrosen/Documents/spark/sqltest.py", line 19, in 
> print teenagers.name
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, 
> in __getattr__
> return Column(self._jdf.apply(name))
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o66.apply.
> : org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> qualifiers on unresolved object, tree: 'name
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:46)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:143)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:140)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:140)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This is distinct from the helpful error message that I get when trying to 
> access a non-existent column.  This error didn't occur when I tried the same 
> thing with a DataFrame created via jsonRDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-5492) Thread statistics can break with older Hadoop versions

2015-01-29 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298075#comment-14298075
 ] 

Patrick Wendell commented on SPARK-5492:


/cc [~sandyr]

> Thread statistics can break with older Hadoop versions
> --
>
> Key: SPARK-5492
> URL: https://issues.apache.org/jira/browse/SPARK-5492
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Priority: Blocker
>
> {code}
>  java.lang.ClassNotFoundException: 
> org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:191)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatisticsMethod(SparkHadoopUtil.scala:180)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:139)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$2.apply(NewHadoopRDD.scala:120)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$2.apply(NewHadoopRDD.scala:118)
> at scala.Option.orElse(Option.scala:257)
> {code}
> I think the issue is we need to catch ClassNotFoundException here:
> https://github.com/apache/spark/blob/b1b35ca2e440df40b253bf967bb93705d355c1c0/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L144
> However, I'm really confused how this didn't fail our unit tests, since we 
> explicitly tried to test this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5492) Thread statistics can break with older Hadoop versions

2015-01-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5492:
---
Priority: Blocker  (was: Major)

> Thread statistics can break with older Hadoop versions
> --
>
> Key: SPARK-5492
> URL: https://issues.apache.org/jira/browse/SPARK-5492
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Priority: Blocker
>
> {code}
>  java.lang.ClassNotFoundException: 
> org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:191)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatisticsMethod(SparkHadoopUtil.scala:180)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:139)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$2.apply(NewHadoopRDD.scala:120)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$2.apply(NewHadoopRDD.scala:118)
> at scala.Option.orElse(Option.scala:257)
> {code}
> I think the issue is we need to catch ClassNotFoundException here:
> https://github.com/apache/spark/blob/b1b35ca2e440df40b253bf967bb93705d355c1c0/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L144
> However, I'm really confused how this didn't fail our unit tests, since we 
> explicitly tried to test this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5492) Thread statistics can break with older Hadoop versions

2015-01-29 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-5492:
--

 Summary: Thread statistics can break with older Hadoop versions
 Key: SPARK-5492
 URL: https://issues.apache.org/jira/browse/SPARK-5492
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell


{code}
 java.lang.ClassNotFoundException: 
org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:191)
at 
org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatisticsMethod(SparkHadoopUtil.scala:180)
at 
org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:139)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$2.apply(NewHadoopRDD.scala:120)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$2.apply(NewHadoopRDD.scala:118)
at scala.Option.orElse(Option.scala:257)
{code}

I think the issue is we need to catch ClassNotFoundException here:
https://github.com/apache/spark/blob/b1b35ca2e440df40b253bf967bb93705d355c1c0/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L144

However, I'm really confused how this didn't fail our unit tests, since we 
explicitly tried to test this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5489) KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create (I)Lscala/runtime/IntRef;

2015-01-29 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298063#comment-14298063
 ] 

DeepakVohra commented on SPARK-5489:


If Scala 2.11.1 is used, scala.Cloneable is not found; it is available in 
Scala 2.10.4 but not in Scala 2.11.1. 
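
If this is a Scala binary-version mismatch, the usual remedy is to make the build's 
Scala version agree with the Spark artifacts' binary suffix; a hypothetical sbt example:

{code}
// Hypothetical sbt settings: the application must be compiled with the same
// Scala binary version the Spark artifacts were built for (2.10.x here),
// otherwise errors like the NoSuchMethodError above can appear at runtime.
scalaVersion := "2.10.4"

// %% resolves to spark-mllib_2.10 when scalaVersion is 2.10.x
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.2.0"
{code}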

> KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create  
> (I)Lscala/runtime/IntRef;
> -
>
> Key: SPARK-5489
> URL: https://issues.apache.org/jira/browse/SPARK-5489
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Spark 1.2 
> Maven
>Reporter: DeepakVohra
>
> The KMeans clustering generates the following error, which also seems to be due 
> to a version mismatch between the Scala used to compile Spark and the Scala in 
> the Spark 1.2 Maven dependency. 
> Exception in thread "main" java.lang.NoSuchMethodError: 
> scala.runtime.IntRef.create
> (I)Lscala/runtime/IntRef;
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:282)
>   at 
> org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:155)
>   at 
> org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:132)
>   at 
> org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:352)
>   at 
> org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:362)
>   at 
> org.apache.spark.mllib.clustering.KMeans.train(KMeans.scala)
>   at 
> clusterer.kmeans.KMeansClusterer.main(KMeansClusterer.java:35)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5454) [SQL] Self join with ArrayType columns problems

2015-01-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298048#comment-14298048
 ] 

Apache Spark commented on SPARK-5454:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/4284

> [SQL] Self join with ArrayType columns problems
> ---
>
> Key: SPARK-5454
> URL: https://issues.apache.org/jira/browse/SPARK-5454
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Pierre Borckmans
>
> Weird behaviour when performing self join on a table with some ArrayType 
> field.  (potential bug ?) 
> I have set up a minimal non working example here: 
> https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f
> In a nutshell, if the ArrayType column used for the pivot is created manually 
> in the StructType definition, everything works as expected. 
> However, if the ArrayType pivot column is obtained by a sql query (be it by 
> using a "array" wrapper, or using a collect_list operator for instance), then 
> results are completely off. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1473) Feature selection for high dimensional datasets

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1473:
-
Assignee: (was: Alexander Ulanov)

> Feature selection for high dimensional datasets
> ---
>
> Key: SPARK-1473
> URL: https://issues.apache.org/jira/browse/SPARK-1473
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Ignacio Zendejas
>Priority: Minor
>  Labels: features
>
> For classification tasks involving large feature spaces on the order of tens 
> of thousands of features or higher (e.g., text classification with n-grams, where n > 1), 
> it is often useful to rank and filter out irrelevant features, thereby 
> reducing the feature space by at least one or two orders of magnitude without 
> impacting performance on key evaluation metrics (accuracy/precision/recall).
> A flexible feature evaluation interface needs to be designed, and at 
> least two methods should be implemented, with Information Gain being a 
> priority as it has been shown to be amongst the most reliable.
> Special consideration should be taken in the design to account for wrapper 
> methods (see research papers below) which are more practical for lower 
> dimensional data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
> likelihood maximisation: a unifying framework for information theoretic
> feature selection.*The Journal of Machine Learning Research*, *13*, 27-66.
> * Forman, George. "An extensive empirical study of feature selection metrics 
> for text classification." The Journal of machine learning research 3 (2003): 
> 1289-1305.
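
To ground the Information Gain criterion mentioned above, a small illustrative 
sketch for a categorical feature and label (not a proposed interface):

{code}
// Information gain of a categorical feature with respect to a categorical
// label: IG = H(label) - sum over feature values of p(value) * H(label | value).
def entropy(labels: Seq[Int]): Double = {
  val n = labels.size.toDouble
  labels.groupBy(identity).values.map { group =>
    val p = group.size / n
    -p * math.log(p) / math.log(2)
  }.sum
}

def infoGain(feature: Seq[Int], label: Seq[Int]): Double = {
  val n = label.size.toDouble
  val conditional = feature.zip(label).groupBy(_._1).values.map { group =>
    (group.size / n) * entropy(group.map(_._2))
  }.sum
  entropy(label) - conditional
}
{code}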



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1473) Feature selection for high dimensional datasets

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1473:
-
Target Version/s:   (was: 1.3.0)

> Feature selection for high dimensional datasets
> ---
>
> Key: SPARK-1473
> URL: https://issues.apache.org/jira/browse/SPARK-1473
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Ignacio Zendejas
>Assignee: Alexander Ulanov
>Priority: Minor
>  Labels: features
>
> For classification tasks involving large feature spaces on the order of tens 
> of thousands of features or higher (e.g., text classification with n-grams, where n > 1), 
> it is often useful to rank and filter out irrelevant features, thereby 
> reducing the feature space by at least one or two orders of magnitude without 
> impacting performance on key evaluation metrics (accuracy/precision/recall).
> A flexible feature evaluation interface needs to be designed, and at 
> least two methods should be implemented, with Information Gain being a 
> priority as it has been shown to be amongst the most reliable.
> Special consideration should be taken in the design to account for wrapper 
> methods (see research papers below) which are more practical for lower 
> dimensional data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
> likelihood maximisation: a unifying framework for information theoretic
> feature selection.*The Journal of Machine Learning Research*, *13*, 27-66.
> * Forman, George. "An extensive empirical study of feature selection metrics 
> for text classification." The Journal of machine learning research 3 (2003): 
> 1289-1305.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5491) Chi-square feature selection

2015-01-29 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5491:


 Summary: Chi-square feature selection
 Key: SPARK-5491
 URL: https://issues.apache.org/jira/browse/SPARK-5491
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Alexander Ulanov


Implement chi-square feature selection. PR: 
https://github.com/apache/spark/pull/1484
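
For intuition, a tiny self-contained sketch of the per-feature chi-squared statistic 
on a contingency table (illustrative only, not the MLlib code in the PR):

{code}
// Chi-squared statistic from observed counts: counts(i)(j) is the number of
// examples with feature value i and class label j. Larger values indicate a
// stronger dependence between the feature and the label.
def chiSquared(counts: Array[Array[Double]]): Double = {
  val total   = counts.map(_.sum).sum
  val rowSums = counts.map(_.sum)
  val colSums = counts.head.indices.map(j => counts.map(_(j)).sum)
  (for {
    i <- counts.indices
    j <- counts(i).indices
  } yield {
    val expected = rowSums(i) * colSums(j) / total
    val diff = counts(i)(j) - expected
    diff * diff / expected
  }).sum
}

// Example: feature present/absent vs. label positive/negative
// chiSquared(Array(Array(30.0, 10.0), Array(20.0, 40.0)))
{code}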



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5395) Large number of Python workers causing resource depletion

2015-01-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5395:
--
Target Version/s: 1.3.0, 1.2.2
   Fix Version/s: 1.3.0
Assignee: Davies Liu
  Labels: backport-needed  (was: )

I've committed Davies' patch (https://github.com/apache/spark/pull/4238) to 
{{master}} for inclusion in Spark 1.3.0 and tagged it for later backport to 
Spark 1.2.2. (I'll cherry-pick the commit after we close the 1.2.1 vote).

> Large number of Python workers causing resource depletion
> -
>
> Key: SPARK-5395
> URL: https://issues.apache.org/jira/browse/SPARK-5395
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0, 1.3.0
> Environment: AWS ElasticMapReduce
>Reporter: Sven Krasser
>Assignee: Davies Liu
>  Labels: backport-needed
> Fix For: 1.3.0
>
>
> During job execution a large number of Python workers accumulate, eventually 
> causing YARN to kill containers for being over their memory allocation (in 
> the case below that is about 8G for executors plus 6G for overhead per 
> container). 
> In this instance, at the time of killing the container 97 pyspark.daemon 
> processes had accumulated.
> {noformat}
> 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler 
> (Logging.scala:logInfo(59)) - Container marked as failed: 
> container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: 
> Container [pid=35211,containerID=container_1421692415636_0052_01_30] is 
> running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB 
> physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing 
> container.
> Dump of the process-tree for container_1421692415636_0052_01_30 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) 
> VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m 
> pyspark.daemon
> |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m 
> pyspark.daemon
> |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m 
> pyspark.daemon
>   [...]
> {noformat}
> The configuration used 64 containers with 2 cores each.
> Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c
> Mailinglist discussion: 
> https://www.mail-archive.com/user@spark.apache.org/msg20102.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2199) Distributed probabilistic latent semantic analysis in MLlib

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2199:
-
Target Version/s: 1.4.0  (was: 1.3.0)

> Distributed probabilistic latent semantic analysis in MLlib
> ---
>
> Key: SPARK-2199
> URL: https://issues.apache.org/jira/browse/SPARK-2199
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Denis Turdakov
>Assignee: Valeriy Avanesov
>  Labels: features
>
> Probabilistic latent semantic analysis (PLSA) is a topic model which extracts 
> topics from a text corpus. PLSA was historically a predecessor of LDA. However, 
> recent research shows that modifications of PLSA sometimes perform better 
> than LDA[1]. Furthermore, the most recent paper by the same authors shows that 
> there is a clear way to extend PLSA to LDA and beyond[2].
> We should implement distributed versions of PLSA. In addition, it should be 
> possible to easily add user-defined regularizers or combinations of them. We 
> will implement regularizers that allow us to
> * extract sparse topics
> * extract human interpretable topics 
> * perform semi-supervised training 
> * sort out non-topic specific terms. 
> [1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In 
> Proceedings of ECIR'13.
> [2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive 
> Regularization for Stochastic Matrix Factorization. 
> http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3080:
-
Target Version/s:   (was: 1.3.0)

> ArrayIndexOutOfBoundsException in ALS for Large datasets
> 
>
> Key: SPARK-3080
> URL: https://issues.apache.org/jira/browse/SPARK-3080
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Burak Yavuz
>Assignee: Xiangrui Meng
>
> The stack trace is below:
> {quote}
> java.lang.ArrayIndexOutOfBoundsException: 2716
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> 
> org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> {quote}
> This happened after the dataset was sub-sampled. 
> Dataset properties: ~12B ratings
> Setup: 55 r3.8xlarge ec2 instances



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3147) Implement A/B testing

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3147:
-
Target Version/s:   (was: 1.3.0)

> Implement A/B testing
> -
>
> Key: SPARK-3147
> URL: https://issues.apache.org/jira/browse/SPARK-3147
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, Streaming
>Reporter: Xiangrui Meng
>
> A/B testing is widely used to compare online models. We can implement A/B 
> testing in MLlib and integrate it with Spark Streaming. For example, we have 
> a PairDStream[String, Double], whose keys are model ids and values are 
> observations (click or not, or revenue associated with the event). With A/B 
> testing, we can tell whether one model is significantly better than another 
> at a certain time. There are some caveats. For example, we should avoid 
> multiple testing and support A/A testing as a sanity check.  
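
To make "significantly better" concrete, a toy batch-form sketch of the kind of test 
statistic involved (not a design for the streaming API):

{code}
// Welch's t statistic comparing observations collected for two model ids;
// a streaming version would maintain the counts, sums and squared sums
// incrementally per batch instead of keeping the raw observations.
def welchT(a: Seq[Double], b: Seq[Double]): Double = {
  def meanVar(xs: Seq[Double]): (Double, Double) = {
    val m = xs.sum / xs.size
    val v = xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
    (m, v)
  }
  val (ma, va) = meanVar(a)
  val (mb, vb) = meanVar(b)
  (ma - mb) / math.sqrt(va / a.size + vb / b.size)
}
{code}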



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4259) Add Power Iteration Clustering Algorithm with Gaussian Similarity Function

2015-01-29 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298030#comment-14298030
 ] 

Xiangrui Meng commented on SPARK-4259:
--

[~andrew.musselman] PIC is more or less a spectral clustering algorithm. It 
should produce similar results when there is a significant gap between the 
second and the third eigenvalues. If there is no such gap, it creates a 
weighted combination, which should work well in practice. Feel free to create a 
new JIRA for the original spectral clustering algorithm. But note that our goal 
is not to provide reference machine learning implementations. If PIC is an 
alternative to the original spectral clustering and it is more scalable, we 
don't want to maintain two implementations.

> Add Power Iteration Clustering Algorithm with Gaussian Similarity Function
> --
>
> Key: SPARK-4259
> URL: https://issues.apache.org/jira/browse/SPARK-4259
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Fan Jiang
>Assignee: Fan Jiang
>  Labels: features
>
> In recent years, power iteration clustering has become one of the most 
> popular modern clustering algorithms. It is simple to implement, can be 
> solved efficiently by standard linear algebra software, and very often 
> outperforms traditional clustering algorithms such as the k-means algorithm.
> Power iteration clustering is a scalable and efficient algorithm for 
> clustering points given pointwise mutual affinity values. Internally the 
> algorithm:
> * computes the Gaussian distance between all pairs of points and represents 
> these distances in an Affinity Matrix
> * calculates a Normalized Affinity Matrix
> * calculates the principal eigenvalue and eigenvector
> * clusters each of the input points according to their principal eigenvector 
> component value
> Details of this algorithm are found in [Power Iteration Clustering, Lin 
> and Cohen|http://www.icml2010.org/papers/387.pdf]
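
As a rough illustration of the steps quoted above (Gaussian affinities, 
normalization, dominant eigenvector), a toy sketch in plain Scala, not the proposed 
MLlib implementation:

{code}
// Toy sketch: build Gaussian affinities, row-normalize them, and run power
// iteration to approximate the dominant eigenvector; the entries of that
// vector are then what the points get clustered on (e.g. with 1-D k-means).
def gaussianAffinity(points: Array[Array[Double]], sigma: Double): Array[Array[Double]] =
  points.map { p =>
    points.map { q =>
      val d2 = p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum
      math.exp(-d2 / (2 * sigma * sigma))
    }
  }

def dominantEigenvector(affinity: Array[Array[Double]], iters: Int): Array[Double] = {
  val w = affinity.map { row => val s = row.sum; row.map(_ / s) } // normalize rows
  var v = Array.fill(w.length)(1.0 / w.length)
  for (_ <- 1 to iters) {
    val next = w.map(row => row.zip(v).map { case (a, b) => a * b }.sum)
    val s = next.map(math.abs).sum
    v = next.map(_ / s)
  }
  v
}
{code}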



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3147) Implement A/B testing

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3147:
-
Target Version/s: 1.4.0

> Implement A/B testing
> -
>
> Key: SPARK-3147
> URL: https://issues.apache.org/jira/browse/SPARK-3147
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, Streaming
>Reporter: Xiangrui Meng
>
> A/B testing is widely used to compare online models. We can implement A/B 
> testing in MLlib and integrate it with Spark Streaming. For example, we have 
> a PairDStream[String, Double], whose keys are model ids and values are 
> observations (click or not, or revenue associated with the event). With A/B 
> testing, we can tell whether one model is significantly better than another 
> at a certain time. There are some caveats. For example, we should avoid 
> multiple testing and support A/A testing as a sanity check.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4259) Add Power Iteration Clustering Algorithm with Gaussian Similarity Function

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4259:
-
Target Version/s: 1.3.0

> Add Power Iteration Clustering Algorithm with Gaussian Similarity Function
> --
>
> Key: SPARK-4259
> URL: https://issues.apache.org/jira/browse/SPARK-4259
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Fan Jiang
>Assignee: Fan Jiang
>  Labels: features
>
> In recent years, power iteration clustering has become one of the most 
> popular modern clustering algorithms. It is simple to implement, can be 
> solved efficiently by standard linear algebra software, and very often 
> outperforms traditional clustering algorithms such as the k-means algorithm.
> Power iteration clustering is a scalable and efficient algorithm for 
> clustering points given pointwise mutual affinity values. Internally the 
> algorithm:
> * computes the Gaussian distance between all pairs of points and represents 
> these distances in an Affinity Matrix
> * calculates a Normalized Affinity Matrix
> * calculates the principal eigenvalue and eigenvector
> * clusters each of the input points according to their principal eigenvector 
> component value
> Details of this algorithm are found in [Power Iteration Clustering, Lin 
> and Cohen|http://www.icml2010.org/papers/387.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1405:
-
Assignee: Joseph K. Bradley  (was: Guoqiang Li)

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Joseph K. Bradley
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms 
> in MLlib, which use optimization algorithms such as gradient descent, 
> LDA uses inference algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (solved yet), a word segmentation step (imported from 
> Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-3996) Shade Jetty in Spark deliverables

2015-01-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reopened SPARK-3996:


This was causing compiler failures in the master build, so I reverted it. I 
think it's the same issue we had with the guava patch, so I just need to go and 
add explicit dependencies.

> Shade Jetty in Spark deliverables
> -
>
> Key: SPARK-3996
> URL: https://issues.apache.org/jira/browse/SPARK-3996
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Mingyu Kim
>Assignee: Patrick Wendell
> Fix For: 1.3.0
>
>
> We'd like to use Spark in a Jetty 9 server, and it's causing a version 
> conflict. Given that Spark's dependency on Jetty is light, it'd be a good 
> idea to shade this dependency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4349) Spark driver hangs on sc.parallelize() if exception is thrown during serialization

2015-01-29 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298017#comment-14298017
 ] 

Matt Cheah edited comment on SPARK-4349 at 1/30/15 1:12 AM:


Whoops, this was fixed by SPARK-4737.


was (Author: mcheah):
Whoops, this was fixed by SPARK-4737. Someone want to close this?

> Spark driver hangs on sc.parallelize() if exception is thrown during 
> serialization
> --
>
> Key: SPARK-4349
> URL: https://issues.apache.org/jira/browse/SPARK-4349
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Matt Cheah
>Priority: Critical
>
> Executing the following in the Spark Shell will lead to the Spark Shell 
> hanging after a stack trace is printed. The serializer is set to the Kryo 
> serializer.
> {code}
> scala> import com.esotericsoftware.kryo.io.Input
> import com.esotericsoftware.kryo.io.Input
> scala> import com.esotericsoftware.kryo.io.Output
> import com.esotericsoftware.kryo.io.Output
> scala> class MyKryoSerializable extends 
> com.esotericsoftware.kryo.KryoSerializable { def write (kryo: 
> com.esotericsoftware.kryo.Kryo, output: Output) { throw new 
> com.esotericsoftware.kryo.KryoException; } ; def read (kryo: 
> com.esotericsoftware.kryo.Kryo, input: Input) { throw new 
> com.esotericsoftware.kryo.KryoException; } }
> defined class MyKryoSerializable
> scala> sc.parallelize(Seq(new MyKryoSerializable, new 
> MyKryoSerializable)).collect
> {code}
> A stack trace is printed during serialization as expected, but another stack 
> trace is printed afterwards, indicating that the driver can't recover:
> {code}
> 14/11/11 14:10:03 ERROR OneForOneStrategy: actor name [ExecutorActor] is not 
> unique!
> akka.actor.PostRestartException: exception post restart (class 
> java.io.IOException)
>   at 
> akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:249)
>   at 
> akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:247)
>   at 
> akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:302)
>   at 
> akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:297)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>   at 
> akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:247)
>   at 
> akka.actor.dungeon.FaultHandling$class.faultRecreate(FaultHandling.scala:76)
>   at akka.actor.ActorCell.faultRecreate(ActorCell.scala:369)
>   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:459)
>   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
>   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: akka.actor.InvalidActorNameException: actor name [ExecutorActor] 
> is not unique!
>   at 
> akka.actor.dungeon.ChildrenContainer$NormalChildrenContainer.reserve(ChildrenContainer.scala:130)
>   at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77)
>   at akka.actor.ActorCell.reserveChild(ActorCell.scala:369)
>   at akka.actor.dungeon.Children$class.makeChild(Children.scala:202)
>   at akka.actor.dungeon.Children$class.attachChild(Children.scala:42)
>   at akka.actor.ActorCell.attachChild(ActorCell.scala:369)
>   at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:552)
>   at org.apache.spark.executor.Executor.(Executor.scala:97)
>   at 
> org.apache.spark.scheduler.local.LocalActor.(LocalBackend.scala:53)
>   at 
> org.apache.spark.scheduler.local.LocalBackend$$anonfun$start$1.apply(LocalBackend.scala:96)
>   at 
> org.apache.spark.scheduler.local.LocalBackend$$anonfun$start$1.apply(LocalBackend.scala:96)
>   at akka.actor.TypedCreatorFunctionConsumer.produce(Props.scala:343)
>   at akka.actor.Props.newActor(Props.scala:252)
>   at akka.actor.ActorCell.newActor(ActorCell.scala:552)
> 

[jira] [Updated] (SPARK-5399) tree Losses strings should match loss names

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5399:
-
Assignee: Kai Sasaki

> tree Losses strings should match loss names
> ---
>
> Key: SPARK-5399
> URL: https://issues.apache.org/jira/browse/SPARK-5399
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0, 1.2.1
>Reporter: Joseph K. Bradley
>Assignee: Kai Sasaki
>Priority: Minor
>
> tree.loss.Losses.fromString expects certain String names for losses.  These 
> do not match the names of the loss classes but should.  I believe these 
> strings were the original names of the losses, and we forgot to correct the 
> strings when we renamed the losses.
> Currently:
> {code}
> case "leastSquaresError" => SquaredError
> case "leastAbsoluteError" => AbsoluteError
> case "logLoss" => LogLoss
> {code}
> Proposed:
> {code}
> case "SquaredError" => SquaredError
> case "AbsoluteError" => AbsoluteError
> case "LogLoss" => LogLoss
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4349) Spark driver hangs on sc.parallelize() if exception is thrown during serialization

2015-01-29 Thread Matt Cheah (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah closed SPARK-4349.
-
Resolution: Fixed

> Spark driver hangs on sc.parallelize() if exception is thrown during 
> serialization
> --
>
> Key: SPARK-4349
> URL: https://issues.apache.org/jira/browse/SPARK-4349
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Matt Cheah
>Priority: Critical
>
> Executing the following in the Spark Shell will lead to the Spark Shell 
> hanging after a stack trace is printed. The serializer is set to the Kryo 
> serializer.
> {code}
> scala> import com.esotericsoftware.kryo.io.Input
> import com.esotericsoftware.kryo.io.Input
> scala> import com.esotericsoftware.kryo.io.Output
> import com.esotericsoftware.kryo.io.Output
> scala> class MyKryoSerializable extends 
> com.esotericsoftware.kryo.KryoSerializable { def write (kryo: 
> com.esotericsoftware.kryo.Kryo, output: Output) { throw new 
> com.esotericsoftware.kryo.KryoException; } ; def read (kryo: 
> com.esotericsoftware.kryo.Kryo, input: Input) { throw new 
> com.esotericsoftware.kryo.KryoException; } }
> defined class MyKryoSerializable
> scala> sc.parallelize(Seq(new MyKryoSerializable, new 
> MyKryoSerializable)).collect
> {code}
> A stack trace is printed during serialization as expected, but another stack 
> trace is printed afterwards, indicating that the driver can't recover:
> {code}
> 14/11/11 14:10:03 ERROR OneForOneStrategy: actor name [ExecutorActor] is not 
> unique!
> akka.actor.PostRestartException: exception post restart (class 
> java.io.IOException)
>   at 
> akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:249)
>   at 
> akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:247)
>   at 
> akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:302)
>   at 
> akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:297)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>   at 
> akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:247)
>   at 
> akka.actor.dungeon.FaultHandling$class.faultRecreate(FaultHandling.scala:76)
>   at akka.actor.ActorCell.faultRecreate(ActorCell.scala:369)
>   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:459)
>   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
>   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: akka.actor.InvalidActorNameException: actor name [ExecutorActor] 
> is not unique!
>   at 
> akka.actor.dungeon.ChildrenContainer$NormalChildrenContainer.reserve(ChildrenContainer.scala:130)
>   at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77)
>   at akka.actor.ActorCell.reserveChild(ActorCell.scala:369)
>   at akka.actor.dungeon.Children$class.makeChild(Children.scala:202)
>   at akka.actor.dungeon.Children$class.attachChild(Children.scala:42)
>   at akka.actor.ActorCell.attachChild(ActorCell.scala:369)
>   at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:552)
>   at org.apache.spark.executor.Executor.(Executor.scala:97)
>   at 
> org.apache.spark.scheduler.local.LocalActor.(LocalBackend.scala:53)
>   at 
> org.apache.spark.scheduler.local.LocalBackend$$anonfun$start$1.apply(LocalBackend.scala:96)
>   at 
> org.apache.spark.scheduler.local.LocalBackend$$anonfun$start$1.apply(LocalBackend.scala:96)
>   at akka.actor.TypedCreatorFunctionConsumer.produce(Props.scala:343)
>   at akka.actor.Props.newActor(Props.scala:252)
>   at akka.actor.ActorCell.newActor(ActorCell.scala:552)
>   at 
> akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:234)
>   ... 11 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

--

[jira] [Commented] (SPARK-4349) Spark driver hangs on sc.parallelize() if exception is thrown during serialization

2015-01-29 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298017#comment-14298017
 ] 

Matt Cheah commented on SPARK-4349:
---

Whoops, this was fixed by SPARK-4737. Someone want to close this?

> Spark driver hangs on sc.parallelize() if exception is thrown during 
> serialization
> --
>
> Key: SPARK-4349
> URL: https://issues.apache.org/jira/browse/SPARK-4349
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Matt Cheah
>Priority: Critical
>
> Executing the following in the Spark Shell will lead to the Spark Shell 
> hanging after a stack trace is printed. The serializer is set to the Kryo 
> serializer.
> {code}
> scala> import com.esotericsoftware.kryo.io.Input
> import com.esotericsoftware.kryo.io.Input
> scala> import com.esotericsoftware.kryo.io.Output
> import com.esotericsoftware.kryo.io.Output
> scala> class MyKryoSerializable extends 
> com.esotericsoftware.kryo.KryoSerializable { def write (kryo: 
> com.esotericsoftware.kryo.Kryo, output: Output) { throw new 
> com.esotericsoftware.kryo.KryoException; } ; def read (kryo: 
> com.esotericsoftware.kryo.Kryo, input: Input) { throw new 
> com.esotericsoftware.kryo.KryoException; } }
> defined class MyKryoSerializable
> scala> sc.parallelize(Seq(new MyKryoSerializable, new 
> MyKryoSerializable)).collect
> {code}
> A stack trace is printed during serialization as expected, but another stack 
> trace is printed afterwards, indicating that the driver can't recover:
> {code}
> 14/11/11 14:10:03 ERROR OneForOneStrategy: actor name [ExecutorActor] is not 
> unique!
> akka.actor.PostRestartException: exception post restart (class 
> java.io.IOException)
>   at 
> akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:249)
>   at 
> akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:247)
>   at 
> akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:302)
>   at 
> akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:297)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>   at 
> akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:247)
>   at 
> akka.actor.dungeon.FaultHandling$class.faultRecreate(FaultHandling.scala:76)
>   at akka.actor.ActorCell.faultRecreate(ActorCell.scala:369)
>   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:459)
>   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
>   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: akka.actor.InvalidActorNameException: actor name [ExecutorActor] 
> is not unique!
>   at 
> akka.actor.dungeon.ChildrenContainer$NormalChildrenContainer.reserve(ChildrenContainer.scala:130)
>   at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77)
>   at akka.actor.ActorCell.reserveChild(ActorCell.scala:369)
>   at akka.actor.dungeon.Children$class.makeChild(Children.scala:202)
>   at akka.actor.dungeon.Children$class.attachChild(Children.scala:42)
>   at akka.actor.ActorCell.attachChild(ActorCell.scala:369)
>   at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:552)
>   at org.apache.spark.executor.Executor.(Executor.scala:97)
>   at 
> org.apache.spark.scheduler.local.LocalActor.(LocalBackend.scala:53)
>   at 
> org.apache.spark.scheduler.local.LocalBackend$$anonfun$start$1.apply(LocalBackend.scala:96)
>   at 
> org.apache.spark.scheduler.local.LocalBackend$$anonfun$start$1.apply(LocalBackend.scala:96)
>   at akka.actor.TypedCreatorFunctionConsumer.produce(Props.scala:343)
>   at akka.actor.Props.newActor(Props.scala:252)
>   at akka.actor.ActorCell.newActor(ActorCell.scala:552)
>   at 
> akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:234)
>   ... 11 more

[jira] [Updated] (SPARK-4118) Create python bindings for Streaming KMeans

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4118:
-
Target Version/s:   (was: 1.3.0)

> Create python bindings for Streaming KMeans
> ---
>
> Key: SPARK-4118
> URL: https://issues.apache.org/jira/browse/SPARK-4118
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark, Streaming
>Reporter: Anant Daksh Asthana
>Priority: Minor
>
> Create Python bindings for Streaming K-means
> This is in reference to https://issues.apache.org/jira/browse/SPARK-3254
> which adds Streaming K-means functionality to MLLib.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5101) Add common ML math functions

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5101:
-
Target Version/s:   (was: 1.3.0)

> Add common ML math functions
> 
>
> Key: SPARK-5101
> URL: https://issues.apache.org/jira/browse/SPARK-5101
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>Priority: Minor
>
> We can add common ML math functions to MLlib. It may be a little tricky to 
> implement those functions in a numerically stable way. For example,
> {code}
> math.log(1 + math.exp(x))
> {code}
> should be implemented as
> {code}
> if (x > 0) {
>   x + math.log1p(math.exp(-x))
> } else {
>   math.log1p(math.exp(x))
> }
> {code}
> It becomes hard to maintain if we have multiple copies of the correct 
> implementation in the codebase. A good place for those functions could be 
> `mllib.util.MathFunctions`.
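A minimal Scala sketch of what such a helper could look like, assuming a hypothetical `MathFunctions` object and a `log1pExp` method name (illustrative only, not the final MLlib API):

{code}
object MathFunctions {
  /** Numerically stable log(1 + exp(x)).
    * For large positive x, math.exp(x) overflows, so factor x out first. */
  def log1pExp(x: Double): Double = {
    if (x > 0) {
      x + math.log1p(math.exp(-x))
    } else {
      math.log1p(math.exp(x))
    }
  }
}
{code}

For example, `MathFunctions.log1pExp(800.0)` returns 800.0, whereas the naive `math.log(1 + math.exp(800.0))` overflows to positive infinity.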



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3188) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3188:
-
Target Version/s: 1.4.0  (was: 1.3.0)

> Add Robust Regression Algorithm with Tukey bisquare weight  function 
> (Biweight Estimates) 
> --
>
> Key: SPARK-3188
> URL: https://issues.apache.org/jira/browse/SPARK-3188
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Fan Jiang
>Assignee: Fan Jiang
>Priority: Minor
>  Labels: features
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Linear least squares estimates assume normally distributed errors and can 
> behave badly when the errors are heavy-tailed. In practice we encounter many 
> kinds of data, so we need to include robust regression to provide a fitting 
> criterion that is not as vulnerable to outliers as least squares.
> The Tukey bisquare weight function, also referred to as the biweight 
> function, produces an M-estimator that is more resistant to regression 
> outliers than the Huber M-estimator (Andersen 2008: 19).
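For reference, a minimal Scala sketch of the Tukey bisquare (biweight) weight function itself; the function name and the default tuning constant c = 4.685 (the usual choice for roughly 95% asymptotic efficiency under Gaussian errors) are assumptions for illustration, not part of the proposed MLlib API:

{code}
/** Tukey bisquare weight for a residual e with scale estimate sigma:
  * w(u) = (1 - u^2)^2 for |u| < 1 and 0 otherwise, where u = e / (c * sigma). */
def tukeyBisquareWeight(residual: Double, sigma: Double, c: Double = 4.685): Double = {
  val u = residual / (c * sigma)
  if (math.abs(u) < 1.0) {
    val t = 1.0 - u * u
    t * t
  } else {
    0.0
  }
}
{code}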



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5012) Python API for Gaussian Mixture Model

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5012:
-
Priority: Critical  (was: Major)

> Python API for Gaussian Mixture Model
> -
>
> Key: SPARK-5012
> URL: https://issues.apache.org/jira/browse/SPARK-5012
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Meethu Mathew
>Priority: Critical
>
> Add Python API for the Scala implementation of GMM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5094) Python API for gradient-boosted trees

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5094:
-
Priority: Critical  (was: Major)

> Python API for gradient-boosted trees
> -
>
> Key: SPARK-5094
> URL: https://issues.apache.org/jira/browse/SPARK-5094
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Kazuki Taniguchi
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4240:
-
Target Version/s:   (was: 1.3.0)

> Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
> 
>
> Key: SPARK-4240
> URL: https://issues.apache.org/jira/browse/SPARK-4240
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Sung Chung
>
> Gradient boosting, as currently implemented, estimates the loss gradient at 
> each iteration using regression trees. At every iteration, the regression 
> trees are trained/split to minimize the variance of the predicted gradient. 
> Additionally, the terminal node predictions are computed to minimize the 
> prediction variance.
> However, such predictions are not optimal for loss functions other than 
> mean-squared error. The tree-boosting refinement can mitigate this issue by 
> modifying the terminal node prediction values so that they directly minimize 
> the actual loss function. Although the tree splits are still chosen through 
> variance reduction, this refinement should improve the gradient estimates 
> and thus overall performance.
> The details of this can be found in the R vignette. This paper also shows how 
> to refine the terminal node predictions.
> http://www.saedsayad.com/docs/gbm2.pdf
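As a concrete illustration of the refinement (an assumption for exposition, not the proposed implementation): for absolute-error loss, the variance-minimizing leaf value is the mean of the residuals in the leaf, while the loss-minimizing value is their median, so the refinement would replace the former with the latter.

{code}
/** Refined terminal-node prediction for L1 (absolute-error) loss:
  * the median of the residuals falling into the leaf. */
def refineLeafForL1(residuals: Seq[Double]): Double = {
  require(residuals.nonEmpty, "leaf must contain at least one residual")
  val sorted = residuals.sorted
  val n = sorted.length
  if (n % 2 == 1) sorted(n / 2)
  else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
}
{code}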



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4036:
-
Assignee: Kai Sasaki

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
>
> Conditional random fields (CRFs) are a class of statistical modelling methods 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-01-29 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298012#comment-14298012
 ] 

Xiangrui Meng commented on SPARK-4036:
--

[~lewuathe] I've assigned this ticket to you. Before sending any PR, could you 
share a design doc first? This is a broad topic; we should discuss algorithm 
choices, complexity and scalability, and public APIs before digging into the 
implementation.

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>
> Conditional random fields (CRFs) are a class of statistical modelling methods 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4036:
-
Target Version/s:   (was: 1.3.0)

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
>
> Conditional random fields (CRFs) are a class of statistical modelling methods 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3181:
-
Target Version/s: 1.4.0  (was: 1.3.0)

> Add Robust Regression Algorithm with Huber Estimator
> 
>
> Key: SPARK-3181
> URL: https://issues.apache.org/jira/browse/SPARK-3181
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Fan Jiang
>Assignee: Fan Jiang
>  Labels: features
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Linear least squares estimates assume normally distributed errors and can 
> behave badly when the errors are heavy-tailed. In practice we encounter many 
> kinds of data, so we need to include robust regression to provide a fitting 
> criterion that is not as vulnerable to outliers as least squares.
> In 1973, Huber introduced M-estimation for regression, where the "M" stands 
> for "maximum likelihood type". The method is resistant to outliers in the 
> response variable and has been widely used.
> The new feature for MLlib will contain 3 new files
> /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
> /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
> /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala
> and one new class HuberRobustGradient in 
> /main/scala/org/apache/spark/mllib/optimization/Gradient.scala
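For reference, a minimal Scala sketch of the Huber loss on a single residual; the default threshold 1.345 is the conventional tuning constant, and the function name is illustrative, not the proposed HuberRobustGradient class:

{code}
/** Huber loss: quadratic for |e| <= delta, linear beyond delta,
  * which limits the influence of large residuals (outliers). */
def huberLoss(residual: Double, delta: Double = 1.345): Double = {
  val a = math.abs(residual)
  if (a <= delta) 0.5 * residual * residual
  else delta * (a - 0.5 * delta)
}
{code}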



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5486) Add validate function for BlockMatrix

2015-01-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5486:
-
Priority: Major  (was: Critical)

> Add validate function for BlockMatrix
> -
>
> Key: SPARK-5486
> URL: https://issues.apache.org/jira/browse/SPARK-5486
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Burak Yavuz
>
> BlockMatrix needs a validate method to make debugging easy for users. 
> It will be an expensive operation, but it would be useful for users 
> to know why `multiply` or `add` didn't work properly.
> Things to validate:
> - MatrixBlocks that are not on the right or bottom edges should have 
> dimensions `rowsPerBlock` x `colsPerBlock`.
> - There should be at most one block for each (row, column) index.
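A rough Scala sketch of the two checks above, written against a simplified in-memory view of the blocks as `((blockRow, blockCol), (numRows, numCols))` pairs; the real BlockMatrix keeps its blocks in an RDD, so this only illustrates the validation logic, not the actual method:

{code}
def validateBlocks(
    blocks: Seq[((Int, Int), (Int, Int))],
    numRowBlocks: Int,
    numColBlocks: Int,
    rowsPerBlock: Int,
    colsPerBlock: Int): Unit = {
  // There should be at most one block for each (blockRow, blockCol) index.
  val duplicated = blocks.groupBy(_._1).filter(_._2.size > 1).keys
  require(duplicated.isEmpty, s"Found multiple blocks at indices: ${duplicated.mkString(", ")}")
  // Blocks not on the right/bottom edge must be exactly rowsPerBlock x colsPerBlock.
  blocks.foreach { case ((i, j), (rows, cols)) =>
    if (i < numRowBlocks - 1) {
      require(rows == rowsPerBlock, s"Block ($i, $j) has $rows rows, expected $rowsPerBlock")
    }
    if (j < numColBlocks - 1) {
      require(cols == colsPerBlock, s"Block ($i, $j) has $cols columns, expected $colsPerBlock")
    }
  }
}
{code}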



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5420) Cross-language load/store functions for creating and saving DataFrames

2015-01-29 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297987#comment-14297987
 ] 

Michael Armbrust commented on SPARK-5420:
-

Here are the dimensions that I think we need to consider:

- Save mode: ErrorIfExisting, Overwrite, or Append.
- Create a temp table, a metastore table, or no table.
- Specify a data source name, or use a default (from a config option that 
defaults to parquet).

Reynold also suggests a shorthand for working with file-based data sources that 
obviates the need to pass "path" -> path. (A rough sketch of these dimensions 
follows below.)
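A purely illustrative Scala sketch of how these dimensions could be expressed; the names (`SaveMode`, `TableType`, `storeData`) and parameters are assumptions for discussion, not the API that was agreed on:

{code}
sealed trait SaveMode
case object ErrorIfExisting extends SaveMode
case object Overwrite extends SaveMode
case object Append extends SaveMode

sealed trait TableType
case object TempTable extends TableType
case object MetastoreTable extends TableType
case object NoTable extends TableType

/** Sketch only: source = None means "use the default data source (e.g. parquet)". */
def storeData(
    source: Option[String],
    mode: SaveMode,
    table: TableType,
    options: Map[String, String]): Unit = {
  val src = source.getOrElse("parquet")
  println(s"Would write with source=$src, mode=$mode, table=$table, options=$options")
}
{code}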

> Cross-language load/store functions for creating and saving DataFrames
> --
>
> Key: SPARK-5420
> URL: https://issues.apache.org/jira/browse/SPARK-5420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Patrick Wendell
>Assignee: Yin Huai
>Priority: Blocker
>
> We should have standard APIs for loading a table from, or saving a table to, 
> a data store. Per comment discussion:
> {code}
> def loadData(datasource: String, parameters: Map[String, String]): DataFrame
> def loadData(datasource: String, parameters: java.util.Map[String, String]): 
> DataFrame
> def storeData(datasource: String, parameters: Map[String, String]): DataFrame
> def storeData(datasource: String, parameters: java.util.Map[String, String]): 
> DataFrame
> {code}
> Python should have this too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5472) Add support for reading from and writing to a JDBC database

2015-01-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5472:

Priority: Blocker  (was: Minor)

> Add support for reading from and writing to a JDBC database
> ---
>
> Key: SPARK-5472
> URL: https://issues.apache.org/jira/browse/SPARK-5472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tor Myklebust
>Priority: Blocker
>
> It would be nice to be able to make a table in a JDBC database appear as a 
> table in Spark SQL.  This would let users, for instance, perform a JOIN 
> between a DataFrame in Spark SQL and a table in a Postgres database.
> It might also be nice to be able to go the other direction -- save a 
> DataFrame to a database -- for instance in an ETL job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5472) Add support for reading from and writing to a JDBC database

2015-01-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5472:

Assignee: Tor Myklebust

> Add support for reading from and writing to a JDBC database
> ---
>
> Key: SPARK-5472
> URL: https://issues.apache.org/jira/browse/SPARK-5472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tor Myklebust
>Assignee: Tor Myklebust
>Priority: Blocker
>
> It would be nice to be able to make a table in a JDBC database appear as a 
> table in Spark SQL.  This would let users, for instance, perform a JOIN 
> between a DataFrame in Spark SQL and a table in a Postgres database.
> It might also be nice to be able to go the other direction -- save a 
> DataFrame to a database -- for instance in an ETL job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5472) Add support for reading from and writing to a JDBC database

2015-01-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5472:

Target Version/s: 1.3.0

> Add support for reading from and writing to a JDBC database
> ---
>
> Key: SPARK-5472
> URL: https://issues.apache.org/jira/browse/SPARK-5472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tor Myklebust
>Priority: Blocker
>
> It would be nice to be able to make a table in a JDBC database appear as a 
> table in Spark SQL.  This would let users, for instance, perform a JOIN 
> between a DataFrame in Spark SQL and a table in a Postgres database.
> It might also be nice to be able to go the other direction -- save a 
> DataFrame to a database -- for instance in an ETL job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4959) Attributes are case sensitive when using a select query from a projection

2015-01-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4959.
-
Resolution: Fixed

> Attributes are case sensitive when using a select query from a projection
> -
>
> Key: SPARK-4959
> URL: https://issues.apache.org/jira/browse/SPARK-4959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Andy Konwinski
>Assignee: Cheng Hao
>Priority: Blocker
>  Labels: backport-needed
> Fix For: 1.3.0, 1.2.1
>
>
> Per [~marmbrus], see this line of code, where we should be using an attribute 
> map
>  
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L147
> To reproduce, i ran the following in the Spark shell:
> {code}
> import sqlContext._
> sql("drop table if exists test")
> sql("create table test (col1 string)")
> sql("""insert into table test select "hi" from prejoined limit 1""")
> val projection = "col1".attr.as(Symbol("CaseSensitiveColName")) :: 
> "col1".attr.as(Symbol("CaseSensitiveColName2")) :: Nil
> sqlContext.table("test").select(projection:_*).registerTempTable("test2")
> # This succeeds.
> sql("select CaseSensitiveColName from test2").first()
> # This fails with java.util.NoSuchElementException: key not found: 
> casesensitivecolname#23046
> sql("select casesensitivecolname from test2").first()
> {code}
> The full stack trace printed for the final command that is failing: 
> {code}
> java.util.NoSuchElementException: key not found: casesensitivecolname#23046
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScan.(HiveTableScan.scala:57)
>   at 
> org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
>   at 
> org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
>   at 
> org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:378)
>   at 
> org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:217)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:285)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
>   at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
>   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:446)
>   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:108)
>   at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3778) newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn

2015-01-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3778:
---
Priority: Critical  (was: Major)

> newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn
> -
>
> Key: SPARK-3778
> URL: https://issues.apache.org/jira/browse/SPARK-3778
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Critical
>
> The newAPIHadoopRDD routine doesn't properly add the credentials to the conf 
> to be able to access secure hdfs.
> Note that newAPIHadoopFile does handle this, because 
> org.apache.hadoop.mapreduce.Job automatically adds the credentials for you.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3778) newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn

2015-01-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3778:
---
Target Version/s: 1.3.0  (was: 1.1.1, 1.2.0)

> newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn
> -
>
> Key: SPARK-3778
> URL: https://issues.apache.org/jira/browse/SPARK-3778
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> The newAPIHadoopRDD routine doesn't properly add the credentials to the conf 
> to be able to access secure hdfs.
> Note that newAPIHadoopFile does handle this, because 
> org.apache.hadoop.mapreduce.Job automatically adds the credentials for you.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3996) Shade Jetty in Spark deliverables

2015-01-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3996.

   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Patrick Wendell  (was: Matthew Cheah)

Okay, we merged this into master; let's see how it goes.

> Shade Jetty in Spark deliverables
> -
>
> Key: SPARK-3996
> URL: https://issues.apache.org/jira/browse/SPARK-3996
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Mingyu Kim
>Assignee: Patrick Wendell
> Fix For: 1.3.0
>
>
> We'd like to use Spark in a Jetty 9 server, and it's causing a version 
> conflict. Given that Spark's dependency on Jetty is light, it'd be a good 
> idea to shade this dependency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5462) Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in DataFrames returned from sqlCtx.sql()

2015-01-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297956#comment-14297956
 ] 

Apache Spark commented on SPARK-5462:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4282

> Catalyst UnresolvedException "Invalid call to qualifiers on unresolved 
> object" error when accessing fields in DataFrames returned from sqlCtx.sql()
> ---
>
> Key: SPARK-5462
> URL: https://issues.apache.org/jira/browse/SPARK-5462
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> When trying to access fields on a Python DataFrame created via inferSchema, I 
> ran into a confusing Catalyst Py4J error.  Here's a reproduction:
> {code}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext, Row
> sc = SparkContext("local", "test")
> sqlContext = SQLContext(sc)
> # Load a text file and convert each line to a Row.
> lines = sc.textFile("examples/src/main/resources/people.txt")
> parts = lines.map(lambda l: l.split(","))
> people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
> # Infer the schema, and register the SchemaRDD as a table.
> schemaPeople = sqlContext.inferSchema(people)
> schemaPeople.registerTempTable("people")
> # SQL can be run over SchemaRDDs that have been registered as a table.
> teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age 
> <= 19")
> print teenagers.name
> {code}
> This fails with the following error:
> {code}
> Traceback (most recent call last):
>   File "/Users/joshrosen/Documents/spark/sqltest.py", line 19, in 
> print teenagers.name
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, 
> in __getattr__
> return Column(self._jdf.apply(name))
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o66.apply.
> : org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> qualifiers on unresolved object, tree: 'name
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:46)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:143)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:140)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:140)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This is distinct from the helpful error message that I get when trying to 
> access a non-existent column.  This error didn't occur when I tried the same 
> thing with a DataFrame created via jsonRDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-5489) KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create (I)Lscala/runtime/IntRef;

2015-01-29 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297951#comment-14297951
 ] 

DeepakVohra commented on SPARK-5489:


Sean,

Some dependency is making use of scala.runtime.IntRef.create, which was 
introduced in Scala 2.11.  
https://github.com/scala/scala/blob/v2.11.0/src/library/scala/runtime/IntRef.java

Scala 2.10.4, which is included with Spark 1.2, does not include the 
scala.runtime.IntRef.create method.
https://github.com/scala/scala/blob/v2.10.4/src/library/scala/runtime/IntRef.java

thanks,
Deepak

> KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create  
> (I)Lscala/runtime/IntRef;
> -
>
> Key: SPARK-5489
> URL: https://issues.apache.org/jira/browse/SPARK-5489
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Spark 1.2 
> Maven
>Reporter: DeepakVohra
>
> The KMeans clustering generates the following error, which also seems to be 
> due to a version mismatch between the Scala version used to compile Spark and 
> the Scala version pulled in by the Spark 1.2 Maven dependency. 
> Exception in thread "main" java.lang.NoSuchMethodError: 
> scala.runtime.IntRef.create
> (I)Lscala/runtime/IntRef;
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:282)
>   at 
> org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:155)
>   at 
> org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:132)
>   at 
> org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:352)
>   at 
> org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:362)
>   at 
> org.apache.spark.mllib.clustering.KMeans.train(KMeans.scala)
>   at 
> clusterer.kmeans.KMeansClusterer.main(KMeansClusterer.java:35)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5424) Make the new ALS implementation take generic ID types

2015-01-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297949#comment-14297949
 ] 

Apache Spark commented on SPARK-5424:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/4281

> Make the new ALS implementation take generic ID types
> -
>
> Key: SPARK-5424
> URL: https://issues.apache.org/jira/browse/SPARK-5424
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> The new implementation uses local indices of users and items. So the input 
> user/item type could be generic, at least specialized for Int and Long. We 
> can expose the generic interface as a developer API.
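A one-line Scala sketch of what a generic-ID rating type could look like, specialized for Int and Long as mentioned above; the class name and fields are illustrative, not necessarily the final developer API:

{code}
case class Rating[@specialized(Int, Long) ID](user: ID, item: ID, rating: Float)
{code}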



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5464) Calling help() on a Python DataFrame fails with "cannot resolve column name __name__" error

2015-01-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-5464.

   Resolution: Fixed
Fix Version/s: 1.3.0

> Calling help() on a Python DataFrame fails with "cannot resolve column name 
> __name__" error
> ---
>
> Key: SPARK-5464
> URL: https://issues.apache.org/jira/browse/SPARK-5464
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
> Fix For: 1.3.0
>
>
> Trying to call {{help()}} on a Python DataFrame fails with an exception:
> {code}
> >>> help(df)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/joshrosen/anaconda/lib/python2.7/site.py", line 464, in 
> __call__
> return pydoc.help(*args, **kwds)
>   File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1787, in 
> __call__
> self.help(request)
>   File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1834, in help
> else: doc(request, 'Help on %s:')
>   File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1571, in doc
> pager(render_doc(thing, title, forceload))
>   File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1545, in 
> render_doc
> object, name = resolve(thing, forceload)
>   File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1540, in 
> resolve
> name = getattr(thing, '__name__', None)
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, 
> in __getattr__
> return Column(self._jdf.apply(name))
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o31.apply.
> : java.lang.RuntimeException: Cannot resolve column name "__name__"
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:123)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:123)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Here's a reproduction:
> {code}
> >>> from pyspark.sql import SQLContext, Row
> >>> sqlContext = SQLContext(sc)
> >>> rdd = sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}'])
> >>> df = sqlContext.jsonRDD(rdd)
> >>> help(df)
> {code}
> I think the problem here is that we don't throw the expected exception from 
> our overloaded {{getattr}} if a column can't be found.
> We should be able to fix this by only attempting to call {{apply}} after 
> checking that the column name is valid (e.g. check against {{columns}}).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5489) KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create (I)Lscala/runtime/IntRef;

2015-01-29 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297937#comment-14297937
 ] 

DeepakVohra commented on SPARK-5489:


Sean,

Made the Scala version the same, but still getting the error.
"For the Scala API, Spark 1.2.0 uses Scala 2.10. "
http://spark.apache.org/docs/1.2.0/

Made Maven dependencies Scala version also 2.10.


<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.2.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-compiler</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.11</artifactId>
  <version>1.2.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-compiler</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<dependency>
  <groupId>org.scala-lang</groupId>
  <artifactId>scala-library</artifactId>
  <version>2.10.0</version>
</dependency>

<dependency>
  <groupId>org.scala-lang</groupId>
  <artifactId>scala-compiler</artifactId>
  <version>2.10.0</version>
</dependency>




thanks,
Deepak

> KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create  
> (I)Lscala/runtime/IntRef;
> -
>
> Key: SPARK-5489
> URL: https://issues.apache.org/jira/browse/SPARK-5489
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Spark 1.2 
> Maven
>Reporter: DeepakVohra
>
> The KMeans clustering generates the following error, which also seems to be 
> due to a version mismatch between the Scala version used to compile Spark and 
> the Scala version pulled in by the Spark 1.2 Maven dependency. 
> Exception in thread "main" java.lang.NoSuchMethodError: 
> scala.runtime.IntRef.create
> (I)Lscala/runtime/IntRef;
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:282)
>   at 
> org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:155)
>   at 
> org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:132)
>   at 
> org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:352)
>   at 
> org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:362)
>   at 
> org.apache.spark.mllib.clustering.KMeans.train(KMeans.scala)
>   at 
> clusterer.kmeans.KMeansClusterer.main(KMeansClusterer.java:35)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5483) java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;

2015-01-29 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297936#comment-14297936
 ] 

DeepakVohra commented on SPARK-5483:


Sean,

Made the Scala version the same, but still getting the error.

"For the Scala API, Spark 1.2.0 uses Scala 2.10. "
http://spark.apache.org/docs/1.2.0/

Made Maven dependencies Scala version also 2.10.


<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.2.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-compiler</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.11</artifactId>
  <version>1.2.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-compiler</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<dependency>
  <groupId>org.scala-lang</groupId>
  <artifactId>scala-library</artifactId>
  <version>2.10.0</version>
</dependency>

<dependency>
  <groupId>org.scala-lang</groupId>
  <artifactId>scala-compiler</artifactId>
  <version>2.10.0</version>
</dependency>




thanks,
Deepak

> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
> ---
>
> Key: SPARK-5483
> URL: https://issues.apache.org/jira/browse/SPARK-5483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Maven
> Spark 1.2
>Reporter: DeepakVohra
>
> The Naive Bayes classifier generates the following error.
> ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.(DenseVectorOps.scala:38)
>   at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
>   at breeze.linalg.DenseVector$.(DenseVector.scala:225)
>   at breeze.linalg.DenseVector$.(DenseVector.scala)
>   at breeze.linalg.DenseVector.(DenseVector.scala:63)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:50)
>   at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:55)
>   at org.apache.spark.mllib.linalg.DenseVector.toBreeze(Vectors.scala:329)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:112)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:110)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:200)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:199)
>   at 
> org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:142)
>   at 
> org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
>   at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:205)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/01/28 21:50:06 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Th

[jira] [Commented] (SPARK-5462) Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in DataFrames returned from sqlCtx.sql()

2015-01-29 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297935#comment-14297935
 ] 

Josh Rosen commented on SPARK-5462:
---

[~liancheng] [~marmbrus] Is this possibly related to SPARK-2063?

> Catalyst UnresolvedException "Invalid call to qualifiers on unresolved 
> object" error when accessing fields in DataFrames returned from sqlCtx.sql()
> ---
>
> Key: SPARK-5462
> URL: https://issues.apache.org/jira/browse/SPARK-5462
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> When trying to access fields on a Python DataFrame created via inferSchema, I 
> ran into a confusing Catalyst Py4J error.  Here's a reproduction:
> {code}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext, Row
> sc = SparkContext("local", "test")
> sqlContext = SQLContext(sc)
> # Load a text file and convert each line to a Row.
> lines = sc.textFile("examples/src/main/resources/people.txt")
> parts = lines.map(lambda l: l.split(","))
> people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
> # Infer the schema, and register the SchemaRDD as a table.
> schemaPeople = sqlContext.inferSchema(people)
> schemaPeople.registerTempTable("people")
> # SQL can be run over SchemaRDDs that have been registered as a table.
> teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age 
> <= 19")
> print teenagers.name
> {code}
> This fails with the following error:
> {code}
> Traceback (most recent call last):
>   File "/Users/joshrosen/Documents/spark/sqltest.py", line 19, in 
> print teenagers.name
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, 
> in __getattr__
> return Column(self._jdf.apply(name))
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o66.apply.
> : org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> qualifiers on unresolved object, tree: 'name
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:46)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:143)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:140)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:140)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This is distinct from the helpful error message that I get when trying to 
> access a non-existent column.  This error didn't occur when I tried the same 
> thing with a DataFrame created via jsonRDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-5462) Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in DataFrames returned from sqlCtx.sql()

2015-01-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5462:
--
Component/s: (was: PySpark)

> Catalyst UnresolvedException "Invalid call to qualifiers on unresolved 
> object" error when accessing fields in DataFrames returned from sqlCtx.sql()
> ---
>
> Key: SPARK-5462
> URL: https://issues.apache.org/jira/browse/SPARK-5462
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> When trying to access fields on a Python DataFrame created via inferSchema, I 
> ran into a confusing Catalyst Py4J error.  Here's a reproduction:
> {code}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext, Row
> sc = SparkContext("local", "test")
> sqlContext = SQLContext(sc)
> # Load a text file and convert each line to a Row.
> lines = sc.textFile("examples/src/main/resources/people.txt")
> parts = lines.map(lambda l: l.split(","))
> people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
> # Infer the schema, and register the SchemaRDD as a table.
> schemaPeople = sqlContext.inferSchema(people)
> schemaPeople.registerTempTable("people")
> # SQL can be run over SchemaRDDs that have been registered as a table.
> teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age 
> <= 19")
> print teenagers.name
> {code}
> This fails with the following error:
> {code}
> Traceback (most recent call last):
>   File "/Users/joshrosen/Documents/spark/sqltest.py", line 19, in 
> print teenagers.name
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, 
> in __getattr__
> return Column(self._jdf.apply(name))
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o66.apply.
> : org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> qualifiers on unresolved object, tree: 'name
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:46)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:143)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:140)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:140)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This is distinct from the helpful error message that I get when trying to 
> access a non-existent column.  This error didn't occur when I tried the same 
> thing with a DataFrame created via jsonRDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5462) Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in DataFrames returned from sqlCtx.sql()

2015-01-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5462:
--
Assignee: (was: Josh Rosen)

> Catalyst UnresolvedException "Invalid call to qualifiers on unresolved 
> object" error when accessing fields in DataFrames returned from sqlCtx.sql()
> ---
>
> Key: SPARK-5462
> URL: https://issues.apache.org/jira/browse/SPARK-5462
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> When trying to access fields on a Python DataFrame created via inferSchema, I 
> ran into a confusing Catalyst Py4J error.  Here's a reproduction:
> {code}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext, Row
> sc = SparkContext("local", "test")
> sqlContext = SQLContext(sc)
> # Load a text file and convert each line to a Row.
> lines = sc.textFile("examples/src/main/resources/people.txt")
> parts = lines.map(lambda l: l.split(","))
> people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
> # Infer the schema, and register the SchemaRDD as a table.
> schemaPeople = sqlContext.inferSchema(people)
> schemaPeople.registerTempTable("people")
> # SQL can be run over SchemaRDDs that have been registered as a table.
> teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age 
> <= 19")
> print teenagers.name
> {code}
> This fails with the following error:
> {code}
> Traceback (most recent call last):
>   File "/Users/joshrosen/Documents/spark/sqltest.py", line 19, in 
> print teenagers.name
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, 
> in __getattr__
> return Column(self._jdf.apply(name))
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o66.apply.
> : org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> qualifiers on unresolved object, tree: 'name
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:46)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:143)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:140)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:140)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This is distinct from the helpful error message that I get when trying to 
> access a non-existent column.  This error didn't occur when I tried the same 
> thing with a DataFrame created via jsonRDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5462) Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in DataFrames returned from sqlCtx.sql()

2015-01-29 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297934#comment-14297934
 ] 

Josh Rosen commented on SPARK-5462:
---

Actually, this issue isn't Python-specific: it also occurs when running the 
"people / teenagers" example from the SQL Programming Guide in the regular 
Spark Shell:

{code}
scala> teenagers("name")
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
qualifiers on unresolved object, tree: 'name
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:50)
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:46)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:143)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:140)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:140)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:120)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:258)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:20)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:25)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
at $iwC$$iwC$$iwC.<init>(<console>:29)
at $iwC$$iwC.<init>(<console>:31)
at $iwC.<init>(<console>:33)
at <init>(<console>:35)
at .<init>(<console>:39)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:854)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:899)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:811)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:654)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:662)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:667)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:994)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:942)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:942)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:942)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1039)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:366)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

> Catalyst UnresolvedException "Invalid call to qualifiers on unresolved 
> object" error when accessing fields in DataFrames returned from sqlCtx.sql()
> ---
>
> Key: SPARK-5462
> URL: 

[jira] [Updated] (SPARK-5462) Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in DataFrames returned from sqlCtx.sql()

2015-01-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5462:
--
Summary: Catalyst UnresolvedException "Invalid call to qualifiers on 
unresolved object" error when accessing fields in DataFrames returned from 
sqlCtx.sql()  (was: Catalyst UnresolvedException "Invalid call to qualifiers on 
unresolved object" error when accessing fields in Python DataFrame)

> Catalyst UnresolvedException "Invalid call to qualifiers on unresolved 
> object" error when accessing fields in DataFrames returned from sqlCtx.sql()
> ---
>
> Key: SPARK-5462
> URL: https://issues.apache.org/jira/browse/SPARK-5462
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> When trying to access fields on a Python DataFrame created via inferSchema, I 
> ran into a confusing Catalyst Py4J error.  Here's a reproduction:
> {code}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext, Row
> sc = SparkContext("local", "test")
> sqlContext = SQLContext(sc)
> # Load a text file and convert each line to a Row.
> lines = sc.textFile("examples/src/main/resources/people.txt")
> parts = lines.map(lambda l: l.split(","))
> people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
> # Infer the schema, and register the SchemaRDD as a table.
> schemaPeople = sqlContext.inferSchema(people)
> schemaPeople.registerTempTable("people")
> # SQL can be run over SchemaRDDs that have been registered as a table.
> teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age 
> <= 19")
> print teenagers.name
> {code}
> This fails with the following error:
> {code}
> Traceback (most recent call last):
>   File "/Users/joshrosen/Documents/spark/sqltest.py", line 19, in 
> print teenagers.name
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, 
> in __getattr__
> return Column(self._jdf.apply(name))
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o66.apply.
> : org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> qualifiers on unresolved object, tree: 'name
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:46)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:143)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:140)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:140)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This is distinct from the helpful error message that I get when trying to 
> access a non-existent column.  This error didn't occur when I tried the same 
> thing with a DataFrame created via jsonRDD.



--
This message was sent 

[jira] [Resolved] (SPARK-5373) literal in agg grouping expressions leads to incorrect result

2015-01-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5373.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4169
[https://github.com/apache/spark/pull/4169]

>  literal in agg grouping expressions leads to incorrect result
> ---
>
> Key: SPARK-5373
> URL: https://issues.apache.org/jira/browse/SPARK-5373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: wangfei
> Fix For: 1.3.0
>
>
> select key, count( * ) from src group by key, 1 will get the wrong answer!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4133) PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0

2015-01-29 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297919#comment-14297919
 ] 

Derrick Burns commented on SPARK-4133:
--

I worked around it, so feel free

On Thu, Jan 29, 2015 at 11:28 AM, Tathagata Das (JIRA) 



> PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0
> --
>
> Key: SPARK-4133
> URL: https://issues.apache.org/jira/browse/SPARK-4133
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Antonio Jesus Navarro
> Attachments: spark_ex.logs
>
>
> Snappy related problems found when trying to upgrade existing Spark Streaming 
> App from 1.0.2 to 1.1.0.
> We can not run an existing 1.0.2 spark app if upgraded to 1.1.0
> > IOException is thrown by snappy (parsing_error(2))
> {code}
> Executor task launch worker-0 DEBUG storage.BlockManager - Getting local 
> block broadcast_0
> Executor task launch worker-0 DEBUG storage.BlockManager - Level for block 
> broadcast_0 is StorageLevel(true, true, false, true, 1)
> Executor task launch worker-0 DEBUG storage.BlockManager - Getting block 
> broadcast_0 from memory
> Executor task launch worker-0 DEBUG storage.BlockManager - Getting local 
> block broadcast_0
> Executor task launch worker-0 DEBUG executor.Executor - Task 0's epoch is 0
> Executor task launch worker-0 DEBUG storage.BlockManager - Block broadcast_0 
> not registered locally
> Executor task launch worker-0 INFO  broadcast.TorrentBroadcast - Started 
> reading broadcast variable 0
> sparkDriver-akka.actor.default-dispatcher-4 INFO  
> receiver.ReceiverSupervisorImpl - Registered receiver 0
> Executor task launch worker-0 INFO  util.RecurringTimer - Started timer for 
> BlockGenerator at time 1414656492400
> Executor task launch worker-0 INFO  receiver.BlockGenerator - Started 
> BlockGenerator
> Thread-87 INFO  receiver.BlockGenerator - Started block pushing thread
> Executor task launch worker-0 INFO  receiver.ReceiverSupervisorImpl - 
> Starting receiver
> sparkDriver-akka.actor.default-dispatcher-5 INFO  scheduler.ReceiverTracker - 
> Registered receiver for stream 0 from akka://sparkDriver
> Executor task launch worker-0 INFO  kafka.KafkaReceiver - Starting Kafka 
> Consumer Stream with group: stratioStreaming
> Executor task launch worker-0 INFO  kafka.KafkaReceiver - Connecting to 
> Zookeeper: node.stratio.com:2181
> sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] 
> received message StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 
> cap=0]) from Actor[akka://sparkDriver/deadLetters]
> sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] 
> received message StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 
> cap=0]) from Actor[akka://sparkDriver/deadLetters]
> sparkDriver-akka.actor.default-dispatcher-6 DEBUG local.LocalActor - [actor] 
> received message StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 
> cap=0]) from Actor[akka://sparkDriver/deadLetters]
> sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] 
> handled message (8.442354 ms) 
> StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from 
> Actor[akka://sparkDriver/deadLetters]
> sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] 
> handled message (8.412421 ms) 
> StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from 
> Actor[akka://sparkDriver/deadLetters]
> sparkDriver-akka.actor.default-dispatcher-6 DEBUG local.LocalActor - [actor] 
> handled message (8.385471 ms) 
> StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from 
> Actor[akka://sparkDriver/deadLetters]
> Executor task launch worker-0 INFO  utils.VerifiableProperties - Verifying 
> properties
> Executor task launch worker-0 INFO  utils.VerifiableProperties - Property 
> group.id is overridden to stratioStreaming
> Executor task launch worker-0 INFO  utils.VerifiableProperties - Property 
> zookeeper.connect is overridden to node.stratio.com:2181
> Executor task launch worker-0 INFO  utils.VerifiableProperties - Property 
> zookeeper.connection.timeout.ms is overridden to 1
> Executor task launch worker-0 INFO  broadcast.TorrentBroadcast - Reading 
> broadcast variable 0 took 0.033998997 s
> Executor task launch worker-0 INFO  consumer.ZookeeperConsumerConnector - 
> [stratioStreaming_ajn-stratio-1414656492293-8ecb3e3a], Connecting to 
> zookeeper instance at node.stratio.com:2181
> Executor task launch worker-0 DEBUG zkclient.ZkConnection - Creating new 
> ZookKeeper instance to connect to node.stratio.com:2181.
> ZkClient-EventThread-169-node.stratio.com:2181 INFO  zkclient.ZkEventThread - 
> Starting ZkClient event thread.
> Executor task launch worker-0 

[jira] [Resolved] (SPARK-5367) support star expression in udf

2015-01-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5367.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4163
[https://github.com/apache/spark/pull/4163]

> support star expression in udf
> --
>
> Key: SPARK-5367
> URL: https://issues.apache.org/jira/browse/SPARK-5367
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: wangfei
> Fix For: 1.3.0
>
>
> now spark sql does not support star expression in udf, the following sql will 
> get an error
> ```
> select concat( * ) from src
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5462) Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in Python DataFrame

2015-01-29 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297913#comment-14297913
 ] 

Josh Rosen commented on SPARK-5462:
---

I'm working on a patch for this now.  It looks like the problem crops up when 
trying to select columns from DataFrames that are returned by SQL queries, as 
opposed to ones created by applying or inferring a schema.  Here's a regression 
test demonstrating this:

{code}

def test_column_selection_on_dataframes_created_by_queries(self):
    # Regression test for SPARK-5462
    df = self.df
    df.registerTempTable("test")
    df_from_query = self.sqlCtx.sql("select key, value from test")
    df_from_query.key  # Throws exception
    df_from_query.value
{code}
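
Until the fix lands, a minimal workaround sketch (not from the ticket; it reuses the sqlContext and people table from the reproduction in the description, and simply sidesteps column resolution by reading fields off collected Row objects):

{code}
# Workaround sketch (illustrative only): avoid attribute-style column lookup on
# the DataFrame returned by sql() and read the field from the collected Rows.
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
for row in teenagers.collect():
    print row.name
{code}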

> Catalyst UnresolvedException "Invalid call to qualifiers on unresolved 
> object" error when accessing fields in Python DataFrame
> --
>
> Key: SPARK-5462
> URL: https://issues.apache.org/jira/browse/SPARK-5462
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> When trying to access fields on a Python DataFrame created via inferSchema, I 
> ran into a confusing Catalyst Py4J error.  Here's a reproduction:
> {code}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext, Row
> sc = SparkContext("local", "test")
> sqlContext = SQLContext(sc)
> # Load a text file and convert each line to a Row.
> lines = sc.textFile("examples/src/main/resources/people.txt")
> parts = lines.map(lambda l: l.split(","))
> people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
> # Infer the schema, and register the SchemaRDD as a table.
> schemaPeople = sqlContext.inferSchema(people)
> schemaPeople.registerTempTable("people")
> # SQL can be run over SchemaRDDs that have been registered as a table.
> teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age 
> <= 19")
> print teenagers.name
> {code}
> This fails with the following error:
> {code}
> Traceback (most recent call last):
>   File "/Users/joshrosen/Documents/spark/sqltest.py", line 19, in 
> print teenagers.name
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, 
> in __getattr__
> return Column(self._jdf.apply(name))
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o66.apply.
> : org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> qualifiers on unresolved object, tree: 'name
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:46)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:143)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:140)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:140)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(Gatew

[jira] [Resolved] (SPARK-4786) Parquet filter pushdown for BYTE and SHORT types

2015-01-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4786.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4156
[https://github.com/apache/spark/pull/4156]

> Parquet filter pushdown for BYTE and SHORT types
> 
>
> Key: SPARK-4786
> URL: https://issues.apache.org/jira/browse/SPARK-4786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
> Fix For: 1.3.0
>
>
> Among all integral types, currently only INT and LONG predicates can be 
> converted to Parquet filter predicates. BYTE and SHORT predicates can be 
> covered by INT.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5309) Reduce Binary/String conversion overhead when reading/writing Parquet files

2015-01-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5309.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4187
[https://github.com/apache/spark/pull/4187]

> Reduce Binary/String conversion overhead when reading/writing Parquet files
> ---
>
> Key: SPARK-5309
> URL: https://issues.apache.org/jira/browse/SPARK-5309
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: MIchael Davies
>Priority: Minor
> Fix For: 1.3.0
>
>
> Converting between Parquet Binary and Java Strings can form a significant 
> proportion of query times.
> For columns which have repeated String values (which is common) the same 
> Binary will be repeatedly being converted. 
> A simple change to cache the last converted String per column was shown to 
> reduce query times by 25% when grouping on a data set of 66M rows on a column 
> with many repeated Strings.
> A possible optimisation would be to hand responsibility for Binary 
> encoding/decoding over to Parquet so that it could ensure that this was done 
> only once per Binary value. 
> Next step is to look at Parquet code and to discuss with that project, which 
> I will do.
> More details are available on this discussion:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-td10141.html
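
A minimal sketch of the per-column caching idea described above (illustrative Python only; the actual change lives in Spark's Scala Parquet converters, and the class name here is made up):

{code}
class CachedBinaryToString(object):
    """Memoize the most recent Binary -> String conversion for one column."""

    def __init__(self):
        self._last_bytes = None
        self._last_string = None

    def convert(self, raw_bytes):
        # Repeated column values hit the cache and skip the UTF-8 decode.
        if raw_bytes != self._last_bytes:
            self._last_bytes = raw_bytes
            self._last_string = raw_bytes.decode("utf-8")
        return self._last_string

# Usage: the same repeated value for a column is decoded only once.
conv = CachedBinaryToString()
assert conv.convert(b"spark") is conv.convert(b"spark")
{code}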



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5462) Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in Python DataFrame

2015-01-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-5462:
-

Assignee: Josh Rosen

> Catalyst UnresolvedException "Invalid call to qualifiers on unresolved 
> object" error when accessing fields in Python DataFrame
> --
>
> Key: SPARK-5462
> URL: https://issues.apache.org/jira/browse/SPARK-5462
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> When trying to access fields on a Python DataFrame created via inferSchema, I 
> ran into a confusing Catalyst Py4J error.  Here's a reproduction:
> {code}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext, Row
> sc = SparkContext("local", "test")
> sqlContext = SQLContext(sc)
> # Load a text file and convert each line to a Row.
> lines = sc.textFile("examples/src/main/resources/people.txt")
> parts = lines.map(lambda l: l.split(","))
> people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
> # Infer the schema, and register the SchemaRDD as a table.
> schemaPeople = sqlContext.inferSchema(people)
> schemaPeople.registerTempTable("people")
> # SQL can be run over SchemaRDDs that have been registered as a table.
> teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age 
> <= 19")
> print teenagers.name
> {code}
> This fails with the following error:
> {code}
> Traceback (most recent call last):
>   File "/Users/joshrosen/Documents/spark/sqltest.py", line 19, in 
> print teenagers.name
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, 
> in __getattr__
> return Column(self._jdf.apply(name))
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o66.apply.
> : org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> qualifiers on unresolved object, tree: 'name
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:46)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:143)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:140)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:140)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This is distinct from the helpful error message that I get when trying to 
> access a non-existent column.  This error didn't occur when I tried the same 
> thing with a DataFrame created via jsonRDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5429) Can't generate Hive golden answer on Hive 0.13.1

2015-01-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust closed SPARK-5429.
---
   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Liang-Chi Hsieh

> Can't generate Hive golden answer on Hive 0.13.1
> 
>
> Key: SPARK-5429
> URL: https://issues.apache.org/jira/browse/SPARK-5429
> Project: Spark
>  Issue Type: Bug
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 1.3.0
>
>
> I found that running HiveComparisonTest.createQueryTest to generate Hive 
> golden answer files on Hive 0.13.1 would throw a KryoException. Since Hive 
> 0.13.0, Kryo plan serialization has been introduced alongside the javaXML one. 
> This is a quick fix to set the Hive configuration to use javaXML serialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5490) KMeans costs can be incorrect if tasks need to be rerun

2015-01-29 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-5490:
-

 Summary: KMeans costs can be incorrect if tasks need to be rerun
 Key: SPARK-5490
 URL: https://issues.apache.org/jira/browse/SPARK-5490
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza


KMeans uses accumulators to compute the cost of a clustering at each iteration.

Each time a ShuffleMapTask completes, it increments the accumulators at the 
driver.  If a task runs twice because of failures, the accumulators get 
incremented twice.

KMeans uses accumulators in ShuffleMapTasks.  This means that a task's cost can 
end up being double-counted.
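
A minimal, self-contained sketch of the failure mode (illustrative PySpark, not the MLlib code; the inflation only appears when a map task is actually re-executed after a failure):

{code}
from pyspark import SparkContext

sc = SparkContext("local", "accumulator-double-count")
cost = sc.accumulator(0.0)

def tag(x):
    cost.add(x)  # runs once per task *attempt*, so a retried task adds it again
    return (int(x) % 2, x)

sc.parallelize([1.0, 2.0, 3.0]).map(tag).reduceByKey(lambda a, b: a + b).collect()
print cost.value  # 6.0 normally, but larger if any map task was re-run
sc.stop()
{code}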



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5489) KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create (I)Lscala/runtime/IntRef;

2015-01-29 Thread DeepakVohra (JIRA)
DeepakVohra created SPARK-5489:
--

 Summary: KMeans clustering java.lang.NoSuchMethodError: 
scala.runtime.IntRef.create  (I)Lscala/runtime/IntRef;
 Key: SPARK-5489
 URL: https://issues.apache.org/jira/browse/SPARK-5489
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
 Environment: Spark 1.2 
Maven
Reporter: DeepakVohra


The KMeans clustering generates the following error, which also seems to be due to 
a version mismatch between the Scala used to compile Spark and the Scala in the 
Spark 1.2 Maven dependency. 

Exception in thread "main" java.lang.NoSuchMethodError: 
scala.runtime.IntRef.create

(I)Lscala/runtime/IntRef;
at 

org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:282)
at 

org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:155)
at 

org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:132)
at 

org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:352)
at 

org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:362)
at 

org.apache.spark.mllib.clustering.KMeans.train(KMeans.scala)
at 

clusterer.kmeans.KMeansClusterer.main(KMeansClusterer.java:35)





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5483) java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;

2015-01-29 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297886#comment-14297886
 ] 

DeepakVohra commented on SPARK-5483:


Sean,

As indicated, Spark is compiled with Scala 2.10, but the Scala version packaged 
with the Spark 1.2 Maven dependency is 2.10.4, which seems to be causing a version 
mismatch and the error. 

Spark 1.2 should be packaged with Scala 2.10 instead of 2.10.4. 

thanks,
Deepak

> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
> ---
>
> Key: SPARK-5483
> URL: https://issues.apache.org/jira/browse/SPARK-5483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
> Environment: Maven
> Spark 1.2
>Reporter: DeepakVohra
>
> Naive Bayes classifier generates following error.
> ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.<init>(DenseVectorOps.scala:38)
>   at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
>   at breeze.linalg.DenseVector$.<init>(DenseVector.scala:225)
>   at breeze.linalg.DenseVector$.<clinit>(DenseVector.scala)
>   at breeze.linalg.DenseVector.<init>(DenseVector.scala:63)
>   at breeze.linalg.DenseVector$mcD$sp.<init>(DenseVector.scala:50)
>   at breeze.linalg.DenseVector$mcD$sp.<init>(DenseVector.scala:55)
>   at org.apache.spark.mllib.linalg.DenseVector.toBreeze(Vectors.scala:329)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:112)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:110)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:200)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:199)
>   at 
> org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:142)
>   at 
> org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
>   at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:205)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/01/28 21:50:06 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-0,5,main]
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188)
>   at 
> breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303)
>   at 
> breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87)
>   at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303)
>   at 
> breeze.linalg.operators.DenseVectorOps$$anon$1.<init>(DenseVectorOps.scala:38)
>   at 
> breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22)
>   at breeze.linalg.DenseVector$.<init>(DenseVector.scala:225)
>   at breeze.linalg.DenseVector$.<clinit>(DenseVector.scala)
>   at breeze.linalg.DenseVector.<init>(DenseVector.scala:63)
>   at breeze.linalg.DenseVector$mcD$sp.<init>(DenseVector.scala:50)
>   at breeze.linalg.DenseVector$mcD$sp.<init>(DenseVector.scala:55)
>   at org.apache.spark.mllib.linalg.DenseVector.toBreeze(Vectors.scala:329)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:112)
>   at 
> org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:110)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:200)
>   at 

[jira] [Commented] (SPARK-5486) Add validate function for BlockMatrix

2015-01-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297859#comment-14297859
 ] 

Apache Spark commented on SPARK-5486:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/4279

> Add validate function for BlockMatrix
> -
>
> Key: SPARK-5486
> URL: https://issues.apache.org/jira/browse/SPARK-5486
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Burak Yavuz
>Priority: Critical
>
> BlockMatrix needs a validate method to make debugging easy for users. 
> It will be an expensive method to perform, but it would be useful for users 
> to know why `multiply` or `add` didn't work properly.
> Things to validate:
> - MatrixBlocks that are not on the edges should have the dimensions 
> `rowsPerBlock` and `colsPerBlock`.
> - There should be at most one block for each index
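
A minimal sketch of those checks (illustrative Python over a plain list of blocks; the real method operates on the Scala BlockMatrix and its underlying RDD, so the names and signature here are assumptions):

{code}
def validate_blocks(blocks, rows_per_block, cols_per_block,
                    num_row_blocks, num_col_blocks):
    """blocks: iterable of ((row_index, col_index), matrix) where matrix has .shape."""
    seen = set()
    for (i, j), m in blocks:
        # At most one block per (row, column) index.
        if (i, j) in seen:
            raise ValueError("more than one block for index (%d, %d)" % (i, j))
        seen.add((i, j))
        # Interior blocks must be exactly rowsPerBlock x colsPerBlock;
        # only the last row/column of blocks may be smaller.
        if i < num_row_blocks - 1 and m.shape[0] != rows_per_block:
            raise ValueError("block (%d, %d) has %d rows, expected %d"
                             % (i, j, m.shape[0], rows_per_block))
        if j < num_col_blocks - 1 and m.shape[1] != cols_per_block:
            raise ValueError("block (%d, %d) has %d columns, expected %d"
                             % (i, j, m.shape[1], cols_per_block))
{code}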



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-603) add simple Counter API

2015-01-29 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reopened SPARK-603:
--

> add simple Counter API
> --
>
> Key: SPARK-603
> URL: https://issues.apache.org/jira/browse/SPARK-603
> Project: Spark
>  Issue Type: New Feature
>Priority: Minor
>
> Users need a very simple way to create counters in their jobs.  Accumulators 
> provide a way to do this, but are a little clunky, for two reasons:
> 1) the setup is a nuisance
> 2) w/ delayed evaluation, you don't know when it will actually run, so it's 
> hard to look at the values
> consider this code:
> {code}
> def filterBogus(rdd:RDD[MyCustomClass], sc: SparkContext) = {
>   val filterCount = sc.accumulator(0)
>   val filtered = rdd.filter{r =>
>     if (isOK(r)) true else { filterCount += 1; false }
>   }
>   println("removed " + filterCount.value + " records")
>   filtered
> }
> {code}
> The println will always say 0 records were filtered, because it's printed 
> before anything has actually run.  I could print out the value later on, but 
> note that it would destroy the modularity of the method -- kinda ugly to 
> return the accumulator just so that it can get printed later on.  (and of 
> course, the caller in turn might not know when the filter is going to get 
> applied, and would have to pass the accumulator up even further ...)
> I'd like to have Counters which just automatically get printed out whenever a 
> stage has been run, and also with some api to get them back.  I realize this 
> is tricky b/c a stage can get re-computed, so maybe you should only increment 
> the counters once.
> Maybe a more general way to do this is to provide some callback for whenever 
> an RDD is computed -- by default, you would just print the counters, but the 
> user could replace w/ a custom handler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5464) Calling help() on a Python DataFrame fails with "cannot resolve column name __name__" error

2015-01-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297842#comment-14297842
 ] 

Apache Spark commented on SPARK-5464:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4278

> Calling help() on a Python DataFrame fails with "cannot resolve column name 
> __name__" error
> ---
>
> Key: SPARK-5464
> URL: https://issues.apache.org/jira/browse/SPARK-5464
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> Trying to call {{help()}} on a Python DataFrame fails with an exception:
> {code}
> >>> help(df)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/joshrosen/anaconda/lib/python2.7/site.py", line 464, in 
> __call__
> return pydoc.help(*args, **kwds)
>   File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1787, in 
> __call__
> self.help(request)
>   File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1834, in help
> else: doc(request, 'Help on %s:')
>   File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1571, in doc
> pager(render_doc(thing, title, forceload))
>   File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1545, in 
> render_doc
> object, name = resolve(thing, forceload)
>   File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1540, in 
> resolve
> name = getattr(thing, '__name__', None)
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, 
> in __getattr__
> return Column(self._jdf.apply(name))
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o31.apply.
> : java.lang.RuntimeException: Cannot resolve column name "__name__"
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:123)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:123)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Here's a reproduction:
> {code}
> >>> from pyspark.sql import SQLContext, Row
> >>> sqlContext = SQLContext(sc)
> >>> rdd = sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}'])
> >>> df = sqlContext.jsonRDD(rdd)
> >>> help(df)
> {code}
> I think the problem here is that we don't throw the expected exception from 
> our overloaded {{getattr}} if a column can't be found.
> We should be able to fix this by only attempting to call {{apply}} after 
> checking that the column name is valid (e.g. check against {{columns}}).
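
A minimal sketch of that guard, written as a replacement body for the {{__getattr__}} shown in the traceback (it assumes the existing {{columns}} property and JVM-side {{apply}} call):
{code}
def __getattr__(self, name):
    # Only forward to the JVM-side apply() for real column names; anything else
    # (e.g. pydoc probing for __name__) raises AttributeError so help() works.
    if name not in self.columns:
        raise AttributeError(name)
    return Column(self._jdf.apply(name))
{code}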



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3888) Limit the memory used by python worker

2015-01-29 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-3888.
-
Resolution: Won't Fix

> Limit the memory used by python worker
> --
>
> Key: SPARK-3888
> URL: https://issues.apache.org/jira/browse/SPARK-3888
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.1.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, we do not limit the memory used by Python workers, so they may run 
> out of memory and freeze the OS. It would be safe to have a configurable hard 
> limit for it, which should be larger than spark.executor.python.memory.
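
A sketch (not Spark code; the setting name is the one mentioned above) of how such a hard cap could be enforced inside the worker process with Python's standard {{resource}} module:
{code}
import resource

def set_memory_limit(limit_bytes):
    # Cap the worker's total address space; allocations beyond the limit raise
    # MemoryError in the worker instead of driving the whole machine into swap.
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))

# e.g. called early in worker startup with the configured value
# (spark.executor.python.memory, hypothetically parsed to bytes):
# set_memory_limit(512 * 1024 * 1024)
{code}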



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4939) Python updateStateByKey example hang in local mode

2015-01-29 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-4939:
--
Affects Version/s: (was: 1.2.0)

> Python updateStateByKey example hang in local mode
> --
>
> Key: SPARK-4939
> URL: https://issues.apache.org/jira/browse/SPARK-4939
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, Streaming
>Affects Versions: 1.3.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5151) Parquet Predicate Pushdown Does Not Work with Nested Structures.

2015-01-29 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-5151:
--
Component/s: (was: Spark Core)

> Parquet Predicate Pushdown Does Not Work with Nested Structures.
> 
>
> Key: SPARK-5151
> URL: https://issues.apache.org/jira/browse/SPARK-5151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: pyspark, spark-ec2 created cluster
>Reporter: Brad Willard
>  Labels: parquet, pyspark, sql
>
> I have json files of objects created with a nested structure roughly of the 
> form:
> { id: 123, event: "login", meta_data: {"user": "user1"}}
> 
> { id: 125, event: "login", meta_data: {"user": "user2"}}
> I load the data via spark with
> rdd = sql_context.jsonFile()
> # save it as a parquet file
> rdd.saveAsParquetFile()
> rdd = sql_context.parquetFile()
> rdd.registerTempTable('events')
> So if I run this query, it works without issue when predicate pushdown is 
> disabled:
> select count(1) from events where meta_data.user = "user1"
> If I enable predicate pushdown, I get an error saying meta_data.user is not in 
> the schema:
> Py4JJavaError: An error occurred while calling o218.collect.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 125 
> in stage 12.0 failed 4 times, most recent failure: Lost task 125.3 in stage 
> 12.0 (TID 6164, ): java.lang.IllegalArgumentException: Column [user] was not 
> found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
> .
> I expect this is actually related to another bug I filed where nested 
> structure is not preserved with spark sql.
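
A hedged PySpark sketch of the toggle described above; the config key {{spark.sql.parquet.filterPushdown}} is assumed from Spark 1.2's SQLConf, and the path is illustrative:
{code}
from pyspark.sql import SQLContext

sql_context = SQLContext(sc)  # sc: an existing SparkContext
events = sql_context.parquetFile("/path/to/events.parquet")  # illustrative path
events.registerTempTable("events")

# Works when predicate pushdown is disabled:
sql_context.sql("SET spark.sql.parquet.filterPushdown=false").collect()
sql_context.sql('select count(1) from events where meta_data.user = "user1"').collect()

# Fails with 'Column [user] was not found in schema!' once pushdown is enabled:
sql_context.sql("SET spark.sql.parquet.filterPushdown=true").collect()
sql_context.sql('select count(1) from events where meta_data.user = "user1"').collect()
{code}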



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5151) Parquet Predicate Pushdown Does Not Work with Nested Structures.

2015-01-29 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-5151:
--
Component/s: SQL

> Parquet Predicate Pushdown Does Not Work with Nested Structures.
> 
>
> Key: SPARK-5151
> URL: https://issues.apache.org/jira/browse/SPARK-5151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: pyspark, spark-ec2 created cluster
>Reporter: Brad Willard
>  Labels: parquet, pyspark, sql
>
> I have json files of objects created with a nested structure roughly of the 
> form:
> { id: 123, event: "login", meta_data: {"user": "user1"}}
> 
> { id: 125, event: "login", meta_data: {"user": "user2"}}
> I load the data via spark with
> rdd = sql_context.jsonFile()
> # save it as a parquet file
> rdd.saveAsParquetFile()
> rdd = sql_context.parquetFile()
> rdd.registerTempTable('events')
> So if I run this query, it works without issue when predicate pushdown is 
> disabled:
> select count(1) from events where meta_data.user = "user1"
> If I enable predicate pushdown, I get an error saying meta_data.user is not in 
> the schema:
> Py4JJavaError: An error occurred while calling o218.collect.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 125 
> in stage 12.0 failed 4 times, most recent failure: Lost task 125.3 in stage 
> 12.0 (TID 6164, ): java.lang.IllegalArgumentException: Column [user] was not 
> found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
> .
> I expect this is actually related to another bug I filed where nested 
> structure is not preserved with spark sql.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5464) Calling help() on a Python DataFrame fails with "cannot resolve column name __name__" error

2015-01-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-5464:
-

Assignee: Josh Rosen

> Calling help() on a Python DataFrame fails with "cannot resolve column name 
> __name__" error
> ---
>
> Key: SPARK-5464
> URL: https://issues.apache.org/jira/browse/SPARK-5464
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> Trying to call {{help()}} on a Python DataFrame fails with an exception:
> {code}
> >>> help(df)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/joshrosen/anaconda/lib/python2.7/site.py", line 464, in 
> __call__
> return pydoc.help(*args, **kwds)
>   File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1787, in 
> __call__
> self.help(request)
>   File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1834, in help
> else: doc(request, 'Help on %s:')
>   File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1571, in doc
> pager(render_doc(thing, title, forceload))
>   File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1545, in 
> render_doc
> object, name = resolve(thing, forceload)
>   File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1540, in 
> resolve
> name = getattr(thing, '__name__', None)
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, 
> in __getattr__
> return Column(self._jdf.apply(name))
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o31.apply.
> : java.lang.RuntimeException: Cannot resolve column name "__name__"
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:123)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:123)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Here's a reproduction:
> {code}
> >>> from pyspark.sql import SQLContext, Row
> >>> sqlContext = SQLContext(sc)
> >>> rdd = sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}'])
> >>> df = sqlContext.jsonRDD(rdd)
> >>> help(df)
> {code}
> I think the problem here is that we don't throw the expected exception from 
> our overloaded {{getattr}} if a column can't be found.
> We should be able to fix this by only attempting to call {{apply}} after 
> checking that the column name is valid (e.g. check against {{columns}}).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5445) Make sure DataFrame expressions are usable in Java

2015-01-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297714#comment-14297714
 ] 

Apache Spark commented on SPARK-5445:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/4276

> Make sure DataFrame expressions are usable in Java
> --
>
> Key: SPARK-5445
> URL: https://issues.apache.org/jira/browse/SPARK-5445
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.3.0
>
>
> Some DataFrame expressions are not exactly usable in Java. For example, 
> aggregate functions are only defined in the dsl package object, which is 
> painful to use. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5192) Parquet fails to parse schema contains '\r'

2015-01-29 Thread Rekha Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297710#comment-14297710
 ] 

Rekha Joshi commented on SPARK-5192:


I have made a Parquet patch for it. Thanks.

> Parquet fails to parse schema contains '\r'
> ---
>
> Key: SPARK-5192
> URL: https://issues.apache.org/jira/browse/SPARK-5192
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: windows7 + Intellj idea 13.0.2 
>Reporter: cen yuhai
>Priority: Minor
> Fix For: 1.3.0
>
>
> I think this is actually a bug in Parquet. When I debugged 'ParquetTestData', 
> I found the exception below. So I downloaded the source of MessageTypeParser; 
> the function 'isWhitespace' does not check for '\r':
> private boolean isWhitespace(String t) {
>   return t.equals(" ") || t.equals("\t") || t.equals("\n");
> }
> So I replaced all '\r' to work around this issue:
>   val subTestSchema =
> """
>   message myrecord {
>   optional boolean myboolean;
>   optional int64 mylong;
>   }
> """.replaceAll("\r","")
> at line 0: message myrecord {
>   at 
> parquet.schema.MessageTypeParser.asRepetition(MessageTypeParser.java:203)
>   at parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:101)
>   at 
> parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:96)
>   at parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:89)
>   at 
> parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:79)
>   at 
> org.apache.spark.sql.parquet.ParquetTestData$.writeFile(ParquetTestData.scala:221)
>   at 
> org.apache.spark.sql.parquet.ParquetQuerySuite.beforeAll(ParquetQuerySuite.scala:92)
>   at 
> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
>   at 
> org.apache.spark.sql.parquet.ParquetQuerySuite.beforeAll(ParquetQuerySuite.scala:85)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
>   at 
> org.apache.spark.sql.parquet.ParquetQuerySuite.run(ParquetQuerySuite.scala:85)
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5488) SPARK_LOCAL_IP not read by mesos scheduler

2015-01-29 Thread Martin Tapp (JIRA)
Martin Tapp created SPARK-5488:
--

 Summary: SPARK_LOCAL_IP not read by mesos scheduler
 Key: SPARK-5488
 URL: https://issues.apache.org/jira/browse/SPARK-5488
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.1.1
Reporter: Martin Tapp
Priority: Minor


My environment sets SPARK_LOCAL_IP and my driver sees it, but Mesos sees the 
address of my first available network adapter.

I can even see that SPARK_LOCAL_IP is read correctly by Utils.localHostName and 
Utils.localIpAddress (core/src/main/scala/org/apache/spark/util/Utils.scala). 
It seems the Spark Mesos framework doesn't use it.

The workaround for now is to disable my first adapter so that the second one 
becomes the one seen by Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5461) Graph should have isCheckpointed, getCheckpointFiles methods

2015-01-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297663#comment-14297663
 ] 

Joseph K. Bradley commented on SPARK-5461:
--

That sounds great if partitionsRDD can be non-transient.  I'll try it but may 
need to ask for your help about the bugs.  I'll ping you on the PR if so.  
Thanks!

> Graph should have isCheckpointed, getCheckpointFiles methods
> 
>
> Key: SPARK-5461
> URL: https://issues.apache.org/jira/browse/SPARK-5461
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> Graph has a checkpoint method but does not have other helper functionality 
> which RDD has.  Proposal:
> {code}
>   /**
>* Return whether this Graph has been checkpointed or not
>*/
>   def isCheckpointed: Boolean
>   /**
>* Gets the name of the files to which this Graph was checkpointed
>*/
>   def getCheckpointFiles: Seq[String]
> {code}
> I need this for [SPARK-1405].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5466) Build Error caused by Guava shading in Spark

2015-01-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5466.

   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Marcelo Vanzin

Thanks [~vanzin] for quickly fixing this!

> Build Error caused by Guava shading in Spark
> 
>
> Key: SPARK-5466
> URL: https://issues.apache.org/jira/browse/SPARK-5466
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.0
>Reporter: Jian Zhou
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 1.3.0
>
>
> Guava is shaded inside spark-core itself.
> https://github.com/apache/spark/commit/37a5e272f898e946c09c2e7de5d1bda6f27a8f39
> This causes build error in multiple components, including Graph/MLLib/SQL, 
> when package com.google.common on the classpath incompatible with the version 
> used when compiling Utils.class
> [error] bad symbolic reference. A signature in Utils.class refers to term util
> [error] in package com.google.common which is not available.
> [error] It may be completely missing from the current classpath, or the 
> version on
> [error] the classpath might be incompatible with the version used when 
> compiling Utils.class.
> [error] 
> [error]  while compiling: 
> /spark/graphx/src/main/scala/org/apache/spark/graphx/util/BytecodeUtils.scala
> [error] during phase: erasure
> [error]  library version: version 2.10.4
> [error] compiler version: version 2.10.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-01-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297652#comment-14297652
 ] 

Joseph K. Bradley commented on SPARK-5021:
--

You can also generate the documentation yourself: 
[https://github.com/apache/spark/blob/master/docs/README.md]

> GaussianMixtureEM should be faster for SparseVector input
> -
>
> Key: SPARK-5021
> URL: https://issues.apache.org/jira/browse/SPARK-5021
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Manoj Kumar
>
> GaussianMixtureEM currently converts everything to dense vectors.  It would 
> be nice if it were faster for SparseVectors (running in time linear in the 
> number of non-zero values).
> However, this may not be too important since clustering should rarely be done 
> in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-01-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5021:
-
Affects Version/s: (was: 1.2.0)
   1.3.0

> GaussianMixtureEM should be faster for SparseVector input
> -
>
> Key: SPARK-5021
> URL: https://issues.apache.org/jira/browse/SPARK-5021
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Manoj Kumar
>
> GaussianMixtureEM currently converts everything to dense vectors.  It would 
> be nice if it were faster for SparseVectors (running in time linear in the 
> number of non-zero values).
> However, this may not be too important since clustering should rarely be done 
> in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture

2015-01-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5400:
-
Assignee: Travis Galoppo

> Rename GaussianMixtureEM to GaussianMixture
> ---
>
> Key: SPARK-5400
> URL: https://issues.apache.org/jira/browse/SPARK-5400
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Travis Galoppo
>Priority: Minor
>
> GaussianMixtureEM is following the old naming convention of including the 
> optimization algorithm name in the class title.  We should probably rename it 
> to GaussianMixture so that it can use other optimization algorithms in the 
> future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture

2015-01-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297648#comment-14297648
 ] 

Joseph K. Bradley commented on SPARK-5400:
--

Thanks!  Could you also please change the name of the test suite to match?

> Rename GaussianMixtureEM to GaussianMixture
> ---
>
> Key: SPARK-5400
> URL: https://issues.apache.org/jira/browse/SPARK-5400
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Travis Galoppo
>Priority: Minor
>
> GaussianMixtureEM is following the old naming convention of including the 
> optimization algorithm name in the class title.  We should probably rename it 
> to GaussianMixture so that it can use other optimization algorithms in the 
> future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-01-29 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297634#comment-14297634
 ] 

Travis Galoppo commented on SPARK-5021:
---

[~josephkb] This ticket is marked as affecting version 1.2.0 ... this should be 
1.3.0 ?

> GaussianMixtureEM should be faster for SparseVector input
> -
>
> Key: SPARK-5021
> URL: https://issues.apache.org/jira/browse/SPARK-5021
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>Assignee: Manoj Kumar
>
> GaussianMixtureEM currently converts everything to dense vectors.  It would 
> be nice if it were faster for SparseVectors (running in time linear in the 
> number of non-zero values).
> However, this may not be too important since clustering should rarely be done 
> in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-01-29 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297622#comment-14297622
 ] 

Travis Galoppo commented on SPARK-5021:
---

[~MechCoder] The documentation for GMM is not yet complete (see SPARK-5013): 
the Python interface is still being finished (SPARK-5012), after which the 
documentation can be completed.  In the meantime, I might be able to answer 
your questions about the GMM code...


> GaussianMixtureEM should be faster for SparseVector input
> -
>
> Key: SPARK-5021
> URL: https://issues.apache.org/jira/browse/SPARK-5021
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>Assignee: Manoj Kumar
>
> GaussianMixtureEM currently converts everything to dense vectors.  It would 
> be nice if it were faster for SparseVectors (running in time linear in the 
> number of non-zero values).
> However, this may not be too important since clustering should rarely be done 
> in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture

2015-01-29 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297613#comment-14297613
 ] 

Travis Galoppo commented on SPARK-5400:
---

Please assign to me and I will make the name change


> Rename GaussianMixtureEM to GaussianMixture
> ---
>
> Key: SPARK-5400
> URL: https://issues.apache.org/jira/browse/SPARK-5400
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> GaussianMixtureEM is following the old naming convention of including the 
> optimization algorithm name in the class title.  We should probably rename it 
> to GaussianMixture so that it can use other optimization algorithms in the 
> future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5322) Add transpose() to BlockMatrix

2015-01-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297486#comment-14297486
 ] 

Apache Spark commented on SPARK-5322:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/4275

> Add transpose() to BlockMatrix
> --
>
> Key: SPARK-5322
> URL: https://issues.apache.org/jira/browse/SPARK-5322
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Burak Yavuz
>
> Once Local matrices have the option to transpose, transposing a BlockMatrix 
> will be trivial. Again, this will be a flag, which will in the end affect 
> every SubMatrix in the RDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4768) Add Support For Impala Encoded Timestamp (INT96)

2015-01-29 Thread Taiji Okada (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297475#comment-14297475
 ] 

Taiji Okada commented on SPARK-4768:


[~yhuai], I've uploaded the string_timestamp tarball. It also includes a 
nanosecond precision timestamp value.

Repro:
create table string_timestamp
(
dummy string,
timestamp1 timestamp
) stored as parquet;

insert into string_timestamp (dummy,timestamp1) values('test row 1', 
'2015-01-02 20:54:05');
insert into string_timestamp (dummy,timestamp1) values('test row 2', 
'1900-01-01');
insert into string_timestamp (dummy,timestamp1) values('test row 3', 
'-12-31');
insert into string_timestamp (dummy,timestamp1) values('test row 4', null);
insert into string_timestamp (dummy,timestamp1) values('test row 5', 
'2015-01-02 20:54:10.123456789');
select * from string_timestamp;
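
For reference, a hedged PySpark 1.2 sketch of the read that trips over the INT96 column in such a table (path is illustrative):
{code}
from pyspark.sql import SQLContext

sql_context = SQLContext(sc)  # sc: an existing SparkContext
# Reading the Impala-written parquet files whose timestamp column is INT96
# currently fails at schema conversion with
# "Potential loss of precision: cannot convert INT96".
sql_context.parquetFile("/path/to/string_timestamp")  # illustrative path
{code}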


> Add Support For Impala Encoded Timestamp (INT96)
> 
>
> Key: SPARK-4768
> URL: https://issues.apache.org/jira/browse/SPARK-4768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Pat McDonough
>Priority: Critical
> Attachments: 5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq, 
> string_timestamp.gz
>
>
> Impala is using INT96 for timestamps. Spark SQL should be able to read this 
> data despite the fact that it is not part of the spec.
> Perhaps adding a flag to act like impala when reading parquet (like we do for 
> strings already) would be useful.
> Here's an example of the error you might see:
> {code}
> Caused by: java.lang.RuntimeException: Potential loss of precision: cannot 
> convert INT96
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:61)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:113)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:314)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:311)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToAttributes(ParquetTypes.scala:310)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441)
> at 
> org.apache.spark.sql.parquet.ParquetRelation.(ParquetRelation.scala:66)
> at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:141)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4768) Add Support For Impala Encoded Timestamp (INT96)

2015-01-29 Thread Taiji Okada (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taiji Okada updated SPARK-4768:
---
Attachment: string_timestamp.gz

> Add Support For Impala Encoded Timestamp (INT96)
> 
>
> Key: SPARK-4768
> URL: https://issues.apache.org/jira/browse/SPARK-4768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Pat McDonough
>Priority: Critical
> Attachments: 5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq, 
> string_timestamp.gz
>
>
> Impala is using INT96 for timestamps. Spark SQL should be able to read this 
> data despite the fact that it is not part of the spec.
> Perhaps adding a flag to act like impala when reading parquet (like we do for 
> strings already) would be useful.
> Here's an example of the error you might see:
> {code}
> Caused by: java.lang.RuntimeException: Potential loss of precision: cannot 
> convert INT96
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:61)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:113)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:314)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:311)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToAttributes(ParquetTypes.scala:310)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441)
> at 
> org.apache.spark.sql.parquet.ParquetRelation.(ParquetRelation.scala:66)
> at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:141)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5487) Dockerfile to build spark's custom akka.

2015-01-29 Thread jay vyas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297458#comment-14297458
 ] 

jay vyas edited comment on SPARK-5487 at 1/29/15 7:57 PM:
--

To reproduce this, you can use the following dockerfile.  Hopefully a few minor 
modifications will result in a Dockerfile that we can use to build spark's 
*critical* akka dependency from scratch.  

{noformat}
FROM silarsis/base
RUN apt-get -yq update && apt-get -yq install openjdk-7-jdk
RUN wget -q -O /tmp/sbt.tgz 
http://scalasbt.artifactoryonline.com/scalasbt/sbt-native-packages/org/scala-sbt/sbt/0.12.4/sbt.tgz
 \
&& cd /usr/local \
&& tar zxf /tmp/sbt.tgz
ENV PATH $PATH:/usr/local/sbt/bin
VOLUME /opt/progfun
WORKDIR /opt/progfun
RUN /usr/local/sbt/bin/sbt version
RUN cd /tmp && git clone https://github.com/pwendell/akka && cd /tmp/akka && 
git checkout 2.2.3-shaded-proto
RUN cd /tmp/akka/
RUN cd /tmp/akka && sbt compile
CMD ["/bin/bash"]
{noformat}


was (Author: jayunit100):
To reproduce this, you can use the following dockerfile.

{noformat}
FROM silarsis/base
RUN apt-get -yq update && apt-get -yq install openjdk-7-jdk
RUN wget -q -O /tmp/sbt.tgz 
http://scalasbt.artifactoryonline.com/scalasbt/sbt-native-packages/org/scala-sbt/sbt/0.12.4/sbt.tgz
 \
&& cd /usr/local \
&& tar zxf /tmp/sbt.tgz
ENV PATH $PATH:/usr/local/sbt/bin
VOLUME /opt/progfun
WORKDIR /opt/progfun
RUN /usr/local/sbt/bin/sbt version
RUN cd /tmp && git clone https://github.com/pwendell/akka && cd /tmp/akka && 
git checkout 2.2.3-shaded-proto
RUN cd /tmp/akka/
RUN cd /tmp/akka && sbt compile
CMD ["/bin/bash"]
{noformat}

> Dockerfile to build spark's custom akka.
> 
>
> Key: SPARK-5487
> URL: https://issues.apache.org/jira/browse/SPARK-5487
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: jay vyas
>
> Building spark's custom shaded akka version is tricky.  The code is in 
> https://github.com/pwendell/akka/ (branch = 2.2.3-shaded-proto); however, 
> when attempting to build, I receive some strange errors.
> I've attempted to fork off of a Dockerfile for {{SBT 0.12.4}}, which I'll 
> attach in a snippet just as an example of what we might want to facilitate 
> building the spark specific akka until SPARK-5293 is completed.
> {noformat}
> [info] Compiling 6 Scala sources and 1 Java source to 
> /tmp/akka/akka-multi-node-testkit/target/classes...
> [warn] Class com.google.protobuf.MessageLite not found - continuing with a 
> stub.
> [error] error while loading ProtobufDecoder, class file 
> '/root/.ivy2/cache/io.netty/netty/bundles/netty-3.6.6.Final.jar(org/jboss/netty/handler/codec/protobuf/ProtobufDecoder.class)'
>  is broken
> [error] (class java.lang.NullPointerException/null)
> [error] 
> /tmp/akka/akka-multi-node-testkit/src/main/scala/akka/remote/testconductor/RemoteConnection.scala:24:
>  org.jboss.netty.handler.codec.protobuf.ProtobufDecoder does not have a 
> constructor
> [error] val proto = List(new ProtobufEncoder, new 
> ProtobufDecoder(TestConductorProtocol.Wrapper.getDefaultInstance))
> [error]   ^
> [error] 
> /tmp/akka/akka-multi-node-testkit/src/main/scala/akka/remote/testkit/MultiNodeSpec.scala:267:
>  value await is not a member of 
> scala.concurrent.Future[Iterable[akka.remote.testconductor.RoleName]]
> [error]  Note: implicit method awaitHelper is not applicable here because it 
> comes after the application point and it lacks an explicit result type
> [error]   testConductor.getNodes.await.filterNot(_ == myself).isEmpty
> [error]  ^
> [error] 
> /tmp/akka/akka-multi-node-testkit/src/main/scala/akka/remote/testkit/MultiNodeSpec.scala:354:
>  value await is not a member of scala.concurrent.Future[akka.actor.Address]
> [error]  Note: implicit method awaitHelper is not applicable here because it 
> comes after the application point and it lacks an explicit result type
> [error]   def node(role: RoleName): ActorPath = 
> RootActorPath(testConductor.getAddressFor(role).await)
> [error]   
>   ^
> [warn] one warning found
> [error] four errors found
> [info] Updating {file:/tmp/akka/}akka-docs...
> [info] Done updating.
> [info] Updating {file:/tmp/akka/}akka-contrib...
> [info] Done updating.
> [info] Updating {file:/tmp/akka/}akka-sample-osgi-dining-hakkers-core...
> [info] Done updating.
> [info] Compiling 17 Scala sources to /tmp/akka/akka-cluster/target/classes...
> [error] 
> /tmp/akka/akka-cluster/src/main/scala/akka/cluster/protobuf/ClusterMessageSerializer.scala:59:
>  type mismatch;
> [error]  found   : akka.cluster.protobuf.msg.GossipEnvelope
> [error]  required: com.google.protobuf_spark.MessageLite

[jira] [Commented] (SPARK-5487) Dockerfile to build spark's custom akka.

2015-01-29 Thread jay vyas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297458#comment-14297458
 ] 

jay vyas commented on SPARK-5487:
-

To reproduce this, you can use the following dockerfile.

{noformat}
FROM silarsis/base
RUN apt-get -yq update && apt-get -yq install openjdk-7-jdk
RUN wget -q -O /tmp/sbt.tgz 
http://scalasbt.artifactoryonline.com/scalasbt/sbt-native-packages/org/scala-sbt/sbt/0.12.4/sbt.tgz
 \
&& cd /usr/local \
&& tar zxf /tmp/sbt.tgz
ENV PATH $PATH:/usr/local/sbt/bin
VOLUME /opt/progfun
WORKDIR /opt/progfun
RUN /usr/local/sbt/bin/sbt version
RUN cd /tmp && git clone https://github.com/pwendell/akka && cd /tmp/akka && 
git checkout 2.2.3-shaded-proto
RUN cd /tmp/akka/
RUN cd /tmp/akka && sbt compile
CMD ["/bin/bash"]
{noformat}

> Dockerfile to build spark's custom akka.
> 
>
> Key: SPARK-5487
> URL: https://issues.apache.org/jira/browse/SPARK-5487
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: jay vyas
>
> Building spark's custom shaded akka version is tricky.  The code is in 
> https://github.com/pwendell/akka/ (branch = 2.2.3-shaded-proto); however, 
> when attempting to build, I receive some strange errors.
> I've attempted to fork off of a Dockerfile for {{SBT 0.12.4}}, which I'll 
> attach in a snippet just as an example of what we might want to facilitate 
> building the spark specific akka until SPARK-5293 is completed.
> {noformat}
> [info] Compiling 6 Scala sources and 1 Java source to 
> /tmp/akka/akka-multi-node-testkit/target/classes...
> [warn] Class com.google.protobuf.MessageLite not found - continuing with a 
> stub.
> [error] error while loading ProtobufDecoder, class file 
> '/root/.ivy2/cache/io.netty/netty/bundles/netty-3.6.6.Final.jar(org/jboss/netty/handler/codec/protobuf/ProtobufDecoder.class)'
>  is broken
> [error] (class java.lang.NullPointerException/null)
> [error] 
> /tmp/akka/akka-multi-node-testkit/src/main/scala/akka/remote/testconductor/RemoteConnection.scala:24:
>  org.jboss.netty.handler.codec.protobuf.ProtobufDecoder does not have a 
> constructor
> [error] val proto = List(new ProtobufEncoder, new 
> ProtobufDecoder(TestConductorProtocol.Wrapper.getDefaultInstance))
> [error]   ^
> [error] 
> /tmp/akka/akka-multi-node-testkit/src/main/scala/akka/remote/testkit/MultiNodeSpec.scala:267:
>  value await is not a member of 
> scala.concurrent.Future[Iterable[akka.remote.testconductor.RoleName]]
> [error]  Note: implicit method awaitHelper is not applicable here because it 
> comes after the application point and it lacks an explicit result type
> [error]   testConductor.getNodes.await.filterNot(_ == myself).isEmpty
> [error]  ^
> [error] 
> /tmp/akka/akka-multi-node-testkit/src/main/scala/akka/remote/testkit/MultiNodeSpec.scala:354:
>  value await is not a member of scala.concurrent.Future[akka.actor.Address]
> [error]  Note: implicit method awaitHelper is not applicable here because it 
> comes after the application point and it lacks an explicit result type
> [error]   def node(role: RoleName): ActorPath = 
> RootActorPath(testConductor.getAddressFor(role).await)
> [error]   
>   ^
> [warn] one warning found
> [error] four errors found
> [info] Updating {file:/tmp/akka/}akka-docs...
> [info] Done updating.
> [info] Updating {file:/tmp/akka/}akka-contrib...
> [info] Done updating.
> [info] Updating {file:/tmp/akka/}akka-sample-osgi-dining-hakkers-core...
> [info] Done updating.
> [info] Compiling 17 Scala sources to /tmp/akka/akka-cluster/target/classes...
> [error] 
> /tmp/akka/akka-cluster/src/main/scala/akka/cluster/protobuf/ClusterMessageSerializer.scala:59:
>  type mismatch;
> [error]  found   : akka.cluster.protobuf.msg.GossipEnvelope
> [error]  required: com.google.protobuf_spark.MessageLite
> [error]   case m: GossipEnvelope ? compress(gossipEnvelopeToProto(m))
> [error]  ^
> [error] 
> /tmp/akka/akka-cluster/src/main/scala/akka/cluster/protobuf/ClusterMessageSerializer.scala:61:
>  type mismatch;
> [error]  found   : akka.cluster.protobuf.msg.MetricsGossipEnvelope
> [error]  required: com.google.protobuf_spark.MessageLite
> [error]   case m: MetricsGossipEnvelope ? 
> compress(metricsGossipEnvelopeToProto(m))
> [error]   
>  ^
> [error] 
> /tmp/akka/akka-cluster/src/main/scala/akka/cluster/protobuf/ClusterMessageSerializer.scala:63:
>  type mismatch;
> [error]  found   : akka.cluster.protobuf.msg.Welcome
> [error]  required: com.google.protobuf_spark.MessageLite
> [error]  
