[jira] [Resolved] (SPARK-3673) Move IndexedRDD from a pull request into a separate repository
[ https://issues.apache.org/jira/browse/SPARK-3673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-3673. --- Resolution: Fixed > Move IndexedRDD from a pull request into a separate repository > -- > > Key: SPARK-3673 > URL: https://issues.apache.org/jira/browse/SPARK-3673 > Project: Spark > Issue Type: Sub-task > Components: GraphX, Spark Core >Reporter: Ankur Dave >Assignee: Ankur Dave > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3673) Move IndexedRDD from a pull request into a separate repository
[ https://issues.apache.org/jira/browse/SPARK-3673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298314#comment-14298314 ] Alexander Bezzubov commented on SPARK-3673: --- Looks like this was resolved by https://github.com/amplab/spark-indexedrdd > Move IndexedRDD from a pull request into a separate repository > -- > > Key: SPARK-3673 > URL: https://issues.apache.org/jira/browse/SPARK-3673 > Project: Spark > Issue Type: Sub-task > Components: GraphX, Spark Core >Reporter: Ankur Dave >Assignee: Ankur Dave
[jira] [Created] (SPARK-5494) SparkSqlSerializer Ignores KryoRegistrators
Hamel Ajay Kothari created SPARK-5494: - Summary: SparkSqlSerializer Ignores KryoRegistrators Key: SPARK-5494 URL: https://issues.apache.org/jira/browse/SPARK-5494 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Hamel Ajay Kothari We should make SparkSqlSerializer call {{super.newKryo}} before doing any of its custom setup, in order to make sure it picks up custom KryoRegistrators.
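The fix requested above is an ordering constraint: the subclass must first obtain the fully configured instance from its parent and only then layer on its own registrations. A minimal sketch of the pattern in Python terms (BaseSerializer and SqlSerializer are stand-in names for illustration, not Spark's actual classes):

```python
class BaseSerializer:
    """Stand-in for KryoSerializer: applies user-supplied registrators."""

    def new_kryo(self):
        # In Spark, this is where spark.kryo.registrator classes are applied.
        return {"userRegisteredClass"}


class SqlSerializer(BaseSerializer):
    """Stand-in for SparkSqlSerializer after the proposed fix."""

    def new_kryo(self):
        kryo = super().new_kryo()     # pick up user registrations first...
        kryo.add("sqlSpecificClass")  # ...then add SQL-specific ones
        return kryo
```

The bug is simply doing the second step without the first, which silently drops every user-supplied registration.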
[jira] [Resolved] (SPARK-5322) Add transpose() to BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-5322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5322. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4275 [https://github.com/apache/spark/pull/4275] > Add transpose() to BlockMatrix > -- > > Key: SPARK-5322 > URL: https://issues.apache.org/jira/browse/SPARK-5322 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Burak Yavuz > Fix For: 1.3.0 > > > Once Local matrices have the option to transpose, transposing a BlockMatrix > will be trivial. Again, this will be a flag, which will in the end affect > every SubMatrix in the RDD.
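Why the issue describes this as "trivial": once each local block can transpose itself, transposing the distributed matrix only requires swapping each block's (rowIndex, colIndex) key. A hedged sketch of the idea (not MLlib's implementation; a dict of nested lists stands in for the RDD of blocks):

```python
def local_transpose(block):
    """Transpose one dense local block stored as a list of rows."""
    return [list(col) for col in zip(*block)]


def transpose_block_matrix(blocks):
    """Map ((i, j), block) -> ((j, i), transposed block) over all blocks."""
    return {(j, i): local_transpose(b) for (i, j), b in blocks.items()}
```

No data moves between blocks; only the grid coordinates and the per-block layout change.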
[jira] [Updated] (SPARK-5322) Add transpose() to BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-5322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5322: - Assignee: Burak Yavuz > Add transpose() to BlockMatrix > -- > > Key: SPARK-5322 > URL: https://issues.apache.org/jira/browse/SPARK-5322 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Burak Yavuz >Assignee: Burak Yavuz > Fix For: 1.3.0
[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298192#comment-14298192 ] Michael Hynes commented on SPARK-3080: -- What is the status of this SimpleALS.scala rewrite? Are you planning to merge it into the master branch to replace the current implementation? > ArrayIndexOutOfBoundsException in ALS for Large datasets > > > Key: SPARK-3080 > URL: https://issues.apache.org/jira/browse/SPARK-3080 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.1.0, 1.2.0 >Reporter: Burak Yavuz >Assignee: Xiangrui Meng > > The stack trace is below: > {quote} > java.lang.ArrayIndexOutOfBoundsException: 2716 > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) > scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > > org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > > 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > {quote} > This happened after the dataset was sub-sampled. > Dataset properties: ~12B ratings > Setup: 55 r3.8xlarge ec2 instances
[jira] [Commented] (SPARK-5492) Thread statistics can break with older Hadoop versions
[ https://issues.apache.org/jira/browse/SPARK-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298190#comment-14298190 ] Sandy Ryza commented on SPARK-5492: --- Are you able to provide any more detail on the environment this occurred in? I think all versions of Hadoop that don't expose StatisticsData are also missing a getThreadStatistics method, so they should run into a NoSuchMethodException at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L160 and not make it down to the ClassNotFoundException. It's probably good to guard against the ClassNotFoundException anyway, but not sure how this would come up. > Thread statistics can break with older Hadoop versions > -- > > Key: SPARK-5492 > URL: https://issues.apache.org/jira/browse/SPARK-5492 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Sandy Ryza >Priority: Blocker > > {code} > java.lang.ClassNotFoundException: > org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData > at java.net.URLClassLoader$1.run(URLClassLoader.java:366) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:191) > at > org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatisticsMethod(SparkHadoopUtil.scala:180) > at > org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:139) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$2.apply(NewHadoopRDD.scala:120) > at > 
org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$2.apply(NewHadoopRDD.scala:118) > at scala.Option.orElse(Option.scala:257) > {code} > I think the issue is we need to catch ClassNotFoundException here: > https://github.com/apache/spark/blob/b1b35ca2e440df40b253bf967bb93705d355c1c0/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L144 > However, I'm really confused how this didn't fail our unit tests, since we > explicitly tried to test this.
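The defensive pattern under discussion is: probe for an optional class and method reflectively, and treat both "class missing" and "method missing" as "feature unavailable" rather than letting either exception propagate. A minimal sketch of that pattern in Python terms, where ImportError and AttributeError play the roles of ClassNotFoundException and NoSuchMethodException (the names involved are illustrative, not Spark's code):

```python
import importlib


def try_lookup(module_name, attr_name):
    """Return the named attribute, or None if the module or the attribute
    is absent -- mirroring how a reflective lookup should degrade gracefully
    on Hadoop versions that lack the statistics class or method."""
    try:
        mod = importlib.import_module(module_name)
        return getattr(mod, attr_name)
    except (ImportError, AttributeError):
        # Either failure just means the optional feature is unsupported.
        return None
```

Catching only one of the two failure modes is exactly the gap this bug report describes.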
[jira] [Created] (SPARK-5493) Support proxy users under kerberos
Brock Noland created SPARK-5493: --- Summary: Support proxy users under kerberos Key: SPARK-5493 URL: https://issues.apache.org/jira/browse/SPARK-5493 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Brock Noland When using kerberos, services may want to use spark-submit to submit jobs as a separate user. For example a service like oozie might want to submit jobs as a client user.
[jira] [Commented] (SPARK-5492) Thread statistics can break with older Hadoop versions
[ https://issues.apache.org/jira/browse/SPARK-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298138#comment-14298138 ] Sandy Ryza commented on SPARK-5492: --- Very weird. I'll look into it. Did that come up during a test? > Thread statistics can break with older Hadoop versions > -- > > Key: SPARK-5492 > URL: https://issues.apache.org/jira/browse/SPARK-5492 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Sandy Ryza >Priority: Blocker
[jira] [Assigned] (SPARK-5492) Thread statistics can break with older Hadoop versions
[ https://issues.apache.org/jira/browse/SPARK-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza reassigned SPARK-5492: - Assignee: Sandy Ryza > Thread statistics can break with older Hadoop versions > -- > > Key: SPARK-5492 > URL: https://issues.apache.org/jira/browse/SPARK-5492 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Sandy Ryza >Priority: Blocker
[jira] [Commented] (SPARK-3976) Detect block matrix partitioning schemes
[ https://issues.apache.org/jira/browse/SPARK-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298132#comment-14298132 ] Apache Spark commented on SPARK-3976: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/4286 > Detect block matrix partitioning schemes > > > Key: SPARK-3976 > URL: https://issues.apache.org/jira/browse/SPARK-3976 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Reza Zadeh > > Provide repartitioning methods for block matrices to repartition matrix for > add/multiply of non-identically partitioned matrices
[jira] [Commented] (SPARK-3996) Shade Jetty in Spark deliverables
[ https://issues.apache.org/jira/browse/SPARK-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298128#comment-14298128 ] Apache Spark commented on SPARK-3996: - User 'pwendell' has created a pull request for this issue: https://github.com/apache/spark/pull/4285 > Shade Jetty in Spark deliverables > - > > Key: SPARK-3996 > URL: https://issues.apache.org/jira/browse/SPARK-3996 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Mingyu Kim >Assignee: Patrick Wendell > Fix For: 1.3.0 > > > We'd like to use Spark in a Jetty 9 server, and it's causing a version > conflict. Given that Spark's dependency on Jetty is light, it'd be a good > idea to shade this dependency.
[jira] [Resolved] (SPARK-5462) Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in DataFrames returned from sqlCtx.sql()
[ https://issues.apache.org/jira/browse/SPARK-5462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5462. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Josh Rosen > Catalyst UnresolvedException "Invalid call to qualifiers on unresolved > object" error when accessing fields in DataFrames returned from sqlCtx.sql() > --- > > Key: SPARK-5462 > URL: https://issues.apache.org/jira/browse/SPARK-5462 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Fix For: 1.3.0 > > > When trying to access fields on a Python DataFrame created via inferSchema, I > ran into a confusing Catalyst Py4J error. Here's a reproduction: > {code} > from pyspark import SparkContext > from pyspark.sql import SQLContext, Row > sc = SparkContext("local", "test") > sqlContext = SQLContext(sc) > # Load a text file and convert each line to a Row. > lines = sc.textFile("examples/src/main/resources/people.txt") > parts = lines.map(lambda l: l.split(",")) > people = parts.map(lambda p: Row(name=p[0], age=int(p[1]))) > # Infer the schema, and register the SchemaRDD as a table. > schemaPeople = sqlContext.inferSchema(people) > schemaPeople.registerTempTable("people") > # SQL can be run over SchemaRDDs that have been registered as a table. 
> teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age > <= 19") > print teenagers.name > {code} > This fails with the following error: > {code} > Traceback (most recent call last): > File "/Users/joshrosen/Documents/spark/sqltest.py", line 19, in > print teenagers.name > File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, > in __getattr__ > return Column(self._jdf.apply(name)) > File > "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 538, in __call__ > File > "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", > line 300, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o66.apply. > : org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > qualifiers on unresolved object, tree: 'name > at > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:50) > at > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:46) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:143) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:140) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:140) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126) > at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122) > at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237) > at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {code} > This is distinct from the helpful error message that I get when trying to > access a non-existent column. This error didn't occur when I tried the same > thing with a DataFrame created via jsonRDD.
[jira] [Commented] (SPARK-5492) Thread statistics can break with older Hadoop versions
[ https://issues.apache.org/jira/browse/SPARK-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298075#comment-14298075 ] Patrick Wendell commented on SPARK-5492: /cc [~sandyr] > Thread statistics can break with older Hadoop versions > -- > > Key: SPARK-5492 > URL: https://issues.apache.org/jira/browse/SPARK-5492 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Priority: Blocker
[jira] [Updated] (SPARK-5492) Thread statistics can break with older Hadoop versions
[ https://issues.apache.org/jira/browse/SPARK-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5492: --- Priority: Blocker (was: Major) > Thread statistics can break with older Hadoop versions > -- > > Key: SPARK-5492 > URL: https://issues.apache.org/jira/browse/SPARK-5492 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Priority: Blocker
[jira] [Created] (SPARK-5492) Thread statistics can break with older Hadoop versions
Patrick Wendell created SPARK-5492: -- Summary: Thread statistics can break with older Hadoop versions Key: SPARK-5492 URL: https://issues.apache.org/jira/browse/SPARK-5492 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell {code} java.lang.ClassNotFoundException: org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:191) at org.apache.spark.deploy.SparkHadoopUtil.getFileSystemThreadStatisticsMethod(SparkHadoopUtil.scala:180) at org.apache.spark.deploy.SparkHadoopUtil.getFSBytesReadOnThreadCallback(SparkHadoopUtil.scala:139) at org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$2.apply(NewHadoopRDD.scala:120) at org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$2.apply(NewHadoopRDD.scala:118) at scala.Option.orElse(Option.scala:257) {code} I think the issue is we need to catch ClassNotFoundException here: https://github.com/apache/spark/blob/b1b35ca2e440df40b253bf967bb93705d355c1c0/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L144 However, I'm really confused how this didn't fail our unit tests, since we explicitly tried to test this.
[jira] [Commented] (SPARK-5489) KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create (I)Lscala/runtime/IntRef;
[ https://issues.apache.org/jira/browse/SPARK-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298063#comment-14298063 ] DeepakVohra commented on SPARK-5489: If Scala 2.11.1 is used, scala.Cloneable is not found; it is available in Scala 2.10.4 but not in Scala 2.11.1. > KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create > (I)Lscala/runtime/IntRef; > - > > Key: SPARK-5489 > URL: https://issues.apache.org/jira/browse/SPARK-5489 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.2.0 > Environment: Spark 1.2 > Maven >Reporter: DeepakVohra > > The KMeans clustering generates following error, which also seems to be due > to a version mismatch between the Scala used for compiling Spark and the Scala in the Spark > 1.2 Maven dependency. > Exception in thread "main" java.lang.NoSuchMethodError: > scala.runtime.IntRef.create > (I)Lscala/runtime/IntRef; > at > org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:282) > at > org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:155) > at > org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:132) > at > org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:352) > at > org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:362) > at > org.apache.spark.mllib.clustering.KMeans.train(KMeans.scala) > at > clusterer.kmeans.KMeansClusterer.main(KMeansClusterer.java:35)
[jira] [Commented] (SPARK-5454) [SQL] Self join with ArrayType columns problems
[ https://issues.apache.org/jira/browse/SPARK-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298048#comment-14298048 ] Apache Spark commented on SPARK-5454: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/4284 > [SQL] Self join with ArrayType columns problems > --- > > Key: SPARK-5454 > URL: https://issues.apache.org/jira/browse/SPARK-5454 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Pierre Borckmans > > Weird behaviour when performing self join on a table with some ArrayType > field. (potential bug ?) > I have set up a minimal non working example here: > https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f > In a nutshell, if the ArrayType column used for the pivot is created manually > in the StructType definition, everything works as expected. > However, if the ArrayType pivot column is obtained by a sql query (be it by > using a "array" wrapper, or using a collect_list operator for instance), then > results are completely off.
[jira] [Updated] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1473: - Assignee: (was: Alexander Ulanov) > Feature selection for high dimensional datasets > --- > > Key: SPARK-1473 > URL: https://issues.apache.org/jira/browse/SPARK-1473 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ignacio Zendejas >Priority: Minor > Labels: features > > For classification tasks involving large feature spaces in the order of tens > of thousands or higher (e.g., text classification with n-grams, where n > 1), > it is often useful to rank and filter features that are irrelevant thereby > reducing the feature space by at least one or two orders of magnitude without > impacting performance on key evaluation metrics (accuracy/precision/recall). > A feature evaluation interface which is flexible needs to be designed and at > least two methods should be implemented with Information Gain being a > priority as it has been shown to be amongst the most reliable. > Special consideration should be taken in the design to account for wrapper > methods (see research papers below) which are more practical for lower > dimensional data. > Relevant research: > * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional > likelihood maximisation: a unifying framework for information theoretic > feature selection.*The Journal of Machine Learning Research*, *13*, 27-66. > * Forman, George. "An extensive empirical study of feature selection metrics > for text classification." The Journal of machine learning research 3 (2003): > 1289-1305.
[jira] [Updated] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1473: - Target Version/s: (was: 1.3.0) > Feature selection for high dimensional datasets > --- > > Key: SPARK-1473 > URL: https://issues.apache.org/jira/browse/SPARK-1473 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ignacio Zendejas >Assignee: Alexander Ulanov >Priority: Minor > Labels: features > > For classification tasks involving large feature spaces in the order of tens > of thousands or higher (e.g., text classification with n-grams, where n > 1), > it is often useful to rank and filter features that are irrelevant thereby > reducing the feature space by at least one or two orders of magnitude without > impacting performance on key evaluation metrics (accuracy/precision/recall). > A feature evaluation interface which is flexible needs to be designed and at > least two methods should be implemented with Information Gain being a > priority as it has been shown to be amongst the most reliable. > Special consideration should be taken in the design to account for wrapper > methods (see research papers below) which are more practical for lower > dimensional data. > Relevant research: > * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional > likelihood maximisation: a unifying framework for information theoretic > feature selection.*The Journal of Machine Learning Research*, *13*, 27-66. > * Forman, George. "An extensive empirical study of feature selection metrics > for text classification." The Journal of machine learning research 3 (2003): > 1289-1305. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5491) Chi-square feature selection
Xiangrui Meng created SPARK-5491: Summary: Chi-square feature selection Key: SPARK-5491 URL: https://issues.apache.org/jira/browse/SPARK-5491 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Alexander Ulanov Implement chi-square feature selection. PR: https://github.com/apache/spark/pull/1484 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
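Chi-square feature selection scores each categorical feature by the Pearson chi-squared statistic of its contingency table against the class label; larger statistics mean stronger dependence. A minimal pure-Python sketch for illustration only (the actual implementation is in the linked PR, in Scala):

```python
from collections import Counter

def chi2_statistic(feature, labels):
    """Pearson chi-squared statistic for independence between one
    categorical feature column and the class label."""
    n = len(labels)
    obs = Counter(zip(feature, labels))   # observed cell counts
    f_tot = Counter(feature)              # row totals
    y_tot = Counter(labels)               # column totals
    stat = 0.0
    for x in f_tot:
        for y in y_tot:
            expected = f_tot[x] * y_tot[y] / n
            stat += (obs.get((x, y), 0) - expected) ** 2 / expected
    return stat

def select_top(columns, labels, k):
    """Indices of the k features most dependent on the label."""
    scored = sorted(((chi2_statistic(c, labels), i)
                     for i, c in enumerate(columns)), reverse=True)
    return [i for _, i in scored[:k]]
```

A feature identical to the label gets a strictly positive statistic, while an independent one gets 0, so ranking by the statistic keeps the predictive feature.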
[jira] [Updated] (SPARK-5395) Large number of Python workers causing resource depletion
[ https://issues.apache.org/jira/browse/SPARK-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5395: -- Target Version/s: 1.3.0, 1.2.2 Fix Version/s: 1.3.0 Assignee: Davies Liu Labels: backport-needed (was: ) I've committed Davies' patch (https://github.com/apache/spark/pull/4238) to {{master}} for inclusion in Spark 1.3.0 and tagged it for later backport to Spark 1.2.2. (I'll cherry-pick the commit after we close the 1.2.1 vote). > Large number of Python workers causing resource depletion > - > > Key: SPARK-5395 > URL: https://issues.apache.org/jira/browse/SPARK-5395 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.0, 1.3.0 > Environment: AWS ElasticMapReduce >Reporter: Sven Krasser >Assignee: Davies Liu > Labels: backport-needed > Fix For: 1.3.0 > > > During job execution, a large number of Python workers accumulates, eventually > causing YARN to kill containers for exceeding their memory allocation (in > the case below, that is about 8G for executors plus 6G of overhead per > container). > In this instance, 97 pyspark.daemon processes had accumulated by the time > the container was killed. > {noformat} > 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler > (Logging.scala:logInfo(59)) - Container marked as failed: > container_1421692415636_0052_01_30. Exit status: 143. Diagnostics: > Container [pid=35211,containerID=container_1421692415636_0052_01_30] is > running beyond physical memory limits. Current usage: 14.9 GB of 14.5 GB > physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing > container. 
> Dump of the process-tree for container_1421692415636_0052_01_30 : > |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) > VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE > |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m > pyspark.daemon > |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m > pyspark.daemon > |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m > pyspark.daemon > [...] > {noformat} > The configuration uses 64 containers with 2 cores each. > Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c > Mailing list discussion: > https://www.mail-archive.com/user@spark.apache.org/msg20102.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2199) Distributed probabilistic latent semantic analysis in MLlib
[ https://issues.apache.org/jira/browse/SPARK-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2199: - Target Version/s: 1.4.0 (was: 1.3.0) > Distributed probabilistic latent semantic analysis in MLlib > --- > > Key: SPARK-2199 > URL: https://issues.apache.org/jira/browse/SPARK-2199 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Denis Turdakov >Assignee: Valeriy Avanesov > Labels: features > > Probabilistic latent semantic analysis (PLSA) is a topic model which extracts > topics from a text corpus. PLSA was historically a predecessor of LDA. However, > recent research shows that modifications of PLSA sometimes perform better > than LDA[1]. Furthermore, the most recent paper by the same authors shows that > there is a clear way to extend PLSA to LDA and beyond[2]. > We should implement distributed versions of PLSA. In addition, it should be > possible to easily add user-defined regularizers or combinations of them. We > will implement regularizers that allow us to > * extract sparse topics > * extract human-interpretable topics > * perform semi-supervised training > * sort out non-topic-specific terms. > [1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In > Proceedings of ECIR'13. > [2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive > Regularization for Stochastic Matrix Factorization. > http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
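The EM updates for plain, unregularized PLSA can be sketched in pure Python on a bag-of-words corpus. This is a toy single-machine illustration of the model being discussed, with no Spark distribution and no regularizers; all names are mine:

```python
import random

def _normalize(xs):
    s = sum(xs)
    return [x / s for x in xs]

def plsa(docs, n_topics, vocab_size, iters=50, seed=0):
    """EM for PLSA. docs: list of {word_id: count}. Returns (p_wt, p_td)
    where p_wt[t][w] = p(w|t) and p_td[d][t] = p(t|d)."""
    rng = random.Random(seed)
    # Random positive initialization keeps every posterior well-defined.
    p_wt = [_normalize([rng.random() + 0.1 for _ in range(vocab_size)])
            for _ in range(n_topics)]
    p_td = [_normalize([rng.random() + 0.1 for _ in range(n_topics)])
            for _ in docs]
    for _ in range(iters):
        n_wt = [[0.0] * vocab_size for _ in range(n_topics)]
        n_td = [[0.0] * n_topics for _ in docs]
        for d, doc in enumerate(docs):
            for w, cnt in doc.items():
                # E-step: posterior p(t | d, w) for this token.
                post = _normalize([p_wt[t][w] * p_td[d][t]
                                   for t in range(n_topics)])
                # Accumulate expected counts.
                for t in range(n_topics):
                    n_wt[t][w] += cnt * post[t]
                    n_td[d][t] += cnt * post[t]
        # M-step: re-normalize the expected counts into distributions.
        p_wt = [_normalize(row) for row in n_wt]
        p_td = [_normalize(row) for row in n_td]
    return p_wt, p_td
```

The additive-regularization framework the ticket cites modifies exactly the M-step: a regularizer's gradient is added to the expected counts before normalization, which is why pluggable regularizers fit naturally into this loop.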
[jira] [Updated] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3080: - Target Version/s: (was: 1.3.0) > ArrayIndexOutOfBoundsException in ALS for Large datasets > > > Key: SPARK-3080 > URL: https://issues.apache.org/jira/browse/SPARK-3080 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.1.0, 1.2.0 >Reporter: Burak Yavuz >Assignee: Xiangrui Meng > > The stack trace is below: > {quote} > java.lang.ArrayIndexOutOfBoundsException: 2716 > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) > scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > > org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > {quote} > This happened after the dataset was sub-sampled. > Dataset properties: ~12B ratings > Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3147) Implement A/B testing
[ https://issues.apache.org/jira/browse/SPARK-3147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3147: - Target Version/s: (was: 1.3.0) > Implement A/B testing > - > > Key: SPARK-3147 > URL: https://issues.apache.org/jira/browse/SPARK-3147 > Project: Spark > Issue Type: New Feature > Components: MLlib, Streaming >Reporter: Xiangrui Meng > > A/B testing is widely used to compare online models. We can implement A/B > testing in MLlib and integrate it with Spark Streaming. For example, we have > a PairDStream[String, Double], whose keys are model ids and values are > observations (click or not, or revenue associated with the event). With A/B > testing, we can tell whether one model is significantly better than another > at a certain time. There are some caveats. For example, we should avoid > multiple testing and support A/A testing as a sanity check. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4259) Add Power Iteration Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298030#comment-14298030 ] Xiangrui Meng commented on SPARK-4259: -- [~andrew.musselman] PIC is more or less a spectral clustering algorithm. It should produce similar results when there is a significant gap between the second and the third eigenvalues. If there is no such gap, it creates a weighted combination, which should work well in practice. Feel free to create a new JIRA for the original spectral clustering algorithm. But note that our goal is not to provide reference machine learning implementations. If PIC is an alternative to the original spectral clustering and it is more scalable, we don't want to maintain two implementations. > Add Power Iteration Clustering Algorithm with Gaussian Similarity Function > -- > > Key: SPARK-4259 > URL: https://issues.apache.org/jira/browse/SPARK-4259 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Fan Jiang >Assignee: Fan Jiang > Labels: features > > In recent years, power iteration clustering has become one of the most > popular modern clustering algorithms. It is simple to implement, can be > solved efficiently by standard linear algebra software, and very often > outperforms traditional clustering algorithms such as the k-means algorithm. > Power iteration clustering is a scalable and efficient algorithm for > clustering points given pointwise mutual affinity values. Internally the > algorithm: > * computes the Gaussian distance between all pairs of points and represents > these distances in an affinity matrix; > * calculates a normalized affinity matrix; > * calculates the principal eigenvalue and eigenvector; > * clusters each of the input points according to its principal eigenvector > component value. > Details of this algorithm can be found in [Power Iteration Clustering, Lin and > Cohen|http://www.icml2010.org/papers/387.pdf] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
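The steps above can be sketched in plain Python for a handful of points. This toy version takes a precomputed affinity matrix directly (rather than computing Gaussian similarities), stops the power iteration early as PIC relies on, and splits the points into two clusters by thresholding the pseudo-eigenvector at its mean, a simplification of the 1-D k-means step in the paper:

```python
def power_iteration_clustering(affinity, iters=20):
    """Toy PIC: early-stopped power iteration on the row-normalized
    affinity matrix, then a two-way split on the resulting vector."""
    n = len(affinity)
    # Row-normalize: W = D^{-1} A, so each row sums to 1.
    w = []
    for row in affinity:
        s = sum(row)
        w.append([a / s for a in row])
    # Initialize with the normalized degree vector, as in Lin & Cohen.
    v = [sum(row) for row in affinity]
    total = sum(v)
    v = [x / total for x in v]
    for _ in range(iters):
        v = [sum(w[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(x) for x in v)
        v = [x / norm for x in v]
    # Simplified final step: threshold at the mean instead of running
    # k-means on the one-dimensional embedding.
    mean = sum(v) / n
    return [1 if x >= mean else 0 for x in v]
```

Note that full convergence would give a constant vector (the top eigenvector of a row-stochastic matrix), so the cluster structure lives in the slowly-decaying second eigencomponent; early stopping is essential, not an optimization.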
[jira] [Updated] (SPARK-3147) Implement A/B testing
[ https://issues.apache.org/jira/browse/SPARK-3147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3147: - Target Version/s: 1.4.0 > Implement A/B testing > - > > Key: SPARK-3147 > URL: https://issues.apache.org/jira/browse/SPARK-3147 > Project: Spark > Issue Type: New Feature > Components: MLlib, Streaming >Reporter: Xiangrui Meng > > A/B testing is widely used to compare online models. We can implement A/B > testing in MLlib and integrate it with Spark Streaming. For example, we have > a PairDStream[String, Double], whose keys are model ids and values are > observations (click or not, or revenue associated with the event). With A/B > testing, we can tell whether one model is significantly better than another > at a certain time. There are some caveats. For example, we should avoid > multiple testing and support A/A testing as a sanity check. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
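The kind of significance test involved can be illustrated with a classical two-proportion z-test on click counts from two models. This is a generic statistics sketch, not a proposed MLlib API; it also shows the A/A sanity check mentioned above, since identical rates yield a p-value near 1:

```python
import math

def two_proportion_z(successes_a, trials_a, successes_b, trials_b):
    """Two-sided two-proportion z-test: is model A's success rate
    different from model B's? Returns (z, p_value)."""
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    # Pooled rate under the null hypothesis that both rates are equal.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

In a streaming setting this would be recomputed per batch on the running counts for each model id; the multiple-testing caveat in the description applies because the test is then evaluated repeatedly over time.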
[jira] [Updated] (SPARK-4259) Add Power Iteration Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4259: - Target Version/s: 1.3.0 > Add Power Iteration Clustering Algorithm with Gaussian Similarity Function > -- > > Key: SPARK-4259 > URL: https://issues.apache.org/jira/browse/SPARK-4259 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Fan Jiang >Assignee: Fan Jiang > Labels: features > > In recent years, power iteration clustering has become one of the most > popular modern clustering algorithms. It is simple to implement, can be > solved efficiently by standard linear algebra software, and very often > outperforms traditional clustering algorithms such as the k-means algorithm. > Power iteration clustering is a scalable and efficient algorithm for > clustering points given pointwise mutual affinity values. Internally the > algorithm: > * computes the Gaussian distance between all pairs of points and represents > these distances in an affinity matrix; > * calculates a normalized affinity matrix; > * calculates the principal eigenvalue and eigenvector; > * clusters each of the input points according to its principal eigenvector > component value. > Details of this algorithm can be found in [Power Iteration Clustering, Lin and > Cohen|http://www.icml2010.org/papers/387.pdf] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1405: - Assignee: Joseph K. Bradley (was: Guoqiang Li) > parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib > - > > Key: SPARK-1405 > URL: https://issues.apache.org/jira/browse/SPARK-1405 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xusen Yin >Assignee: Joseph K. Bradley >Priority: Critical > Labels: features > Attachments: performance_comparison.png > > Original Estimate: 336h > Remaining Estimate: 336h > > Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts > topics from a text corpus. Unlike current machine learning algorithms > in MLlib, which use optimization algorithms such as gradient descent, > LDA uses inference algorithms such as Gibbs sampling. > In this PR, I prepare an LDA implementation based on Gibbs sampling, with a > wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), > and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
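A collapsed Gibbs sampler of the kind the ticket refers to can be sketched in pure Python on a toy corpus. This is an illustration of the sampling scheme only, not the PR's code; hyperparameter values and names are mine:

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, iters=200,
              alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA. docs: list of token-id lists.
    Returns the topic assignment for every token position."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]               # doc-topic counts
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                                # topic totals
    z = []
    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
                # Sample a new topic from the collapsed conditional.
                weights = [(n_dk[d][t] + alpha)
                           * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    return z
```

Parallelizing this loop is the hard part the ticket is about: the counts `n_kw` and `n_k` are global state, so distributed samplers partition documents and periodically merge the topic-word counts rather than updating them synchronously.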
[jira] [Reopened] (SPARK-3996) Shade Jetty in Spark deliverables
[ https://issues.apache.org/jira/browse/SPARK-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-3996: This was causing compiler failures in the master build, so I reverted it. I think it's the same issue we had with the guava patch, so I just need to go and add explicit dependencies. > Shade Jetty in Spark deliverables > - > > Key: SPARK-3996 > URL: https://issues.apache.org/jira/browse/SPARK-3996 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Mingyu Kim >Assignee: Patrick Wendell > Fix For: 1.3.0 > > > We'd like to use Spark in a Jetty 9 server, and it's causing a version > conflict. Given that Spark's dependency on Jetty is light, it'd be a good > idea to shade this dependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4349) Spark driver hangs on sc.parallelize() if exception is thrown during serialization
[ https://issues.apache.org/jira/browse/SPARK-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298017#comment-14298017 ] Matt Cheah edited comment on SPARK-4349 at 1/30/15 1:12 AM: Whoops, this was fixed by SPARK-4737. was (Author: mcheah): Whoops, this was fixed by SPARK-4737. Someone want to close this? > Spark driver hangs on sc.parallelize() if exception is thrown during > serialization > -- > > Key: SPARK-4349 > URL: https://issues.apache.org/jira/browse/SPARK-4349 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Matt Cheah >Priority: Critical > > Executing the following in the Spark Shell will lead to the Spark Shell > hanging after a stack trace is printed. The serializer is set to the Kryo > serializer. > {code} > scala> import com.esotericsoftware.kryo.io.Input > import com.esotericsoftware.kryo.io.Input > scala> import com.esotericsoftware.kryo.io.Output > import com.esotericsoftware.kryo.io.Output > scala> class MyKryoSerializable extends > com.esotericsoftware.kryo.KryoSerializable { def write (kryo: > com.esotericsoftware.kryo.Kryo, output: Output) { throw new > com.esotericsoftware.kryo.KryoException; } ; def read (kryo: > com.esotericsoftware.kryo.Kryo, input: Input) { throw new > com.esotericsoftware.kryo.KryoException; } } > defined class MyKryoSerializable > scala> sc.parallelize(Seq(new MyKryoSerializable, new > MyKryoSerializable)).collect > {code} > A stack trace is printed during serialization as expected, but another stack > trace is printed afterwards, indicating that the driver can't recover: > {code} > 14/11/11 14:10:03 ERROR OneForOneStrategy: actor name [ExecutorActor] is not > unique! 
> akka.actor.PostRestartException: exception post restart (class > java.io.IOException) > at > akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:249) > at > akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:247) > at > akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:302) > at > akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:297) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at > akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:247) > at > akka.actor.dungeon.FaultHandling$class.faultRecreate(FaultHandling.scala:76) > at akka.actor.ActorCell.faultRecreate(ActorCell.scala:369) > at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:459) > at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) > at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: akka.actor.InvalidActorNameException: actor name [ExecutorActor] > is not unique! 
> at > akka.actor.dungeon.ChildrenContainer$NormalChildrenContainer.reserve(ChildrenContainer.scala:130) > at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77) > at akka.actor.ActorCell.reserveChild(ActorCell.scala:369) > at akka.actor.dungeon.Children$class.makeChild(Children.scala:202) > at akka.actor.dungeon.Children$class.attachChild(Children.scala:42) > at akka.actor.ActorCell.attachChild(ActorCell.scala:369) > at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:552) > at org.apache.spark.executor.Executor.(Executor.scala:97) > at > org.apache.spark.scheduler.local.LocalActor.(LocalBackend.scala:53) > at > org.apache.spark.scheduler.local.LocalBackend$$anonfun$start$1.apply(LocalBackend.scala:96) > at > org.apache.spark.scheduler.local.LocalBackend$$anonfun$start$1.apply(LocalBackend.scala:96) > at akka.actor.TypedCreatorFunctionConsumer.produce(Props.scala:343) > at akka.actor.Props.newActor(Props.scala:252) > at akka.actor.ActorCell.newActor(ActorCell.scala:552) >
[jira] [Updated] (SPARK-5399) tree Losses strings should match loss names
[ https://issues.apache.org/jira/browse/SPARK-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5399: - Assignee: Kai Sasaki > tree Losses strings should match loss names > --- > > Key: SPARK-5399 > URL: https://issues.apache.org/jira/browse/SPARK-5399 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0, 1.2.1 >Reporter: Joseph K. Bradley >Assignee: Kai Sasaki >Priority: Minor > > tree.loss.Losses.fromString expects certain String names for losses. These > do not match the names of the loss classes but should. I believe these > strings were the original names of the losses, and we forgot to correct the > strings when we renamed the losses. > Currently: > {code} > case "leastSquaresError" => SquaredError > case "leastAbsoluteError" => AbsoluteError > case "logLoss" => LogLoss > {code} > Proposed: > {code} > case "SquaredError" => SquaredError > case "AbsoluteError" => AbsoluteError > case "LogLoss" => LogLoss > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
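The proposed fix amounts to making the accepted strings identical to the loss class names. A hypothetical Python sketch of the idea (the real code is Scala) that derives the lookup table from the class names themselves, so the two can never drift apart again:

```python
# Stand-ins for the loss classes; in MLlib these are Scala objects.
class SquaredError:
    pass

class AbsoluteError:
    pass

class LogLoss:
    pass

# Build the table from __name__, so the accepted strings always match
# the class names by construction.
_LOSSES = {cls.__name__: cls for cls in (SquaredError, AbsoluteError, LogLoss)}

def loss_from_string(name):
    """Resolve a loss by its class name; reject legacy spellings."""
    try:
        return _LOSSES[name]
    except KeyError:
        raise ValueError(
            "Unknown loss: %r; expected one of %s" % (name, sorted(_LOSSES)))
```

With this construction, "LogLoss" resolves while the legacy "logLoss" spelling fails loudly with the list of valid names, which is friendlier than a silent match-error.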
[jira] [Closed] (SPARK-4349) Spark driver hangs on sc.parallelize() if exception is thrown during serialization
[ https://issues.apache.org/jira/browse/SPARK-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Cheah closed SPARK-4349. - Resolution: Fixed > Spark driver hangs on sc.parallelize() if exception is thrown during > serialization > -- > > Key: SPARK-4349 > URL: https://issues.apache.org/jira/browse/SPARK-4349 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Matt Cheah >Priority: Critical > > Executing the following in the Spark Shell will lead to the Spark Shell > hanging after a stack trace is printed. The serializer is set to the Kryo > serializer. > {code} > scala> import com.esotericsoftware.kryo.io.Input > import com.esotericsoftware.kryo.io.Input > scala> import com.esotericsoftware.kryo.io.Output > import com.esotericsoftware.kryo.io.Output > scala> class MyKryoSerializable extends > com.esotericsoftware.kryo.KryoSerializable { def write (kryo: > com.esotericsoftware.kryo.Kryo, output: Output) { throw new > com.esotericsoftware.kryo.KryoException; } ; def read (kryo: > com.esotericsoftware.kryo.Kryo, input: Input) { throw new > com.esotericsoftware.kryo.KryoException; } } > defined class MyKryoSerializable > scala> sc.parallelize(Seq(new MyKryoSerializable, new > MyKryoSerializable)).collect > {code} > A stack trace is printed during serialization as expected, but another stack > trace is printed afterwards, indicating that the driver can't recover: > {code} > 14/11/11 14:10:03 ERROR OneForOneStrategy: actor name [ExecutorActor] is not > unique! 
> akka.actor.PostRestartException: exception post restart (class > java.io.IOException) > at > akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:249) > at > akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:247) > at > akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:302) > at > akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:297) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at > akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:247) > at > akka.actor.dungeon.FaultHandling$class.faultRecreate(FaultHandling.scala:76) > at akka.actor.ActorCell.faultRecreate(ActorCell.scala:369) > at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:459) > at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) > at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: akka.actor.InvalidActorNameException: actor name [ExecutorActor] > is not unique! 
> at > akka.actor.dungeon.ChildrenContainer$NormalChildrenContainer.reserve(ChildrenContainer.scala:130) > at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77) > at akka.actor.ActorCell.reserveChild(ActorCell.scala:369) > at akka.actor.dungeon.Children$class.makeChild(Children.scala:202) > at akka.actor.dungeon.Children$class.attachChild(Children.scala:42) > at akka.actor.ActorCell.attachChild(ActorCell.scala:369) > at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:552) > at org.apache.spark.executor.Executor.(Executor.scala:97) > at > org.apache.spark.scheduler.local.LocalActor.(LocalBackend.scala:53) > at > org.apache.spark.scheduler.local.LocalBackend$$anonfun$start$1.apply(LocalBackend.scala:96) > at > org.apache.spark.scheduler.local.LocalBackend$$anonfun$start$1.apply(LocalBackend.scala:96) > at akka.actor.TypedCreatorFunctionConsumer.produce(Props.scala:343) > at akka.actor.Props.newActor(Props.scala:252) > at akka.actor.ActorCell.newActor(ActorCell.scala:552) > at > akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:234) > ... 11 more > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --
[jira] [Commented] (SPARK-4349) Spark driver hangs on sc.parallelize() if exception is thrown during serialization
[ https://issues.apache.org/jira/browse/SPARK-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298017#comment-14298017 ] Matt Cheah commented on SPARK-4349: --- Whoops, this was fixed by SPARK-4737. Someone want to close this? > Spark driver hangs on sc.parallelize() if exception is thrown during > serialization > -- > > Key: SPARK-4349 > URL: https://issues.apache.org/jira/browse/SPARK-4349 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Matt Cheah >Priority: Critical > > Executing the following in the Spark Shell will lead to the Spark Shell > hanging after a stack trace is printed. The serializer is set to the Kryo > serializer. > {code} > scala> import com.esotericsoftware.kryo.io.Input > import com.esotericsoftware.kryo.io.Input > scala> import com.esotericsoftware.kryo.io.Output > import com.esotericsoftware.kryo.io.Output > scala> class MyKryoSerializable extends > com.esotericsoftware.kryo.KryoSerializable { def write (kryo: > com.esotericsoftware.kryo.Kryo, output: Output) { throw new > com.esotericsoftware.kryo.KryoException; } ; def read (kryo: > com.esotericsoftware.kryo.Kryo, input: Input) { throw new > com.esotericsoftware.kryo.KryoException; } } > defined class MyKryoSerializable > scala> sc.parallelize(Seq(new MyKryoSerializable, new > MyKryoSerializable)).collect > {code} > A stack trace is printed during serialization as expected, but another stack > trace is printed afterwards, indicating that the driver can't recover: > {code} > 14/11/11 14:10:03 ERROR OneForOneStrategy: actor name [ExecutorActor] is not > unique! 
> akka.actor.PostRestartException: exception post restart (class > java.io.IOException) > at > akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:249) > at > akka.actor.dungeon.FaultHandling$$anonfun$6.apply(FaultHandling.scala:247) > at > akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:302) > at > akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1.applyOrElse(FaultHandling.scala:297) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at > akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:247) > at > akka.actor.dungeon.FaultHandling$class.faultRecreate(FaultHandling.scala:76) > at akka.actor.ActorCell.faultRecreate(ActorCell.scala:369) > at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:459) > at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) > at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: akka.actor.InvalidActorNameException: actor name [ExecutorActor] > is not unique! 
> at > akka.actor.dungeon.ChildrenContainer$NormalChildrenContainer.reserve(ChildrenContainer.scala:130) > at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77) > at akka.actor.ActorCell.reserveChild(ActorCell.scala:369) > at akka.actor.dungeon.Children$class.makeChild(Children.scala:202) > at akka.actor.dungeon.Children$class.attachChild(Children.scala:42) > at akka.actor.ActorCell.attachChild(ActorCell.scala:369) > at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:552) > at org.apache.spark.executor.Executor.(Executor.scala:97) > at > org.apache.spark.scheduler.local.LocalActor.(LocalBackend.scala:53) > at > org.apache.spark.scheduler.local.LocalBackend$$anonfun$start$1.apply(LocalBackend.scala:96) > at > org.apache.spark.scheduler.local.LocalBackend$$anonfun$start$1.apply(LocalBackend.scala:96) > at akka.actor.TypedCreatorFunctionConsumer.produce(Props.scala:343) > at akka.actor.Props.newActor(Props.scala:252) > at akka.actor.ActorCell.newActor(ActorCell.scala:552) > at > akka.actor.dungeon.FaultHandling$class.finishRecreate(FaultHandling.scala:234) > ... 11 more
[jira] [Updated] (SPARK-4118) Create python bindings for Streaming KMeans
[ https://issues.apache.org/jira/browse/SPARK-4118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4118: - Target Version/s: (was: 1.3.0) > Create python bindings for Streaming KMeans > --- > > Key: SPARK-4118 > URL: https://issues.apache.org/jira/browse/SPARK-4118 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark, Streaming >Reporter: Anant Daksh Asthana >Priority: Minor > > Create Python bindings for Streaming K-means > This is in reference to https://issues.apache.org/jira/browse/SPARK-3254 > which adds Streaming K-means functionality to MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5101) Add common ML math functions
[ https://issues.apache.org/jira/browse/SPARK-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5101: - Target Version/s: (was: 1.3.0) > Add common ML math functions > > > Key: SPARK-5101 > URL: https://issues.apache.org/jira/browse/SPARK-5101 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: DB Tsai >Priority: Minor > > We can add common ML math functions to MLlib. It may be a little tricky to > implement those functions in a numerically stable way. For example, > {code} > math.log(1 + math.exp(x)) > {code} > should be implemented as > {code} > if (x > 0) { > x + math.log1p(math.exp(-x)) > } else { > math.log1p(math.exp(x)) > } > {code} > It becomes hard to maintain if we have multiple copies of the correct > implementation in the codebase. A good place for those functions could be > `mllib.util.MathFunctions`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
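The branch logic quoted in SPARK-5101 above is easy to get wrong, so here is a minimal Python sketch of the same numerically stable computation (the function name `log1p_exp` is illustrative; the ticket leaves naming, e.g. `mllib.util.MathFunctions`, open):

```python
import math

def log1p_exp(x: float) -> float:
    """Numerically stable log(1 + exp(x)).

    For large positive x, exp(x) overflows, so factor x out:
    log(1 + exp(x)) = x + log(1 + exp(-x)).
    math.log1p keeps full accuracy when its argument is tiny.
    """
    if x > 0:
        return x + math.log1p(math.exp(-x))
    else:
        return math.log1p(math.exp(x))
```

The naive `math.log(1 + math.exp(x))` raises `OverflowError` for `x` around 710, while the rewritten form simply returns `x` (the `log1p` term underflows to zero).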
[jira] [Updated] (SPARK-3188) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)
[ https://issues.apache.org/jira/browse/SPARK-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3188: - Target Version/s: 1.4.0 (was: 1.3.0) > Add Robust Regression Algorithm with Tukey bisquare weight function > (Biweight Estimates) > -- > > Key: SPARK-3188 > URL: https://issues.apache.org/jira/browse/SPARK-3188 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Fan Jiang >Assignee: Fan Jiang >Priority: Minor > Labels: features > Original Estimate: 0h > Remaining Estimate: 0h > > Linear least-squares estimates assume the errors are normally distributed and > can behave badly when the errors are heavy-tailed. In practice we encounter > various types of data, so we need robust regression, which employs a > fitting criterion that is not as vulnerable as least squares. > The Tukey bisquare weight function, also referred to as the biweight > function, produces an M-estimator that is more resistant to regression > outliers than the Huber M-estimator (Andersen 2008: 19). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
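For reference, the bisquare weight described above assigns weight zero to residuals beyond a cutoff, which is what makes it resistant to gross outliers. A short Python sketch (the tuning constant 4.685 is the conventional choice for roughly 95% efficiency under normal errors; this is not code from the proposed patch):

```python
def tukey_bisquare_weight(residual: float, c: float = 4.685) -> float:
    """Tukey bisquare (biweight) function: w(r) = (1 - (r/c)^2)^2 for
    |r| <= c, and 0 otherwise, so extreme residuals get no weight at all."""
    u = residual / c
    if abs(u) > 1.0:
        return 0.0
    return (1.0 - u * u) ** 2
```

Compare the Huber weight, which merely down-weights large residuals; the bisquare rejects them entirely once they pass the cutoff `c`.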
[jira] [Updated] (SPARK-5012) Python API for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5012: - Priority: Critical (was: Major) > Python API for Gaussian Mixture Model > - > > Key: SPARK-5012 > URL: https://issues.apache.org/jira/browse/SPARK-5012 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Meethu Mathew >Priority: Critical > > Add Python API for the Scala implementation of GMM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5094) Python API for gradient-boosted trees
[ https://issues.apache.org/jira/browse/SPARK-5094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5094: - Priority: Critical (was: Major) > Python API for gradient-boosted trees > - > > Key: SPARK-5094 > URL: https://issues.apache.org/jira/browse/SPARK-5094 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Kazuki Taniguchi >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
[ https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4240: - Target Version/s: (was: 1.3.0) > Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy. > > > Key: SPARK-4240 > URL: https://issues.apache.org/jira/browse/SPARK-4240 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Sung Chung > > The gradient boosting as currently implemented estimates the loss-gradient in > each iteration using regression trees. At every iteration, the regression > trees are trained/split to minimize predicted gradient variance. > Additionally, the terminal node predictions are computed to minimize the > prediction variance. > However, such predictions won't be optimal for loss functions other than the > mean-squared error. The TreeBoosting refinement can help mitigate this issue > by modifying terminal node prediction values so that those predictions would > directly minimize the actual loss function. Although this still doesn't > change the fact that the tree splits were done through variance reduction, it > should still lead to improvement in gradient estimations, and thus better > performance. > The details of this can be found in the R vignette. This paper also shows how > to refine the terminal node predictions. > http://www.saedsayad.com/docs/gbm2.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4036: - Assignee: Kai Sasaki > Add Conditional Random Fields (CRF) algorithm to Spark MLlib > > > Key: SPARK-4036 > URL: https://issues.apache.org/jira/browse/SPARK-4036 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Kai Sasaki > > Conditional random fields (CRFs) are a class of statistical modelling methods > often applied in pattern recognition and machine learning, where they are > used for structured prediction. > The paper: > http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298012#comment-14298012 ] Xiangrui Meng commented on SPARK-4036: -- [~lewuathe] I've assigned this ticket to you. Before sending any PR, could you share a design doc first? This is a broad topic; we should discuss algorithm choices, complexity and scalability, and public APIs before digging into the implementation. > Add Conditional Random Fields (CRF) algorithm to Spark MLlib > > > Key: SPARK-4036 > URL: https://issues.apache.org/jira/browse/SPARK-4036 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li > > Conditional random fields (CRFs) are a class of statistical modelling methods > often applied in pattern recognition and machine learning, where they are > used for structured prediction. > The paper: > http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4036: - Target Version/s: (was: 1.3.0) > Add Conditional Random Fields (CRF) algorithm to Spark MLlib > > > Key: SPARK-4036 > URL: https://issues.apache.org/jira/browse/SPARK-4036 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Kai Sasaki > > Conditional random fields (CRFs) are a class of statistical modelling methods > often applied in pattern recognition and machine learning, where they are > used for structured prediction. > The paper: > http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator
[ https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3181: - Target Version/s: 1.4.0 (was: 1.3.0) > Add Robust Regression Algorithm with Huber Estimator > > > Key: SPARK-3181 > URL: https://issues.apache.org/jira/browse/SPARK-3181 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Fan Jiang >Assignee: Fan Jiang > Labels: features > Original Estimate: 0h > Remaining Estimate: 0h > > Linear least-squares estimates assume the errors are normally distributed and > can behave badly when the errors are heavy-tailed. In practice we encounter > various types of data, so we need robust regression, which employs a > fitting criterion that is not as vulnerable as least squares. > In 1973, Huber introduced M-estimation ("maximum likelihood type" estimation) > for regression. The method is resistant to outliers in the response variable > and has been widely used. > The new feature for MLlib will contain 3 new files > /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala > /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala > /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala > and one new class HuberRobustGradient in > /main/scala/org/apache/spark/mllib/optimization/Gradient.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
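The Huber criterion described above is quadratic for small residuals and linear beyond a threshold delta, which caps the influence of outliers. A Python sketch of the standard textbook loss (not taken from the proposed RobustRegression.scala; the default delta 1.345 is the usual tuning constant for ~95% efficiency under normal errors):

```python
def huber_loss(residual: float, delta: float = 1.345) -> float:
    """Huber loss: r^2 / 2 inside the threshold, delta * (|r| - delta / 2)
    outside, so large residuals grow linearly instead of quadratically.
    The two branches agree at |r| == delta, so the loss is continuous."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)
```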
[jira] [Updated] (SPARK-5486) Add validate function for BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5486: - Priority: Major (was: Critical) > Add validate function for BlockMatrix > - > > Key: SPARK-5486 > URL: https://issues.apache.org/jira/browse/SPARK-5486 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Burak Yavuz > > BlockMatrix needs a validate method to make debugging easy for users. > It will be an expensive method to perform, but it would be useful for users > to know why `multiply` or `add` didn't work properly. > Things to validate: > - MatrixBlocks that are not on the edges should have the dimensions > `rowsPerBlock` and `colsPerBlock`. > - There should be at most one block for each index -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
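The two checks listed in the SPARK-5486 description can be sketched independently of Spark; here `blocks` is a plain list of `((row_block, col_block), (n_rows, n_cols))` pairs standing in for BlockMatrix's internal block RDD (all names hypothetical, not Spark's API):

```python
def validate_blocks(blocks, rows_per_block, cols_per_block,
                    num_row_blocks, num_col_blocks):
    """Raise if a block index appears more than once, or if a block that is
    not on the right/bottom edge deviates from the declared block size."""
    seen = set()
    for (i, j), (n_rows, n_cols) in blocks:
        if (i, j) in seen:
            raise ValueError(f"more than one block at index ({i}, {j})")
        seen.add((i, j))
        # Only edge blocks may be ragged; interior blocks must be full-sized.
        if i < num_row_blocks - 1 and n_rows != rows_per_block:
            raise ValueError(f"block ({i}, {j}) has {n_rows} rows, "
                             f"expected {rows_per_block}")
        if j < num_col_blocks - 1 and n_cols != cols_per_block:
            raise ValueError(f"block ({i}, {j}) has {n_cols} cols, "
                             f"expected {cols_per_block}")
```

As the ticket notes, the real method would be expensive (it has to touch every block), but failing fast with a message like these is far friendlier than a shape error deep inside `multiply` or `add`.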
[jira] [Commented] (SPARK-5420) Cross-language load/store functions for creating and saving DataFrames
[ https://issues.apache.org/jira/browse/SPARK-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297987#comment-14297987 ] Michael Armbrust commented on SPARK-5420: - Here are the dimensions that I think we need to consider: ErrorIfExisting, Overwrite, Append. Create a temp table, metastore table, or no table. Specify a data source name, or use a default (from a config option that defaults to parquet). Reynold also suggested a shorthand for working with file-based data sources that obviates the need to do "path" -> path. > Cross-language load/store functions for creating and saving DataFrames > -- > > Key: SPARK-5420 > URL: https://issues.apache.org/jira/browse/SPARK-5420 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Patrick Wendell >Assignee: Yin Huai >Priority: Blocker > > We should have standard APIs for loading or saving a table from a data > store. Per comment discussion: > {code} > def loadData(datasource: String, parameters: Map[String, String]): DataFrame > def loadData(datasource: String, parameters: java.util.Map[String, String]): > DataFrame > def storeData(datasource: String, parameters: Map[String, String]): DataFrame > def storeData(datasource: String, parameters: java.util.Map[String, String]): > DataFrame > {code} > Python should have this too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
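To make that design space concrete, here is a toy Python sketch of the save-mode dimension from the comment above: purely illustrative pseudocode for the discussion, not the API Spark ended up shipping (the `store_data` name and the parquet default merely mirror the proposal):

```python
from enum import Enum

class SaveMode(Enum):
    ERROR_IF_EXISTS = "error_if_exists"
    OVERWRITE = "overwrite"
    APPEND = "append"

DEFAULT_SOURCE = "parquet"  # stand-in for the proposed config option

def store_data(existing_tables, table, mode, source=None):
    """Resolve one (mode, source) combination into a concrete action,
    showing how the three modes interact with an existing table."""
    source = source or DEFAULT_SOURCE
    if table not in existing_tables:
        return ("create", source)
    if mode is SaveMode.ERROR_IF_EXISTS:
        raise ValueError(f"table {table!r} already exists")
    if mode is SaveMode.OVERWRITE:
        return ("overwrite", source)
    return ("append", source)
```

The temp-table/metastore-table axis and the `"path" -> path` shorthand would be further parameters on top of this core dispatch.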
[jira] [Updated] (SPARK-5472) Add support for reading from and writing to a JDBC database
[ https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5472: Priority: Blocker (was: Minor) > Add support for reading from and writing to a JDBC database > --- > > Key: SPARK-5472 > URL: https://issues.apache.org/jira/browse/SPARK-5472 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Tor Myklebust >Priority: Blocker > > It would be nice to be able to make a table in a JDBC database appear as a > table in Spark SQL. This would let users, for instance, perform a JOIN > between a DataFrame in Spark SQL with a table in a Postgres database. > It might also be nice to be able to go the other direction -- save a > DataFrame to a database -- for instance in an ETL job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5472) Add support for reading from and writing to a JDBC database
[ https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5472: Assignee: Tor Myklebust > Add support for reading from and writing to a JDBC database > --- > > Key: SPARK-5472 > URL: https://issues.apache.org/jira/browse/SPARK-5472 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Tor Myklebust >Assignee: Tor Myklebust >Priority: Blocker > > It would be nice to be able to make a table in a JDBC database appear as a > table in Spark SQL. This would let users, for instance, perform a JOIN > between a DataFrame in Spark SQL with a table in a Postgres database. > It might also be nice to be able to go the other direction -- save a > DataFrame to a database -- for instance in an ETL job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5472) Add support for reading from and writing to a JDBC database
[ https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5472: Target Version/s: 1.3.0 > Add support for reading from and writing to a JDBC database > --- > > Key: SPARK-5472 > URL: https://issues.apache.org/jira/browse/SPARK-5472 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Tor Myklebust >Priority: Blocker > > It would be nice to be able to make a table in a JDBC database appear as a > table in Spark SQL. This would let users, for instance, perform a JOIN > between a DataFrame in Spark SQL with a table in a Postgres database. > It might also be nice to be able to go the other direction -- save a > DataFrame to a database -- for instance in an ETL job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4959) Attributes are case sensitive when using a select query from a projection
[ https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4959. - Resolution: Fixed > Attributes are case sensitive when using a select query from a projection > - > > Key: SPARK-4959 > URL: https://issues.apache.org/jira/browse/SPARK-4959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Andy Konwinski >Assignee: Cheng Hao >Priority: Blocker > Labels: backport-needed > Fix For: 1.3.0, 1.2.1 > > > Per [~marmbrus], see this line of code, where we should be using an attribute > map > > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L147 > To reproduce, i ran the following in the Spark shell: > {code} > import sqlContext._ > sql("drop table if exists test") > sql("create table test (col1 string)") > sql("""insert into table test select "hi" from prejoined limit 1""") > val projection = "col1".attr.as(Symbol("CaseSensitiveColName")) :: > "col1".attr.as(Symbol("CaseSensitiveColName2")) :: Nil > sqlContext.table("test").select(projection:_*).registerTempTable("test2") > # This succeeds. 
> sql("select CaseSensitiveColName from test2").first() > # This fails with java.util.NoSuchElementException: key not found: > casesensitivecolname#23046 > sql("select casesensitivecolname from test2").first() > {code} > The full stack trace printed for the final command that is failing: > {code} > java.util.NoSuchElementException: key not found: casesensitivecolname#23046 > at scala.collection.MapLike$class.default(MapLike.scala:228) > at > org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at > org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.hive.execution.HiveTableScan.(HiveTableScan.scala:57) > at > org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221) > at > org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221) > at > org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:378) > at > org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:217) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) > at > org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:285) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) > at > org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418) > at > org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416) > at > org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422) > at > org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422) > at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444) > at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:446) > at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:108) > at org.apache.spark.rdd.RDD.first(RDD.scala:1093) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscri
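The fix hinted at in the SPARK-4959 description (use an attribute map rather than a plain name lookup) amounts to resolving column names under the analyzer's case-insensitivity rules. A toy Python illustration of why the plain lookup fails (all names hypothetical, not Catalyst code):

```python
def resolve_column(attributes, name, case_sensitive=False):
    """Resolve a column name against a dict of known attributes.

    With case_sensitive=True, a plain dict lookup reproduces the bug:
    'casesensitivecolname' misses 'CaseSensitiveColName' and raises
    KeyError. Folding case on both sides mirrors what a
    case-insensitive attribute map should do.
    """
    if case_sensitive:
        return attributes[name]
    folded = {k.lower(): v for k, v in attributes.items()}
    return folded[name.lower()]
```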
[jira] [Updated] (SPARK-3778) newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn
[ https://issues.apache.org/jira/browse/SPARK-3778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3778: --- Priority: Critical (was: Major) > newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn > - > > Key: SPARK-3778 > URL: https://issues.apache.org/jira/browse/SPARK-3778 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Critical > > The newAPIHadoopRDD routine doesn't properly add the credentials to the conf > to be able to access secure hdfs. > Note that newAPIHadoopFile does handle these because the > org.apache.hadoop.mapreduce.Job automatically adds it for you. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3778) newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn
[ https://issues.apache.org/jira/browse/SPARK-3778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3778: --- Target Version/s: 1.3.0 (was: 1.1.1, 1.2.0) > newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn > - > > Key: SPARK-3778 > URL: https://issues.apache.org/jira/browse/SPARK-3778 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > > The newAPIHadoopRDD routine doesn't properly add the credentials to the conf > to be able to access secure hdfs. > Note that newAPIHadoopFile does handle these because the > org.apache.hadoop.mapreduce.Job automatically adds it for you. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3996) Shade Jetty in Spark deliverables
[ https://issues.apache.org/jira/browse/SPARK-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3996. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Patrick Wendell (was: Matthew Cheah) Okay we merged this into master, let's see how it goes. > Shade Jetty in Spark deliverables > - > > Key: SPARK-3996 > URL: https://issues.apache.org/jira/browse/SPARK-3996 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Mingyu Kim >Assignee: Patrick Wendell > Fix For: 1.3.0 > > > We'd like to use Spark in a Jetty 9 server, and it's causing a version > conflict. Given that Spark's dependency on Jetty is light, it'd be a good > idea to shade this dependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5462) Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in DataFrames returned from sqlCtx.sql()
[ https://issues.apache.org/jira/browse/SPARK-5462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297956#comment-14297956 ] Apache Spark commented on SPARK-5462: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/4282 > Catalyst UnresolvedException "Invalid call to qualifiers on unresolved > object" error when accessing fields in DataFrames returned from sqlCtx.sql() > --- > > Key: SPARK-5462 > URL: https://issues.apache.org/jira/browse/SPARK-5462 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Josh Rosen >Priority: Blocker > > When trying to access fields on a Python DataFrame created via inferSchema, I > ran into a confusing Catalyst Py4J error. Here's a reproduction: > {code} > from pyspark import SparkContext > from pyspark.sql import SQLContext, Row > sc = SparkContext("local", "test") > sqlContext = SQLContext(sc) > # Load a text file and convert each line to a Row. > lines = sc.textFile("examples/src/main/resources/people.txt") > parts = lines.map(lambda l: l.split(",")) > people = parts.map(lambda p: Row(name=p[0], age=int(p[1]))) > # Infer the schema, and register the SchemaRDD as a table. > schemaPeople = sqlContext.inferSchema(people) > schemaPeople.registerTempTable("people") > # SQL can be run over SchemaRDDs that have been registered as a table. 
> teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age > <= 19") > print teenagers.name > {code} > This fails with the following error: > {code} > Traceback (most recent call last): > File "/Users/joshrosen/Documents/spark/sqltest.py", line 19, in > print teenagers.name > File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, > in __getattr__ > return Column(self._jdf.apply(name)) > File > "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 538, in __call__ > File > "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", > line 300, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o66.apply. > : org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > qualifiers on unresolved object, tree: 'name > at > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:50) > at > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:46) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:143) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:140) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:140) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126) > at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122) > at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237) > at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {code} > This is distinct from the helpful error message that I get when trying to > access a non-existent column. This error didn't occur when I tried the same > thing with a DataFrame created via jsonRDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional
[jira] [Commented] (SPARK-5489) KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create (I)Lscala/runtime/IntRef;
[ https://issues.apache.org/jira/browse/SPARK-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297951#comment-14297951 ] DeepakVohra commented on SPARK-5489: Sean, Some dependency is making use of scala.runtime.IntRef.create, which was introduced in Scala 2.11. https://github.com/scala/scala/blob/v2.11.0/src/library/scala/runtime/IntRef.java Scala 2.10.4, which is included with Spark 1.2, does not include the scala.runtime.IntRef.create method. https://github.com/scala/scala/blob/v2.10.4/src/library/scala/runtime/IntRef.java thanks, Deepak > KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create > (I)Lscala/runtime/IntRef; > - > > Key: SPARK-5489 > URL: https://issues.apache.org/jira/browse/SPARK-5489 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.2.0 > Environment: Spark 1.2 > Maven >Reporter: DeepakVohra > > The KMeans clustering generates the following error, which also seems to be > due to a version mismatch between the Scala used for compiling Spark and the > Scala in the Spark 1.2 Maven dependency. > Exception in thread "main" java.lang.NoSuchMethodError: > scala.runtime.IntRef.create > (I)Lscala/runtime/IntRef; > at > org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:282) > at > org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:155) > at > org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:132) > at > org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:352) > at > org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:362) > at > org.apache.spark.mllib.clustering.KMeans.train(KMeans.scala) > at > clusterer.kmeans.KMeansClusterer.main(KMeansClusterer.java:35) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5424) Make the new ALS implementation take generic ID types
[ https://issues.apache.org/jira/browse/SPARK-5424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297949#comment-14297949 ] Apache Spark commented on SPARK-5424: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4281 > Make the new ALS implementation take generic ID types > - > > Key: SPARK-5424 > URL: https://issues.apache.org/jira/browse/SPARK-5424 > Project: Spark > Issue Type: Improvement > Components: MLlib, Spark ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > The new implementation uses local indices of users and items. So the input > user/item type could be generic, at least specialized for Int and Long. We > can expose the generic interface as a developer API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
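The local-indexing idea behind the generic-ID interface can be illustrated with a small, hypothetical Python sketch (not the actual Spark implementation): arbitrary hashable user/item IDs are mapped once to dense local integer indices, and the factorization then works purely on those indices, so the external ID type can be anything.

```python
def index_ids(ids):
    """Map arbitrary hashable user/item IDs to dense local integer indices."""
    mapping = {}
    for raw_id in ids:
        if raw_id not in mapping:
            mapping[raw_id] = len(mapping)
    return mapping

# IDs may be strings, ints, longs, ... -- the algorithm only ever sees 0..n-1.
local = index_ids(["u9", "u3", "u9", 42])
```

Specializing the generic interface for Int and Long then simply avoids the boxing overhead for the common numeric-ID case.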
[jira] [Resolved] (SPARK-5464) Calling help() on a Python DataFrame fails with "cannot resolve column name __name__" error
[ https://issues.apache.org/jira/browse/SPARK-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5464. Resolution: Fixed Fix Version/s: 1.3.0 > Calling help() on a Python DataFrame fails with "cannot resolve column name > __name__" error > --- > > Key: SPARK-5464 > URL: https://issues.apache.org/jira/browse/SPARK-5464 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Fix For: 1.3.0 > > > Trying to call {{help()}} on a Python DataFrame fails with an exception: > {code} > >>> help(df) > Traceback (most recent call last): > File "", line 1, in > File "/Users/joshrosen/anaconda/lib/python2.7/site.py", line 464, in > __call__ > return pydoc.help(*args, **kwds) > File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1787, in > __call__ > self.help(request) > File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1834, in help > else: doc(request, 'Help on %s:') > File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1571, in doc > pager(render_doc(thing, title, forceload)) > File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1545, in > render_doc > object, name = resolve(thing, forceload) > File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1540, in > resolve > name = getattr(thing, '__name__', None) > File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, > in __getattr__ > return Column(self._jdf.apply(name)) > File > "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 538, in __call__ > File > "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", > line 300, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o31.apply. 
> : java.lang.RuntimeException: Cannot resolve column name "__name__" > at > org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:123) > at > org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:123) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122) > at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {code} > Here's a reproduction: > {code} > >>> from pyspark.sql import SQLContext, Row > >>> sqlContext = SQLContext(sc) > >>> rdd = sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}']) > >>> df = sqlContext.jsonRDD(rdd) > >>> help(df) > {code} > I think the problem here is that we don't throw the expected exception from > our overloaded {{getattr}} if a column can't be found. > We should be able to fix this by only attempting to call {{apply}} after > checking that the column name is valid (e.g. check against {{columns}}). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
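The fix described above — validate the name against {{columns}} before calling {{apply}}, and raise a normal AttributeError otherwise — can be sketched as follows. This is an illustrative stand-in, not the actual pyspark code: the class body and the returned value are placeholders.

```python
class DataFrame(object):
    def __init__(self, jdf, columns):
        self._jdf = jdf          # underlying Java DataFrame (placeholder here)
        self.columns = columns   # list of valid column names

    def __getattr__(self, name):
        # pydoc/help() probes attributes such as __name__; raising
        # AttributeError (instead of forwarding the lookup to Py4J)
        # lets getattr(df, "__name__", None) fail gracefully.
        if name not in self.columns:
            raise AttributeError("No such column or attribute: %r" % name)
        return self._jdf  # stand-in for Column(self._jdf.apply(name))
```

With this guard, `help(df)` reaches pydoc's `getattr(thing, '__name__', None)` call and gets the default instead of a Py4JJavaError.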
[jira] [Commented] (SPARK-5489) KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create (I)Lscala/runtime/IntRef;
[ https://issues.apache.org/jira/browse/SPARK-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297937#comment-14297937 ] DeepakVohra commented on SPARK-5489: Sean, Made the Scala version the same, but still getting the error. "For the Scala API, Spark 1.2.0 uses Scala 2.10. " http://spark.apache.org/docs/1.2.0/ Made the Maven dependencies' Scala version also 2.10:
org.apache.spark spark-core_2.10 1.2.0
org.scala-lang scala-library
org.scala-lang scala-compiler
org.apache.spark spark-mllib_2.11 1.2.0
org.scala-lang scala-library
org.scala-lang scala-compiler
org.scala-lang scala-library 2.10.0
org.scala-lang scala-compiler 2.10.0
thanks, Deepak
> KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create > (I)Lscala/runtime/IntRef;
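Note that the dependency list in the comment above still mixes Scala binary versions: spark-core_2.10 alongside spark-mllib_2.11. An artifact built for Scala 2.11 calling scala.runtime.IntRef.create while a 2.10 scala-library is on the classpath is consistent with the reported NoSuchMethodError. A small, purely illustrative check for mixed suffixes (the `deps` data mirrors the listing above; the helper name is hypothetical):

```python
import re

deps = [
    ("org.apache.spark", "spark-core_2.10", "1.2.0"),
    ("org.apache.spark", "spark-mllib_2.11", "1.2.0"),
]

def scala_suffixes(deps):
    """Collect the Scala binary-version suffixes (_2.10, _2.11, ...) of artifacts."""
    return {m.group(1) for _, artifact, _ in deps
            for m in [re.search(r"_(\d+\.\d+)$", artifact)] if m}

# More than one suffix in the result means mixed Scala binary versions,
# which typically surfaces at runtime as NoSuchMethodError.
```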
[jira] [Commented] (SPARK-5483) java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
[ https://issues.apache.org/jira/browse/SPARK-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297936#comment-14297936 ] DeepakVohra commented on SPARK-5483: Sean, Made the Scala version the same, but still getting the error. "For the Scala API, Spark 1.2.0 uses Scala 2.10. " http://spark.apache.org/docs/1.2.0/ Made the Maven dependencies' Scala version also 2.10:
org.apache.spark spark-core_2.10 1.2.0
org.scala-lang scala-library
org.scala-lang scala-compiler
org.apache.spark spark-mllib_2.11 1.2.0
org.scala-lang scala-library
org.scala-lang scala-compiler
org.scala-lang scala-library 2.10.0
org.scala-lang scala-compiler 2.10.0
thanks, Deepak
> java.lang.NoSuchMethodError: > scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object; > --- > > Key: SPARK-5483 > URL: https://issues.apache.org/jira/browse/SPARK-5483 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.2.0 > Environment: Maven > Spark 1.2 >Reporter: DeepakVohra > > Naive Bayes classifier generates the following error.
> ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.NoSuchMethodError: > scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object; > at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188) > at > breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303) > at > breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87) > at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303) > at > breeze.linalg.operators.DenseVectorOps$$anon$1.(DenseVectorOps.scala:38) > at > breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22) > at breeze.linalg.DenseVector$.(DenseVector.scala:225) > at breeze.linalg.DenseVector$.(DenseVector.scala) > at breeze.linalg.DenseVector.(DenseVector.scala:63) > at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:50) > at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:55) > at org.apache.spark.mllib.linalg.DenseVector.toBreeze(Vectors.scala:329) > at > org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:112) > at > org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:110) > at > org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:200) > at > org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:199) > at > org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:142) > at > org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:205) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > 
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 15/01/28 21:50:06 ERROR SparkUncaughtExceptionHandler: Uncaught exception in > thread Th
[jira] [Commented] (SPARK-5462) Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in DataFrames returned from sqlCtx.sql()
[ https://issues.apache.org/jira/browse/SPARK-5462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297935#comment-14297935 ] Josh Rosen commented on SPARK-5462: --- [~liancheng] [~marmbrus] Is this possibly related to SPARK-2063? > Catalyst UnresolvedException "Invalid call to qualifiers on unresolved > object" error when accessing fields in DataFrames returned from sqlCtx.sql() > --- > > Key: SPARK-5462 > URL: https://issues.apache.org/jira/browse/SPARK-5462 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > When trying to access fields on a Python DataFrame created via inferSchema, I > ran into a confusing Catalyst Py4J error. Here's a reproduction: > {code} > from pyspark import SparkContext > from pyspark.sql import SQLContext, Row > sc = SparkContext("local", "test") > sqlContext = SQLContext(sc) > # Load a text file and convert each line to a Row. > lines = sc.textFile("examples/src/main/resources/people.txt") > parts = lines.map(lambda l: l.split(",")) > people = parts.map(lambda p: Row(name=p[0], age=int(p[1]))) > # Infer the schema, and register the SchemaRDD as a table. > schemaPeople = sqlContext.inferSchema(people) > schemaPeople.registerTempTable("people") > # SQL can be run over SchemaRDDs that have been registered as a table. 
> teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age > <= 19") > print teenagers.name > {code} > This fails with the following error: > {code} > Traceback (most recent call last): > File "/Users/joshrosen/Documents/spark/sqltest.py", line 19, in > print teenagers.name > File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, > in __getattr__ > return Column(self._jdf.apply(name)) > File > "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 538, in __call__ > File > "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", > line 300, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o66.apply. > : org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > qualifiers on unresolved object, tree: 'name > at > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:50) > at > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:46) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:143) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:140) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:140) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126) > at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122) > at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237) > at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {code} > This is distinct from the helpful error message that I get when trying to > access a non-existent column. This error didn't occur when I tried the same > thing with a DataFrame created via jsonRDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional command
[jira] [Updated] (SPARK-5462) Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in DataFrames returned from sqlCtx.sql()
[ https://issues.apache.org/jira/browse/SPARK-5462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5462: -- Component/s: (was: PySpark) > Catalyst UnresolvedException "Invalid call to qualifiers on unresolved > object" error when accessing fields in DataFrames returned from sqlCtx.sql()
[jira] [Updated] (SPARK-5462) Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in DataFrames returned from sqlCtx.sql()
[ https://issues.apache.org/jira/browse/SPARK-5462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5462: -- Assignee: (was: Josh Rosen) > Catalyst UnresolvedException "Invalid call to qualifiers on unresolved > object" error when accessing fields in DataFrames returned from sqlCtx.sql()
[jira] [Commented] (SPARK-5462) Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in DataFrames returned from sqlCtx.sql()
[ https://issues.apache.org/jira/browse/SPARK-5462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297934#comment-14297934 ] Josh Rosen commented on SPARK-5462: --- Actually, this issue isn't Python-specific: it also occurs when running the "people / teenagers" example from the SQL Programming Guide in the regular Spark Shell: {code} scala> teenagers("name") org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to qualifiers on unresolved object, tree: 'name at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:50) at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:46) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:143) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:140) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:140) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:120) at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:258) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:20) at $iwC$$iwC$$iwC$$iwC$$iwC.(:25) at $iwC$$iwC$$iwC$$iwC.(:27) at $iwC$$iwC$$iwC.(:29) at $iwC$$iwC.(:31) at $iwC.(:33) at (:35) at .(:39) at .() at .(:7) at .() at $print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:854) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:899) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:811) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:654) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:662) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:667) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:994) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:942) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:942) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:942) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1039) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at 
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:366) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} > Catalyst UnresolvedException "Invalid call to qualifiers on unresolved > object" error when accessing fields in DataFrames returned from sqlCtx.sql() > --- > > Key: SPARK-5462 > URL:
[jira] [Updated] (SPARK-5462) Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in DataFrames returned from sqlCtx.sql()
[ https://issues.apache.org/jira/browse/SPARK-5462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5462: -- Summary: Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in DataFrames returned from sqlCtx.sql() (was: Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in Python DataFrame)
[jira] [Resolved] (SPARK-5373) literal in agg grouping expressions leads to incorrect result
[ https://issues.apache.org/jira/browse/SPARK-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5373. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4169 [https://github.com/apache/spark/pull/4169] > literal in agg grouping expressions leads to incorrect result > --- > > Key: SPARK-5373 > URL: https://issues.apache.org/jira/browse/SPARK-5373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: wangfei > Fix For: 1.3.0 > > > select key, count( * ) from src group by key, 1 will get the wrong answer! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
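Independent of Spark's implementation, the invariant that SPARK-5373 violated is easy to state: adding a constant literal to the GROUP BY key must not change any group's count. A minimal pure-Python sketch of that invariant, using hypothetical data (not from the ticket):

```python
from collections import Counter

rows = ["a", "a", "b"]

# GROUP BY key
by_key = Counter(rows)

# GROUP BY key, 1 -- the literal contributes nothing to the grouping
by_key_and_literal = Counter((k, 1) for k in rows)

# Dropping the constant component must recover the original counts.
assert {k: n for (k, _), n in by_key_and_literal.items()} == dict(by_key)
```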
[jira] [Commented] (SPARK-4133) PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297919#comment-14297919 ] Derrick Burns commented on SPARK-4133: -- I worked around it, so feel free On Thu, Jan 29, 2015 at 11:28 AM, Tathagata Das (JIRA) > PARSING_ERROR(2) when upgrading issues from 1.0.2 to 1.1.0 > -- > > Key: SPARK-4133 > URL: https://issues.apache.org/jira/browse/SPARK-4133 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Antonio Jesus Navarro > Attachments: spark_ex.logs > > > Snappy related problems found when trying to upgrade existing Spark Streaming > App from 1.0.2 to 1.1.0. > We can not run an existing 1.0.2 spark app if upgraded to 1.1.0 > > IOException is thrown by snappy (parsing_error(2)) > {code} > Executor task launch worker-0 DEBUG storage.BlockManager - Getting local > block broadcast_0 > Executor task launch worker-0 DEBUG storage.BlockManager - Level for block > broadcast_0 is StorageLevel(true, true, false, true, 1) > Executor task launch worker-0 DEBUG storage.BlockManager - Getting block > broadcast_0 from memory > Executor task launch worker-0 DEBUG storage.BlockManager - Getting local > block broadcast_0 > Executor task launch worker-0 DEBUG executor.Executor - Task 0's epoch is 0 > Executor task launch worker-0 DEBUG storage.BlockManager - Block broadcast_0 > not registered locally > Executor task launch worker-0 INFO broadcast.TorrentBroadcast - Started > reading broadcast variable 0 > sparkDriver-akka.actor.default-dispatcher-4 INFO > receiver.ReceiverSupervisorImpl - Registered receiver 0 > Executor task launch worker-0 INFO util.RecurringTimer - Started timer for > BlockGenerator at time 1414656492400 > Executor task launch worker-0 INFO receiver.BlockGenerator - Started > BlockGenerator > Thread-87 INFO receiver.BlockGenerator - Started block pushing thread > Executor task launch worker-0 INFO receiver.ReceiverSupervisorImpl - > 
Starting receiver > sparkDriver-akka.actor.default-dispatcher-5 INFO scheduler.ReceiverTracker - > Registered receiver for stream 0 from akka://sparkDriver > Executor task launch worker-0 INFO kafka.KafkaReceiver - Starting Kafka > Consumer Stream with group: stratioStreaming > Executor task launch worker-0 INFO kafka.KafkaReceiver - Connecting to > Zookeeper: node.stratio.com:2181 > sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] > received message StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 > cap=0]) from Actor[akka://sparkDriver/deadLetters] > sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] > received message StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 > cap=0]) from Actor[akka://sparkDriver/deadLetters] > sparkDriver-akka.actor.default-dispatcher-6 DEBUG local.LocalActor - [actor] > received message StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 > cap=0]) from Actor[akka://sparkDriver/deadLetters] > sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] > handled message (8.442354 ms) > StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from > Actor[akka://sparkDriver/deadLetters] > sparkDriver-akka.actor.default-dispatcher-2 DEBUG local.LocalActor - [actor] > handled message (8.412421 ms) > StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from > Actor[akka://sparkDriver/deadLetters] > sparkDriver-akka.actor.default-dispatcher-6 DEBUG local.LocalActor - [actor] > handled message (8.385471 ms) > StatusUpdate(0,RUNNING,java.nio.HeapByteBuffer[pos=0 lim=0 cap=0]) from > Actor[akka://sparkDriver/deadLetters] > Executor task launch worker-0 INFO utils.VerifiableProperties - Verifying > properties > Executor task launch worker-0 INFO utils.VerifiableProperties - Property > group.id is overridden to stratioStreaming > Executor task launch worker-0 INFO utils.VerifiableProperties - Property > zookeeper.connect is 
overridden to node.stratio.com:2181 > Executor task launch worker-0 INFO utils.VerifiableProperties - Property > zookeeper.connection.timeout.ms is overridden to 1 > Executor task launch worker-0 INFO broadcast.TorrentBroadcast - Reading > broadcast variable 0 took 0.033998997 s > Executor task launch worker-0 INFO consumer.ZookeeperConsumerConnector - > [stratioStreaming_ajn-stratio-1414656492293-8ecb3e3a], Connecting to > zookeeper instance at node.stratio.com:2181 > Executor task launch worker-0 DEBUG zkclient.ZkConnection - Creating new > ZookKeeper instance to connect to node.stratio.com:2181. > ZkClient-EventThread-169-node.stratio.com:2181 INFO zkclient.ZkEventThread - > Starting ZkClient event thread. > Executor task launch worker-0
[jira] [Resolved] (SPARK-5367) support star expression in udf
[ https://issues.apache.org/jira/browse/SPARK-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5367. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4163 [https://github.com/apache/spark/pull/4163] > support star expression in udf > -- > > Key: SPARK-5367 > URL: https://issues.apache.org/jira/browse/SPARK-5367 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: wangfei > Fix For: 1.3.0 > > > now spark sql does not support star expression in udf, the following sql will > get error > ``` > select concat( * ) from src > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5462) Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in Python DataFrame
[ https://issues.apache.org/jira/browse/SPARK-5462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297913#comment-14297913 ] Josh Rosen commented on SPARK-5462: --- I'm working on a patch for this now. It looks like the problem crops up when trying to select columns from DataFrames that are returned by SQL queries, as opposed to ones created by applying or inferring a schema. Here's a regression test demonstrating this: {code} def test_column_selection_on_dataframes_created_by_queries(self): # Regression test for SPARK-5462 df = self.df df.registerTempTable("test") df_from_query = self.sqlCtx.sql("select key, values from test") df_from_query.key # Throws exception df_from_query.value {code} > Catalyst UnresolvedException "Invalid call to qualifiers on unresolved > object" error when accessing fields in Python DataFrame > -- > > Key: SPARK-5462 > URL: https://issues.apache.org/jira/browse/SPARK-5462 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > When trying to access fields on a Python DataFrame created via inferSchema, I > ran into a confusing Catalyst Py4J error. Here's a reproduction: > {code} > from pyspark import SparkContext > from pyspark.sql import SQLContext, Row > sc = SparkContext("local", "test") > sqlContext = SQLContext(sc) > # Load a text file and convert each line to a Row. > lines = sc.textFile("examples/src/main/resources/people.txt") > parts = lines.map(lambda l: l.split(",")) > people = parts.map(lambda p: Row(name=p[0], age=int(p[1]))) > # Infer the schema, and register the SchemaRDD as a table. > schemaPeople = sqlContext.inferSchema(people) > schemaPeople.registerTempTable("people") > # SQL can be run over SchemaRDDs that have been registered as a table. 
> teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age > <= 19") > print teenagers.name > {code} > This fails with the following error: > {code} > Traceback (most recent call last): > File "/Users/joshrosen/Documents/spark/sqltest.py", line 19, in > print teenagers.name > File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, > in __getattr__ > return Column(self._jdf.apply(name)) > File > "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 538, in __call__ > File > "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", > line 300, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o66.apply. > : org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > qualifiers on unresolved object, tree: 'name > at > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:50) > at > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:46) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:143) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:140) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:140) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126) > at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122) > at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237) > at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(Gatew
[jira] [Resolved] (SPARK-4786) Parquet filter pushdown for BYTE and SHORT types
[ https://issues.apache.org/jira/browse/SPARK-4786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4786. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4156 [https://github.com/apache/spark/pull/4156] > Parquet filter pushdown for BYTE and SHORT types > > > Key: SPARK-4786 > URL: https://issues.apache.org/jira/browse/SPARK-4786 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Lian > Fix For: 1.3.0 > > > Among all integral types, currently only INT and LONG predicates can be > converted to Parquet filter predicate. BYTE and SHORT predicates can be > covered by INT. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5309) Reduce Binary/String conversion overhead when reading/writing Parquet files
[ https://issues.apache.org/jira/browse/SPARK-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5309. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4187 [https://github.com/apache/spark/pull/4187] > Reduce Binary/String conversion overhead when reading/writing Parquet files > --- > > Key: SPARK-5309 > URL: https://issues.apache.org/jira/browse/SPARK-5309 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.0 >Reporter: MIchael Davies >Priority: Minor > Fix For: 1.3.0 > > > Converting between Parquet Binary and Java Strings can form a significant > proportion of query times. > For columns which have repeated String values (which is common) the same > Binary will be repeatedly being converted. > A simple change to cache the last converted String per column was shown to > reduce query times by 25% when grouping on a data set of 66M rows on a column > with many repeated Strings. > A possible optimisation would be to hand responsibility for Binary > encoding/decoding over to Parquet so that it could ensure that this was done > only once per Binary value. > Next step is to look at Parquet code and to discuss with that project, which > I will do. > More details are available on this discussion: > http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-td10141.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5462) Catalyst UnresolvedException "Invalid call to qualifiers on unresolved object" error when accessing fields in Python DataFrame
[ https://issues.apache.org/jira/browse/SPARK-5462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-5462: - Assignee: Josh Rosen > Catalyst UnresolvedException "Invalid call to qualifiers on unresolved > object" error when accessing fields in Python DataFrame > -- > > Key: SPARK-5462 > URL: https://issues.apache.org/jira/browse/SPARK-5462 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > When trying to access fields on a Python DataFrame created via inferSchema, I > ran into a confusing Catalyst Py4J error. Here's a reproduction: > {code} > from pyspark import SparkContext > from pyspark.sql import SQLContext, Row > sc = SparkContext("local", "test") > sqlContext = SQLContext(sc) > # Load a text file and convert each line to a Row. > lines = sc.textFile("examples/src/main/resources/people.txt") > parts = lines.map(lambda l: l.split(",")) > people = parts.map(lambda p: Row(name=p[0], age=int(p[1]))) > # Infer the schema, and register the SchemaRDD as a table. > schemaPeople = sqlContext.inferSchema(people) > schemaPeople.registerTempTable("people") > # SQL can be run over SchemaRDDs that have been registered as a table. 
> teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age > <= 19") > print teenagers.name > {code} > This fails with the following error: > {code} > Traceback (most recent call last): > File "/Users/joshrosen/Documents/spark/sqltest.py", line 19, in > print teenagers.name > File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, > in __getattr__ > return Column(self._jdf.apply(name)) > File > "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 538, in __call__ > File > "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", > line 300, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o66.apply. > : org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > qualifiers on unresolved object, tree: 'name > at > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:50) > at > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:46) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:143) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:140) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:140) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:126) > at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122) > at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237) > at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {code} > This is distinct from the helpful error message that I get when trying to > access a non-existent column. This error didn't occur when I tried the same > thing with a DataFrame created via jsonRDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5429) Can't generate Hive golden answer on Hive 0.13.1
[ https://issues.apache.org/jira/browse/SPARK-5429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust closed SPARK-5429. --- Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Liang-Chi Hsieh > Can't generate Hive golden answer on Hive 0.13.1 > > > Key: SPARK-5429 > URL: https://issues.apache.org/jira/browse/SPARK-5429 > Project: Spark > Issue Type: Bug >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 1.3.0 > > > I found that running HiveComparisonTest.createQueryTest to generate Hive > golden answer files on Hive 0.13.1 would throw KryoException. Since Hive > 0.13.0, Kryo plan serialization is introduced alongside javaXML one. This is > a quick fix to set hive configuration to use javaXML serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5490) KMeans costs can be incorrect if tasks need to be rerun
Sandy Ryza created SPARK-5490: - Summary: KMeans costs can be incorrect if tasks need to be rerun Key: SPARK-5490 URL: https://issues.apache.org/jira/browse/SPARK-5490 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Sandy Ryza Assignee: Sandy Ryza KMeans uses accumulators in ShuffleMapTasks to compute the cost of a clustering at each iteration. Each time a ShuffleMapTask completes, it increments the accumulators at the driver; if a task runs twice because of failures, the accumulators are incremented twice, so a task's cost can end up being double-counted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
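The failure mode SPARK-5490 describes can be reproduced without Spark: if the cost accumulator is updated inside the task body, a rerun of the same task adds its partial cost a second time. A hedged pure-Python sketch (the `Accumulator` class, data, and cost function below are toy stand-ins, not Spark code):

```python
class Accumulator:
    """Toy stand-in for a Spark accumulator held at the driver."""
    def __init__(self):
        self.value = 0.0

    def add(self, n):
        self.value += n

def cost_task(acc, points, center):
    # Each (re)run of the task adds its partial cost to the accumulator.
    acc.add(sum(abs(p - center) for p in points))

acc = Accumulator()
cost_task(acc, [1.0, 2.0, 4.0], 2.0)
first = acc.value                     # the true cost of this partition
cost_task(acc, [1.0, 2.0, 4.0], 2.0)  # the same task rerun after a failure
assert acc.value == 2 * first         # the cost is double-counted
```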
[jira] [Created] (SPARK-5489) KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create (I)Lscala/runtime/IntRef;
DeepakVohra created SPARK-5489: -- Summary: KMeans clustering java.lang.NoSuchMethodError: scala.runtime.IntRef.create (I)Lscala/runtime/IntRef; Key: SPARK-5489 URL: https://issues.apache.org/jira/browse/SPARK-5489 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Environment: Spark 1.2 Maven Reporter: DeepakVohra The KMeans clustering generates the following error, which also seems to be due to a version mismatch between the Scala used for compiling Spark and the Scala in the Spark 1.2 Maven dependency. Exception in thread "main" java.lang.NoSuchMethodError: scala.runtime.IntRef.create (I)Lscala/runtime/IntRef; at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:282) at org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:155) at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:132) at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:352) at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:362) at org.apache.spark.mllib.clustering.KMeans.train(KMeans.scala) at clusterer.kmeans.KMeansClusterer.main(KMeansClusterer.java:35) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5483) java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
[ https://issues.apache.org/jira/browse/SPARK-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297886#comment-14297886 ] DeepakVohra commented on SPARK-5483: Sean, As indicated Spark is compiled with Scala 2.10, but the Scala version packaged in Maven Spark 1.2 is 2.10.4, which seems to be causing version mismatch and the error. Spark 1.2 should be packaged with Scala 2.10 instead of 2.10.4. thanks, Deepak > java.lang.NoSuchMethodError: > scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object; > --- > > Key: SPARK-5483 > URL: https://issues.apache.org/jira/browse/SPARK-5483 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.2.0 > Environment: Maven > Spark 1.2 >Reporter: DeepakVohra > > Naive Bayes classifier generates following error. > ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.NoSuchMethodError: > scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object; > at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188) > at > breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303) > at > breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87) > at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303) > at > breeze.linalg.operators.DenseVectorOps$$anon$1.(DenseVectorOps.scala:38) > at > breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22) > at breeze.linalg.DenseVector$.(DenseVector.scala:225) > at breeze.linalg.DenseVector$.(DenseVector.scala) > at breeze.linalg.DenseVector.(DenseVector.scala:63) > at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:50) > at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:55) > at org.apache.spark.mllib.linalg.DenseVector.toBreeze(Vectors.scala:329) > at > org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:112) > at > 
org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:110) > at > org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:200) > at > org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:199) > at > org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:142) > at > org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:205) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:58) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 15/01/28 21:50:06 ERROR SparkUncaughtExceptionHandler: Uncaught exception in > thread Thread[Executor task launch worker-0,5,main] > java.lang.NoSuchMethodError: > scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object; > at breeze.generic.MMRegistry2$class.register(Multimethod.scala:188) > at > breeze.linalg.VectorOps$$anon$1.breeze$linalg$operators$BinaryRegistry$$super$register(Vector.scala:303) > at > breeze.linalg.operators.BinaryRegistry$class.register(BinaryOp.scala:87) > at breeze.linalg.VectorOps$$anon$1.register(Vector.scala:303) > at > breeze.linalg.operators.DenseVectorOps$$anon$1.(DenseVectorOps.scala:38) > at > breeze.linalg.operators.DenseVectorOps$class.$init$(DenseVectorOps.scala:22) > at breeze.linalg.DenseVector$.(DenseVector.scala:225) > at 
breeze.linalg.DenseVector$.(DenseVector.scala) > at breeze.linalg.DenseVector.(DenseVector.scala:63) > at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:50) > at breeze.linalg.DenseVector$mcD$sp.(DenseVector.scala:55) > at org.apache.spark.mllib.linalg.DenseVector.toBreeze(Vectors.scala:329) > at > org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:112) > at > org.apache.spark.mllib.classification.NaiveBayes$$anonfun$3.apply(NaiveBayes.scala:110) > at > org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:200) > at
[jira] [Commented] (SPARK-5486) Add validate function for BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-5486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297859#comment-14297859 ] Apache Spark commented on SPARK-5486: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/4279 > Add validate function for BlockMatrix > - > > Key: SPARK-5486 > URL: https://issues.apache.org/jira/browse/SPARK-5486 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Burak Yavuz >Priority: Critical > > BlockMatrix needs a validate method to make debugging easy for users. > It will be an expensive method to perform, but it would be useful for users > to know why `multiply` or `add` didn't work properly. > Things to validate: > - MatrixBlocks that are not on the edges should have the dimensions > `rowsPerBlock` and `colsPerBlock`. > - There should be at most one block for each index -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
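The two checks SPARK-5486 lists for BlockMatrix — full dimensions for blocks not on the edges, and at most one block per index — can be sketched in plain Python. This is a hypothetical stand-alone helper, not the MLlib implementation, and the block representation below is an assumption:

```python
from collections import Counter

def validate(blocks, rows_per_block, cols_per_block,
             num_block_rows, num_block_cols):
    """blocks: list of ((block_row, block_col), (n_rows, n_cols)) pairs."""
    # There should be at most one block for each index.
    dups = [idx for idx, n in Counter(i for i, _ in blocks).items() if n > 1]
    if dups:
        raise ValueError("duplicate blocks at indices %s" % dups)
    for (i, j), (r, c) in blocks:
        # Blocks not on the bottom/right edge must be full-sized.
        if i < num_block_rows - 1 and r != rows_per_block:
            raise ValueError("block (%d, %d) has %d rows, expected %d"
                             % (i, j, r, rows_per_block))
        if j < num_block_cols - 1 and c != cols_per_block:
            raise ValueError("block (%d, %d) has %d cols, expected %d"
                             % (i, j, c, cols_per_block))

# A 2x2 grid of 3x3 blocks whose last row/column are smaller is valid:
validate([((0, 0), (3, 3)), ((0, 1), (3, 2)),
          ((1, 0), (2, 3)), ((1, 1), (2, 2))], 3, 3, 2, 2)
```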
[jira] [Reopened] (SPARK-603) add simple Counter API
[ https://issues.apache.org/jira/browse/SPARK-603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza reopened SPARK-603: -- > add simple Counter API > -- > > Key: SPARK-603 > URL: https://issues.apache.org/jira/browse/SPARK-603 > Project: Spark > Issue Type: New Feature >Priority: Minor > > Users need a very simple way to create counters in their jobs. Accumulators > provide a way to do this, but are a little clunky, for two reasons: > 1) the setup is a nuisance > 2) w/ delayed evaluation, you don't know when it will actually run, so its > hard to look at the values > consider this code: > {code} > def filterBogus(rdd:RDD[MyCustomClass], sc: SparkContext) = { > val filterCount = sc.accumulator(0) > val filtered = rdd.filter{r => > if (isOK(r)) true else {filterCount += 1; false} > } > println("removed " + filterCount.value + " records) > filtered > } > {code} > The println will always say 0 records were filtered, because its printed > before anything has actually run. I could print out the value later on, but > note that it would destroy the modularity of the method -- kinda ugly to > return the accumulator just so that it can get printed later on. (and of > course, the caller in turn might not know when the filter is going to get > applied, and would have to pass the accumulator up even further ...) > I'd like to have Counters which just automatically get printed out whenever a > stage has been run, and also with some api to get them back. I realize this > is tricky b/c a stage can get re-computed, so maybe you should only increment > the counters once. > Maybe a more general way to do this is to provide some callback for whenever > an RDD is computed -- by default, you would just print the counters, but the > user could replace w/ a custom handler. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
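The delayed-evaluation pitfall SPARK-603 describes is not Spark-specific; any lazy pipeline shows it. A minimal Python analogue of the quoted {code} example, with hypothetical data (the built-in lazy `filter` plays the role of the RDD transformation):

```python
count = {"filtered": 0}

def keep(r):
    ok = r % 2 == 0
    if not ok:
        count["filtered"] += 1
    return ok

filtered = filter(keep, [1, 2, 3])  # lazy: nothing has run yet
early = count["filtered"]           # 0 -- the "removed 0 records" problem
result = list(filtered)             # forces evaluation
assert early == 0
assert count["filtered"] == 2       # only now is the count meaningful
```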
[jira] [Commented] (SPARK-5464) Calling help() on a Python DataFrame fails with "cannot resolve column name __name__" error
[ https://issues.apache.org/jira/browse/SPARK-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297842#comment-14297842 ] Apache Spark commented on SPARK-5464: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/4278 > Calling help() on a Python DataFrame fails with "cannot resolve column name > __name__" error > --- > > Key: SPARK-5464 > URL: https://issues.apache.org/jira/browse/SPARK-5464 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > Trying to call {{help()}} on a Python DataFrame fails with an exception: > {code} > >>> help(df) > Traceback (most recent call last): > File "", line 1, in > File "/Users/joshrosen/anaconda/lib/python2.7/site.py", line 464, in > __call__ > return pydoc.help(*args, **kwds) > File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1787, in > __call__ > self.help(request) > File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1834, in help > else: doc(request, 'Help on %s:') > File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1571, in doc > pager(render_doc(thing, title, forceload)) > File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1545, in > render_doc > object, name = resolve(thing, forceload) > File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1540, in > resolve > name = getattr(thing, '__name__', None) > File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, > in __getattr__ > return Column(self._jdf.apply(name)) > File > "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 538, in __call__ > File > "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", > line 300, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o31.apply. 
> : java.lang.RuntimeException: Cannot resolve column name "__name__" > at > org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:123) > at > org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:123) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122) > at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {code} > Here's a reproduction: > {code} > >>> from pyspark.sql import SQLContext, Row > >>> sqlContext = SQLContext(sc) > >>> rdd = sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}']) > >>> df = sqlContext.jsonRDD(rdd) > >>> help(df) > {code} > I think the problem here is that we don't throw the expected exception from > our overloaded {{getattr}} if a column can't be found. > We should be able to fix this by only attempting to call {{apply}} after > checking that the column name is valid (e.g. check against {{columns}}). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
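The fix described above (validate the column name before delegating to the JVM) can be sketched as follows. This is a minimal stand-in class for illustration, not the real pyspark.sql.DataFrame; the `_jdf` handle and `apply` call mirror the traceback above but the class itself is hypothetical:

```python
class DataFrame:
    """Minimal stand-in for pyspark.sql.DataFrame, illustrating the
    proposed fix: check the name against the known columns before
    making a Py4J call that can fail."""

    def __init__(self, jdf, columns):
        self._jdf = jdf          # stand-in for the Py4J handle to the Java DataFrame
        self.columns = columns   # known column names

    def __getattr__(self, name):
        # __getattr__ is only called when normal attribute lookup fails.
        # Raising AttributeError for unknown names means probes such as
        # getattr(df, '__name__', None) -- which pydoc/help() rely on --
        # return None instead of propagating a Py4JJavaError.
        if name not in self.__dict__.get('columns', []):
            raise AttributeError(name)
        return self._jdf.apply(name)
```

With this check in place, `help(df)` succeeds because `df.__name__` raises the AttributeError that pydoc expects.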
[jira] [Closed] (SPARK-3888) Limit the memory used by python worker
[ https://issues.apache.org/jira/browse/SPARK-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-3888. - Resolution: Won't Fix > Limit the memory used by python worker > -- > > Key: SPARK-3888 > URL: https://issues.apache.org/jira/browse/SPARK-3888 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.1.0 >Reporter: Davies Liu >Assignee: Davies Liu > > Right now, we do not limit the memory used by Python workers, so they may run out of memory and freeze the OS. It would be safer to have a configurable hard limit for it, which should be larger than spark.executor.python.memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
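One way such a hard cap could be applied inside a POSIX worker is via `resource.setrlimit`. This is a sketch only; the size-parsing helper and function names below are hypothetical, not part of Spark:

```python
import resource

def parse_size(s):
    """Parse a Spark-style size string such as '512m' or '2g' into bytes
    (hypothetical helper, for illustration only)."""
    units = {'k': 1 << 10, 'm': 1 << 20, 'g': 1 << 30}
    s = s.strip().lower()
    if s and s[-1] in units:
        return int(s[:-1]) * units[s[-1]]
    return int(s)

def limit_worker_memory(limit_str):
    """Cap this worker's address space: allocations beyond the cap raise
    MemoryError inside the worker instead of exhausting the machine and
    freezing the OS."""
    limit = parse_size(limit_str)
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (limit, hard))
```

A worker would call `limit_worker_memory` once at startup, with a limit configured somewhat above the executor's Python memory setting so normal workloads are unaffected.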
[jira] [Updated] (SPARK-4939) Python updateStateByKey example hang in local mode
[ https://issues.apache.org/jira/browse/SPARK-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-4939: -- Affects Version/s: (was: 1.2.0) > Python updateStateByKey example hang in local mode > -- > > Key: SPARK-4939 > URL: https://issues.apache.org/jira/browse/SPARK-4939 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, Streaming >Affects Versions: 1.3.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5151) Parquet Predicate Pushdown Does Not Work with Nested Structures.
[ https://issues.apache.org/jira/browse/SPARK-5151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-5151: -- Component/s: (was: Spark Core) > Parquet Predicate Pushdown Does Not Work with Nested Structures. > > > Key: SPARK-5151 > URL: https://issues.apache.org/jira/browse/SPARK-5151 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 > Environment: pyspark, spark-ec2 created cluster >Reporter: Brad Willard > Labels: parquet, pyspark, sql > > I have JSON files of objects created with a nested structure, roughly of the form: > { "id": 123, "event": "login", "meta_data": {"user": "user1"}} > { "id": 125, "event": "login", "meta_data": {"user": "user2"}} > I load the data via Spark with: > rdd = sql_context.jsonFile() > # save it as a parquet file > rdd.saveAsParquetFile() > rdd = sql_context.parquetFile() > rdd.registerTempTable('events') > The following query works without issue if predicate pushdown is disabled: > select count(1) from events where meta_data.user = "user1" > If I enable predicate pushdown, I get an error saying meta_data.user is not in the schema: > Py4JJavaError: An error occurred while calling o218.collect. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 125 in stage 12.0 failed 4 times, most recent failure: Lost task 125.3 in stage 12.0 (TID 6164, ): java.lang.IllegalArgumentException: Column [user] was not found in schema! 
> at parquet.Preconditions.checkArgument(Preconditions.java:47) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41) > at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) > . > I expect this is actually related to another bug I filed where nested > structure is not preserved with spark sql. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5151) Parquet Predicate Pushdown Does Not Work with Nested Structures.
[ https://issues.apache.org/jira/browse/SPARK-5151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-5151: -- Component/s: SQL > Parquet Predicate Pushdown Does Not Work with Nested Structures. > > > Key: SPARK-5151 > URL: https://issues.apache.org/jira/browse/SPARK-5151 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 > Environment: pyspark, spark-ec2 created cluster >Reporter: Brad Willard > Labels: parquet, pyspark, sql > > I have JSON files of objects created with a nested structure, roughly of the form: > { "id": 123, "event": "login", "meta_data": {"user": "user1"}} > { "id": 125, "event": "login", "meta_data": {"user": "user2"}} > I load the data via Spark with: > rdd = sql_context.jsonFile() > # save it as a parquet file > rdd.saveAsParquetFile() > rdd = sql_context.parquetFile() > rdd.registerTempTable('events') > The following query works without issue if predicate pushdown is disabled: > select count(1) from events where meta_data.user = "user1" > If I enable predicate pushdown, I get an error saying meta_data.user is not in the schema: > Py4JJavaError: An error occurred while calling o218.collect. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 125 in stage 12.0 failed 4 times, most recent failure: Lost task 125.3 in stage 12.0 (TID 6164, ): java.lang.IllegalArgumentException: Column [user] was not found in schema! 
> at parquet.Preconditions.checkArgument(Preconditions.java:47) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76) > at > parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41) > at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) > . > I expect this is actually related to another bug I filed where nested > structure is not preserved with spark sql. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
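Until the underlying bug is fixed, the workaround implied by the report above is to leave Parquet filter pushdown disabled so Spark itself evaluates the nested-column predicate. The configuration key below is the one used around Spark 1.2 and may differ in other versions:

```
# spark-defaults.conf fragment -- workaround only, not a fix
spark.sql.parquet.filterPushdown   false
```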
[jira] [Assigned] (SPARK-5464) Calling help() on a Python DataFrame fails with "cannot resolve column name __name__" error
[ https://issues.apache.org/jira/browse/SPARK-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-5464: - Assignee: Josh Rosen > Calling help() on a Python DataFrame fails with "cannot resolve column name > __name__" error > --- > > Key: SPARK-5464 > URL: https://issues.apache.org/jira/browse/SPARK-5464 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > Trying to call {{help()}} on a Python DataFrame fails with an exception: > {code} > >>> help(df) > Traceback (most recent call last): > File "", line 1, in > File "/Users/joshrosen/anaconda/lib/python2.7/site.py", line 464, in > __call__ > return pydoc.help(*args, **kwds) > File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1787, in > __call__ > self.help(request) > File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1834, in help > else: doc(request, 'Help on %s:') > File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1571, in doc > pager(render_doc(thing, title, forceload)) > File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1545, in > render_doc > object, name = resolve(thing, forceload) > File "/Users/joshrosen/anaconda/lib/python2.7/pydoc.py", line 1540, in > resolve > name = getattr(thing, '__name__', None) > File "/Users/joshrosen/Documents/Spark/python/pyspark/sql.py", line 2154, > in __getattr__ > return Column(self._jdf.apply(name)) > File > "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 538, in __call__ > File > "/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", > line 300, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o31.apply. 
> : java.lang.RuntimeException: Cannot resolve column name "__name__" > at > org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:123) > at > org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:123) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:122) > at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:237) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {code} > Here's a reproduction: > {code} > >>> from pyspark.sql import SQLContext, Row > >>> sqlContext = SQLContext(sc) > >>> rdd = sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}']) > >>> df = sqlContext.jsonRDD(rdd) > >>> help(df) > {code} > I think the problem here is that we don't throw the expected exception from > our overloaded {{getattr}} if a column can't be found. > We should be able to fix this by only attempting to call {{apply}} after > checking that the column name is valid (e.g. check against {{columns}}). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5445) Make sure DataFrame expressions are usable in Java
[ https://issues.apache.org/jira/browse/SPARK-5445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297714#comment-14297714 ] Apache Spark commented on SPARK-5445: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4276 > Make sure DataFrame expressions are usable in Java > -- > > Key: SPARK-5445 > URL: https://issues.apache.org/jira/browse/SPARK-5445 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.3.0 > > > Some DataFrame expressions are not exactly usable in Java. For example, > aggregate functions are only defined in the dsl package object, which is > painful to use. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5192) Parquet fails to parse schema contains '\r'
[ https://issues.apache.org/jira/browse/SPARK-5192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297710#comment-14297710 ] Rekha Joshi commented on SPARK-5192: I have made a Parquet patch for it. Thanks. > Parquet fails to parse schema contains '\r' > --- > > Key: SPARK-5192 > URL: https://issues.apache.org/jira/browse/SPARK-5192 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 > Environment: Windows 7 + IntelliJ IDEA 13.0.2 >Reporter: cen yuhai >Priority: Minor > Fix For: 1.3.0 > > > I think this is actually a bug in Parquet: when I debugged 'ParquetTestData', I found an exception as below. So I downloaded the source of MessageTypeParser; the function 'isWhitespace' does not check for '\r': > private boolean isWhitespace(String t) { > return t.equals(" ") || t.equals("\t") || t.equals("\n"); > } > So I replaced all '\r' to work around this issue: > val subTestSchema = > """ > message myrecord { > optional boolean myboolean; > optional int64 mylong; > } > """.replaceAll("\r","") > at line 0: message myrecord { > at > parquet.schema.MessageTypeParser.asRepetition(MessageTypeParser.java:203) > at parquet.schema.MessageTypeParser.addType(MessageTypeParser.java:101) > at > parquet.schema.MessageTypeParser.addGroupTypeFields(MessageTypeParser.java:96) > at parquet.schema.MessageTypeParser.parse(MessageTypeParser.java:89) > at > parquet.schema.MessageTypeParser.parseMessageType(MessageTypeParser.java:79) > at > org.apache.spark.sql.parquet.ParquetTestData$.writeFile(ParquetTestData.scala:221) > at > org.apache.spark.sql.parquet.ParquetQuerySuite.beforeAll(ParquetQuerySuite.scala:92) > at > org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187) > at > org.apache.spark.sql.parquet.ParquetQuerySuite.beforeAll(ParquetQuerySuite.scala:85) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253) > at > 
org.apache.spark.sql.parquet.ParquetQuerySuite.run(ParquetQuerySuite.scala:85) > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
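The missing case in the buggy parser is easy to see when stated directly. A minimal sketch of the corrected check (written in Python for illustration; the actual fix belongs in Parquet's Java MessageTypeParser):

```python
def is_whitespace(t):
    # The buggy version accepted only " ", "\t", and "\n", so schemas
    # containing Windows-style "\r\n" line endings failed to parse;
    # the carriage return must be treated as whitespace too.
    return t in (" ", "\t", "\n", "\r")
```

The reporter's `.replaceAll("\r","")` workaround is equivalent to stripping the character the parser should have ignored.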
[jira] [Created] (SPARK-5488) SPARK_LOCAL_IP not read by mesos scheduler
Martin Tapp created SPARK-5488: -- Summary: SPARK_LOCAL_IP not read by mesos scheduler Key: SPARK-5488 URL: https://issues.apache.org/jira/browse/SPARK-5488 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.1.1 Reporter: Martin Tapp Priority: Minor My environment sets SPARK_LOCAL_IP and my driver sees it, but Mesos sees the address of my first available network adapter. I can even see that SPARK_LOCAL_IP is read correctly by Utils.localHostName and Utils.localIpAddress (core/src/main/scala/org/apache/spark/util/Utils.scala). It seems the Spark Mesos framework doesn't use it. The workaround for now is to disable my first adapter so that the second one becomes the one seen by Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5461) Graph should have isCheckpointed, getCheckpointFiles methods
[ https://issues.apache.org/jira/browse/SPARK-5461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297663#comment-14297663 ] Joseph K. Bradley commented on SPARK-5461: -- That sounds great if partitionsRDD can be non-transient. I'll try it but may need to ask for your help about the bugs. I'll ping you on the PR if so. Thanks! > Graph should have isCheckpointed, getCheckpointFiles methods > > > Key: SPARK-5461 > URL: https://issues.apache.org/jira/browse/SPARK-5461 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > Graph has a checkpoint method but does not have other helper functionality > which RDD has. Proposal: > {code} > /** >* Return whether this Graph has been checkpointed or not >*/ > def isCheckpointed: Boolean > /** >* Gets the name of the files to which this Graph was checkpointed >*/ > def getCheckpointFiles: Seq[String] > {code} > I need this for [SPARK-1405]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5466) Build Error caused by Guava shading in Spark
[ https://issues.apache.org/jira/browse/SPARK-5466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5466. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Marcelo Vanzin Thanks [~vanzin] for quickly fixing this! > Build Error caused by Guava shading in Spark > > > Key: SPARK-5466 > URL: https://issues.apache.org/jira/browse/SPARK-5466 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.3.0 >Reporter: Jian Zhou >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 1.3.0 > > > Guava is shaded inside spark-core itself. > https://github.com/apache/spark/commit/37a5e272f898e946c09c2e7de5d1bda6f27a8f39 > This causes build errors in multiple components, including GraphX/MLlib/SQL, when the com.google.common package on the classpath is incompatible with the version used when compiling Utils.class: > [error] bad symbolic reference. A signature in Utils.class refers to term util > [error] in package com.google.common which is not available. > [error] It may be completely missing from the current classpath, or the > [error] version on > [error] the classpath might be incompatible with the version used when > compiling Utils.class. > [error] > [error] while compiling: > /spark/graphx/src/main/scala/org/apache/spark/graphx/util/BytecodeUtils.scala > [error] during phase: erasure > [error] library version: version 2.10.4 > [error] compiler version: version 2.10.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297652#comment-14297652 ] Joseph K. Bradley commented on SPARK-5021: -- You can also generate the documentation yourself: [https://github.com/apache/spark/blob/master/docs/README.md] > GaussianMixtureEM should be faster for SparseVector input > - > > Key: SPARK-5021 > URL: https://issues.apache.org/jira/browse/SPARK-5021 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar > > GaussianMixtureEM currently converts everything to dense vectors. It would > be nice if it were faster for SparseVectors (running in time linear in the > number of non-zero values). > However, this may not be too important since clustering should rarely be done > in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5021: - Affects Version/s: (was: 1.2.0) 1.3.0 > GaussianMixtureEM should be faster for SparseVector input > - > > Key: SPARK-5021 > URL: https://issues.apache.org/jira/browse/SPARK-5021 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar > > GaussianMixtureEM currently converts everything to dense vectors. It would > be nice if it were faster for SparseVectors (running in time linear in the > number of non-zero values). > However, this may not be too important since clustering should rarely be done > in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture
[ https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5400: - Assignee: Travis Galoppo > Rename GaussianMixtureEM to GaussianMixture > --- > > Key: SPARK-5400 > URL: https://issues.apache.org/jira/browse/SPARK-5400 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Travis Galoppo >Priority: Minor > > GaussianMixtureEM is following the old naming convention of including the > optimization algorithm name in the class title. We should probably rename it > to GaussianMixture so that it can use other optimization algorithms in the > future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture
[ https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297648#comment-14297648 ] Joseph K. Bradley commented on SPARK-5400: -- Thanks! Could you also please change the name of the test suite to match? > Rename GaussianMixtureEM to GaussianMixture > --- > > Key: SPARK-5400 > URL: https://issues.apache.org/jira/browse/SPARK-5400 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Travis Galoppo >Priority: Minor > > GaussianMixtureEM is following the old naming convention of including the > optimization algorithm name in the class title. We should probably rename it > to GaussianMixture so that it can use other optimization algorithms in the > future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297634#comment-14297634 ] Travis Galoppo commented on SPARK-5021: --- [~josephkb] This ticket is marked as affecting version 1.2.0 ... this should be 1.3.0 ? > GaussianMixtureEM should be faster for SparseVector input > - > > Key: SPARK-5021 > URL: https://issues.apache.org/jira/browse/SPARK-5021 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar > > GaussianMixtureEM currently converts everything to dense vectors. It would > be nice if it were faster for SparseVectors (running in time linear in the > number of non-zero values). > However, this may not be too important since clustering should rarely be done > in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297622#comment-14297622 ] Travis Galoppo commented on SPARK-5021: --- [~MechCoder] The documentation for GMM is not yet completed (see SPARK-5013) ... the python interface is still being completed (SPARK-5012) and then the documentation can be completed. In the mean time, I might be able to answer your questions around the GMM code... > GaussianMixtureEM should be faster for SparseVector input > - > > Key: SPARK-5021 > URL: https://issues.apache.org/jira/browse/SPARK-5021 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar > > GaussianMixtureEM currently converts everything to dense vectors. It would > be nice if it were faster for SparseVectors (running in time linear in the > number of non-zero values). > However, this may not be too important since clustering should rarely be done > in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5400) Rename GaussianMixtureEM to GaussianMixture
[ https://issues.apache.org/jira/browse/SPARK-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297613#comment-14297613 ] Travis Galoppo commented on SPARK-5400: --- Please assign to me and I will make the name change > Rename GaussianMixtureEM to GaussianMixture > --- > > Key: SPARK-5400 > URL: https://issues.apache.org/jira/browse/SPARK-5400 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor > > GaussianMixtureEM is following the old naming convention of including the > optimization algorithm name in the class title. We should probably rename it > to GaussianMixture so that it can use other optimization algorithms in the > future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5322) Add transpose() to BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-5322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297486#comment-14297486 ] Apache Spark commented on SPARK-5322: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/4275 > Add transpose() to BlockMatrix > -- > > Key: SPARK-5322 > URL: https://issues.apache.org/jira/browse/SPARK-5322 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Burak Yavuz > > Once Local matrices have the option to transpose, transposing a BlockMatrix > will be trivial. Again, this will be a flag, which will in the end affect > every SubMatrix in the RDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4768) Add Support For Impala Encoded Timestamp (INT96)
[ https://issues.apache.org/jira/browse/SPARK-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297475#comment-14297475 ] Taiji Okada commented on SPARK-4768: [~yhuai], I've uploaded the string_timestamp tarball. It also includes a nanosecond precision timestamp value. Repro: create table string_timestamp ( dummy string, timestamp1 timestamp ) stored as parquet; insert into string_timestamp (dummy,timestamp1) values('test row 1', '2015-01-02 20:54:05'); insert into string_timestamp (dummy,timestamp1) values('test row 2', '1900-01-01'); insert into string_timestamp (dummy,timestamp1) values('test row 3', '-12-31'); insert into string_timestamp (dummy,timestamp1) values('test row 4', null); insert into string_timestamp (dummy,timestamp1) values('test row 5', '2015-01-02 20:54:10.123456789'); select * from string_timestamp; > Add Support For Impala Encoded Timestamp (INT96) > > > Key: SPARK-4768 > URL: https://issues.apache.org/jira/browse/SPARK-4768 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Pat McDonough >Priority: Critical > Attachments: 5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq, > string_timestamp.gz > > > Impala is using INT96 for timestamps. Spark SQL should be able to read this > data despite the fact that it is not part of the spec. > Perhaps adding a flag to act like impala when reading parquet (like we do for > strings already) would be useful. 
> Here's an example of the error you might see: > {code} > Caused by: java.lang.RuntimeException: Potential loss of precision: cannot > convert INT96 > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:61) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:113) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:314) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:311) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToAttributes(ParquetTypes.scala:310) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441) > at > org.apache.spark.sql.parquet.ParquetRelation.(ParquetRelation.scala:66) > at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:141) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4768) Add Support For Impala Encoded Timestamp (INT96)
[ https://issues.apache.org/jira/browse/SPARK-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Taiji Okada updated SPARK-4768: --- Attachment: string_timestamp.gz > Add Support For Impala Encoded Timestamp (INT96) > > > Key: SPARK-4768 > URL: https://issues.apache.org/jira/browse/SPARK-4768 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Pat McDonough >Priority: Critical > Attachments: 5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq, > string_timestamp.gz > > > Impala is using INT96 for timestamps. Spark SQL should be able to read this > data despite the fact that it is not part of the spec. > Perhaps adding a flag to act like impala when reading parquet (like we do for > strings already) would be useful. > Here's an example of the error you might see: > {code} > Caused by: java.lang.RuntimeException: Potential loss of precision: cannot > convert INT96 > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:61) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:113) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:314) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:311) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > 
at > org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToAttributes(ParquetTypes.scala:310) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441) > at > org.apache.spark.sql.parquet.ParquetRelation.(ParquetRelation.scala:66) > at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:141) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
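For context on what a reader would have to do by hand until Spark supports this: the INT96 values Impala writes are commonly documented as a 12-byte struct, an 8-byte little-endian count of nanoseconds within the day followed by a 4-byte little-endian Julian day number. A minimal decoding sketch (the function name and epoch constant are illustrative, not part of Spark or Impala):

```python
import struct
from datetime import datetime, timedelta

JULIAN_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01 (Unix epoch)

def int96_to_datetime(raw: bytes) -> datetime:
    """Decode a 12-byte Impala-style INT96 timestamp:
    bytes 0-7: nanoseconds within the day (little-endian int64),
    bytes 8-11: Julian day number (little-endian int32)."""
    nanos_of_day, julian_day = struct.unpack('<qi', raw)
    days_since_epoch = julian_day - JULIAN_UNIX_EPOCH
    return datetime(1970, 1, 1) + timedelta(
        days=days_since_epoch,
        microseconds=nanos_of_day // 1000)  # datetime has microsecond precision

# Example: Julian day 2440588 (1970-01-01), one second past midnight
raw = struct.pack('<qi', 1_000_000_000, JULIAN_UNIX_EPOCH)
print(int96_to_datetime(raw))  # 1970-01-01 00:00:01
```

Note the truncation from nanoseconds to microseconds: this is the "potential loss of precision" the exception above is guarding against.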
[jira] [Comment Edited] (SPARK-5487) Dockerfile to build spark's custom akka.
[ https://issues.apache.org/jira/browse/SPARK-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297458#comment-14297458 ] jay vyas edited comment on SPARK-5487 at 1/29/15 7:57 PM: -- To reproduce this, you can use the following Dockerfile. Hopefully a few minor modifications will result in a Dockerfile that we can use to build Spark's *critical* akka dependency from scratch. {noformat} FROM silarsis/base RUN apt-get -yq update && apt-get -yq install openjdk-7-jdk RUN wget -q -O /tmp/sbt.tgz http://scalasbt.artifactoryonline.com/scalasbt/sbt-native-packages/org/scala-sbt/sbt/0.12.4/sbt.tgz \ && cd /usr/local \ && tar zxf /tmp/sbt.tgz ENV PATH $PATH:/usr/local/sbt/bin VOLUME /opt/progfun WORKDIR /opt/progfun RUN /usr/local/sbt/bin/sbt version RUN cd /tmp && git clone https://github.com/pwendell/akka && cd /tmp/akka && git checkout 2.2.3-shaded-proto RUN cd /tmp/akka/ RUN cd /tmp/akka && sbt compile CMD ["/bin/bash"] {noformat} was (Author: jayunit100): To reproduce this, you can use the following Dockerfile. {noformat} FROM silarsis/base RUN apt-get -yq update && apt-get -yq install openjdk-7-jdk RUN wget -q -O /tmp/sbt.tgz http://scalasbt.artifactoryonline.com/scalasbt/sbt-native-packages/org/scala-sbt/sbt/0.12.4/sbt.tgz \ && cd /usr/local \ && tar zxf /tmp/sbt.tgz ENV PATH $PATH:/usr/local/sbt/bin VOLUME /opt/progfun WORKDIR /opt/progfun RUN /usr/local/sbt/bin/sbt version RUN cd /tmp && git clone https://github.com/pwendell/akka && cd /tmp/akka && git checkout 2.2.3-shaded-proto RUN cd /tmp/akka/ RUN cd /tmp/akka && sbt compile CMD ["/bin/bash"] {noformat} > Dockerfile to build spark's custom akka. > > > Key: SPARK-5487 > URL: https://issues.apache.org/jira/browse/SPARK-5487 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.2.0 >Reporter: jay vyas > > Building Spark's custom shaded akka version is tricky. 
The code is in > https://github.com/pwendell/akka/ (branch = 2.2.3-shaded-proto) , however, > when attempting to build, I receive some strange errors. > I've attempted to fork off of a Dockerfile for {{SBT 0.12.4}}, which I'll > attach in a snippet just as an example of what we might want to facilitate > building the spark specific akka until SPARK-5293 is completed. > {noformat} > [info] Compiling 6 Scala sources and 1 Java source to > /tmp/akka/akka-multi-node-testkit/target/classes... > [warn] Class com.google.protobuf.MessageLite not found - continuing with a > stub. > [error] error while loading ProtobufDecoder, class file > '/root/.ivy2/cache/io.netty/netty/bundles/netty-3.6.6.Final.jar(org/jboss/netty/handler/codec/protobuf/ProtobufDecoder.class)' > is broken > [error] (class java.lang.NullPointerException/null) > [error] > /tmp/akka/akka-multi-node-testkit/src/main/scala/akka/remote/testconductor/RemoteConnection.scala:24: > org.jboss.netty.handler.codec.protobuf.ProtobufDecoder does not have a > constructor > [error] val proto = List(new ProtobufEncoder, new > ProtobufDecoder(TestConductorProtocol.Wrapper.getDefaultInstance)) > [error] ^ > [error] > /tmp/akka/akka-multi-node-testkit/src/main/scala/akka/remote/testkit/MultiNodeSpec.scala:267: > value await is not a member of > scala.concurrent.Future[Iterable[akka.remote.testconductor.RoleName]] > [error] Note: implicit method awaitHelper is not applicable here because it > comes after the application point and it lacks an explicit result type > [error] testConductor.getNodes.await.filterNot(_ == myself).isEmpty > [error] ^ > [error] > /tmp/akka/akka-multi-node-testkit/src/main/scala/akka/remote/testkit/MultiNodeSpec.scala:354: > value await is not a member of scala.concurrent.Future[akka.actor.Address] > [error] Note: implicit method awaitHelper is not applicable here because it > comes after the application point and it lacks an explicit result type > [error] def node(role: RoleName): ActorPath = > 
RootActorPath(testConductor.getAddressFor(role).await) > [error] > ^ > [warn] one warning found > [error] four errors found > [info] Updating {file:/tmp/akka/}akka-docs... > [info] Done updating. > [info] Updating {file:/tmp/akka/}akka-contrib... > [info] Done updating. > [info] Updating {file:/tmp/akka/}akka-sample-osgi-dining-hakkers-core... > [info] Done updating. > [info] Compiling 17 Scala sources to /tmp/akka/akka-cluster/target/classes... > [error] > /tmp/akka/akka-cluster/src/main/scala/akka/cluster/protobuf/ClusterMessageSerializer.scala:59: > type mismatch; > [error] found : akka.cluster.protobuf.msg.GossipEnvelope > [error] required: com.google.protobuf_spark.MessageLite
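For anyone trying the Dockerfile from the comment above, this is roughly how it would be exercised (the image tag is illustrative; the failing `sbt compile` already runs as a build step, so the errors surface during `docker build`):

```shell
# Save the Dockerfile from the comment into an empty directory, then:
docker build -t spark-akka-repro .

# Or start a container from a successfully built image to iterate
# on the akka build interactively:
docker run --rm -it spark-akka-repro /bin/bash -c "cd /tmp/akka && sbt compile"
```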