[jira] [Updated] (SPARK-6520) Kryo serialization broken in the shell
[ https://issues.apache.org/jira/browse/SPARK-6520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6520: --- Component/s: Spark Shell > Kryo serialization broken in the shell > -- > > Key: SPARK-6520 > URL: https://issues.apache.org/jira/browse/SPARK-6520 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.3.0 >Reporter: Aaron Defazio > > If I start spark as follows: > {quote} > ~/spark-1.3.0-bin-hadoop2.4/bin/spark-shell --master local[1] --conf > "spark.serializer=org.apache.spark.serializer.KryoSerializer" > {quote} > Then using :paste, run > {quote} > case class Example(foo : String, bar : String) > val ex = sc.parallelize(List(Example("foo1", "bar1"), Example("foo2", > "bar2"))).collect() > {quote} > I get the error: > {quote} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 > (TID 0, localhost): java.io.IOException: > com.esotericsoftware.kryo.KryoException: Error constructing instance of > class: $line3.$read > Serialization trace: > $VAL10 ($iwC) > $outer ($iwC$$iwC) > $outer ($iwC$$iwC$Example) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1140) > at > org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:979) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1873) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) > {quote} > As far as I can tell, when using :paste, Kryo serialization doesn't work for > classes defined within the same paste. It does work when the statements > are entered without paste. > This issue seems serious to me, since Kryo serialization is virtually > mandatory for performance (20x slower with default serialization on my > problem), and I'm assuming feature parity between spark-shell and > spark-submit is a goal. > Note that this is different from SPARK-6497, which covers the case when Kryo > is set to require registration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
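The reporter's own observation suggests a workaround: the same statements succeed when entered directly, without :paste. A minimal sketch of that working path, assuming the same spark-shell invocation as above:

{code}
// Entered as separate spark-shell lines rather than one :paste block; each
// statement then lives in its own REPL wrapper object, so the task no longer
// needs Kryo to reconstruct the enclosing $read instance on the executors.
case class Example(foo: String, bar: String)
val ex = sc.parallelize(List(Example("foo1", "bar1"), Example("foo2", "bar2"))).collect()
{code}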
[jira] [Updated] (SPARK-6499) pyspark: printSchema command on a dataframe hangs
[ https://issues.apache.org/jira/browse/SPARK-6499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6499: --- Component/s: PySpark > pyspark: printSchema command on a dataframe hangs > - > > Key: SPARK-6499 > URL: https://issues.apache.org/jira/browse/SPARK-6499 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: cynepia > Attachments: airports.json, pyspark.txt > > > 1. A printSchema() on a dataframe fails to return even after a long time. > Will attach the console logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6504) Cannot read Parquet files generated from different versions at once
[ https://issues.apache.org/jira/browse/SPARK-6504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6504: --- Component/s: SQL > Cannot read Parquet files generated from different versions at once > --- > > Key: SPARK-6504 > URL: https://issues.apache.org/jira/browse/SPARK-6504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.1 >Reporter: Marius Soutier > > When trying to read Parquet files generated by Spark 1.1.1 and 1.2.1 at the > same time via > `sqlContext.parquetFile("fileFrom1.1.parquet,fileFrom1.2.parquet")`, an > exception occurs: > could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has > conflicting values: > [{"type":"struct","fields":[{"name":"date","type":"string","nullable":true,"metadata":{}},{"name":"account","type":"string","nullable":true,"metadata":{}},{"name":"impressions","type":"long","nullable":false,"metadata":{}},{"name":"cost","type":"double","nullable":false,"metadata":{}},{"name":"clicks","type":"long","nullable":false,"metadata":{}},{"name":"conversions","type":"long","nullable":false,"metadata":{}},{"name":"orderValue","type":"double","nullable":false,"metadata":{}}]}, > StructType(List(StructField(date,StringType,true), > StructField(account,StringType,true), > StructField(impressions,LongType,false), StructField(cost,DoubleType,false), > StructField(clicks,LongType,false), StructField(conversions,LongType,false), > StructField(orderValue,DoubleType,false)))] > The schema is exactly equal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
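Since the reporter notes the schemas are exactly equal, one hedged workaround sketch (an assumption, not from the report) is to read each generation of files separately and union the resulting DataFrames, so Parquet never attempts to merge the conflicting spark.sql.parquet.row.metadata entries:

{code}
// Hypothetical workaround: load each version's files on their own, then
// union the DataFrames (safe here because the schemas are identical).
val fromSpark11 = sqlContext.parquetFile("fileFrom1.1.parquet")
val fromSpark12 = sqlContext.parquetFile("fileFrom1.2.parquet")
val combined = fromSpark11.unionAll(fromSpark12)
{code}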
[jira] [Created] (SPARK-6530) ChiSqSelector transformer
Xusen Yin created SPARK-6530: Summary: ChiSqSelector transformer Key: SPARK-6530 URL: https://issues.apache.org/jira/browse/SPARK-6530 Project: Spark Issue Type: Sub-task Reporter: Xusen Yin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6529) Word2Vec transformer
Xusen Yin created SPARK-6529: Summary: Word2Vec transformer Key: SPARK-6529 URL: https://issues.apache.org/jira/browse/SPARK-6529 Project: Spark Issue Type: Sub-task Reporter: Xusen Yin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6528) IDF transformer
Xusen Yin created SPARK-6528: Summary: IDF transformer Key: SPARK-6528 URL: https://issues.apache.org/jira/browse/SPARK-6528 Project: Spark Issue Type: Sub-task Reporter: Xusen Yin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6495) DataFrame#insertInto method should support insert rows with sub-columns
[ https://issues.apache.org/jira/browse/SPARK-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379243#comment-14379243 ] Chaozhong Yang edited comment on SPARK-6495 at 3/25/15 6:31 AM: Thanks! Maybe what you point at is the resolved issue https://issues.apache.org/jira/browse/SPARK-3851. Reading data from parquet files with different but compatible schemas has been supported in Spark 1.3.0. https://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging was (Author: debugger87): Thanks! Maybe what you point at is the resolved issue https://issues.apache.org/jira/browse/SPARK-3851. Reading data from parquet files with different but compatible schemas has been support in Spark 1.3.0. https://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging > DataFrame#insertInto method should support insert rows with sub-columns > --- > > Key: SPARK-6495 > URL: https://issues.apache.org/jira/browse/SPARK-6495 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Chaozhong Yang > > The original table's schema is like this: > |-- a: string (nullable = true) > |-- b: string (nullable = true) > |-- c: string (nullable = true) > |-- d: string (nullable = true) > If we want to insert one row (can be transformed into a DataFrame) with this > schema: > |-- a: string (nullable = true) > |-- b: string (nullable = true) > |-- c: string (nullable = true) > Of course, that operation will fail. Actually, in many cases, people need to > insert new rows with columns which are a subset of the original table columns. > If we can support and fix those issues, Spark SQL's insertion can be more > valuable to users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6495) DataFrame#insertInto method should support insert rows with sub-columns
[ https://issues.apache.org/jira/browse/SPARK-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chaozhong Yang closed SPARK-6495. - Resolution: Not a Problem > DataFrame#insertInto method should support insert rows with sub-columns > --- > > Key: SPARK-6495 > URL: https://issues.apache.org/jira/browse/SPARK-6495 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Chaozhong Yang > > The original table's schema is like this: > |-- a: string (nullable = true) > |-- b: string (nullable = true) > |-- c: string (nullable = true) > |-- d: string (nullable = true) > If we want to insert one row (can be transformed into a DataFrame) with this > schema: > |-- a: string (nullable = true) > |-- b: string (nullable = true) > |-- c: string (nullable = true) > Of course, that operation will fail. Actually, in many cases, people need to > insert new rows with columns which are a subset of the original table columns. > If we can support and fix those issues, Spark SQL's insertion can be more > valuable to users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
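A hedged workaround sketch for the use case described above (names are illustrative, not from the issue): pad the incoming DataFrame with typed null columns until it matches the target table's schema, then call insertInto as usual. Note that insertInto matches columns by position, so the padded column order must mirror the table's.

{code}
import org.apache.spark.sql.functions.lit

// rowsToInsert is a hypothetical DataFrame with columns a, b, c only;
// add the missing column d as a typed null so the schemas line up.
val padded = rowsToInsert.withColumn("d", lit(null).cast("string"))
padded.insertInto("target_table")  // hypothetical table name
{code}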
[jira] [Commented] (SPARK-6526) Add Normalizer transformer
[ https://issues.apache.org/jira/browse/SPARK-6526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379371#comment-14379371 ] Apache Spark commented on SPARK-6526: - User 'yinxusen' has created a pull request for this issue: https://github.com/apache/spark/pull/5181 > Add Normalizer transformer > -- > > Key: SPARK-6526 > URL: https://issues.apache.org/jira/browse/SPARK-6526 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xusen Yin > Fix For: 1.4.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6526) Add Normalizer transformer
[ https://issues.apache.org/jira/browse/SPARK-6526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-6526: - Description: https://github.com/apache/spark/pull/5181 > Add Normalizer transformer > -- > > Key: SPARK-6526 > URL: https://issues.apache.org/jira/browse/SPARK-6526 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xusen Yin > Fix For: 1.4.0 > > > https://github.com/apache/spark/pull/5181 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6527) sc.binaryFiles cannot access files on s3
Zhao Zhang created SPARK-6527: - Summary: sc.binaryFiles cannot access files on s3 Key: SPARK-6527 URL: https://issues.apache.org/jira/browse/SPARK-6527 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.3.0, 1.2.0 Environment: I am running Spark on EC2 Reporter: Zhao Zhang The sc.binaryFiles() method cannot access files stored on s3. It correctly lists the number of files, but reports "file does not exist" when processing them. I also tried sc.textFile(), which works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
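A minimal sketch of the reported behaviour (the bucket and path are hypothetical, and S3 credentials are assumed to be configured for the s3n scheme):

{code}
val files = sc.binaryFiles("s3n://some-bucket/input/")  // hypothetical path
println(files.count())                                  // listing the files succeeds
// Materializing the streams is where the reporter sees "file does not exist":
files.map { case (path, stream) => stream.toArray().length }.collect()
{code}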
[jira] [Commented] (SPARK-6525) Add new feature transformers in ML package
[ https://issues.apache.org/jira/browse/SPARK-6525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379358#comment-14379358 ] Xusen Yin commented on SPARK-6525: -- [~mengxr] Let's add new feature transformers. I will try to write the Normalizer transformer first. > Add new feature transformers in ML package > -- > > Key: SPARK-6525 > URL: https://issues.apache.org/jira/browse/SPARK-6525 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.3.0 >Reporter: Xusen Yin > Labels: features > Fix For: 1.4.0 > > > New feature transformers should be added to the ML package to make assembling ML > pipelines easier. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6526) Add Normalizer transformer
Xusen Yin created SPARK-6526: Summary: Add Normalizer transformer Key: SPARK-6526 URL: https://issues.apache.org/jira/browse/SPARK-6526 Project: Spark Issue Type: Sub-task Reporter: Xusen Yin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6525) Add new feature transformers in ML package
Xusen Yin created SPARK-6525: Summary: Add new feature transformers in ML package Key: SPARK-6525 URL: https://issues.apache.org/jira/browse/SPARK-6525 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.3.0 Reporter: Xusen Yin Fix For: 1.4.0 New feature transformers should be added to the ML package to make assembling ML pipelines easier. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6485) Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379333#comment-14379333 ] Meethu Mathew commented on SPARK-6485: -- As you mentioned here https://issues.apache.org/jira/browse/SPARK-6100, MatrixUDT has been merged, but MatrixUDT for PySpark seems to still be in progress. Does https://issues.apache.org/jira/browse/SPARK-6390 block this task? > Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark > -- > > Key: SPARK-6485 > URL: https://issues.apache.org/jira/browse/SPARK-6485 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Xiangrui Meng > > We should add APIs for CoordinateMatrix/RowMatrix/IndexedRowMatrix in > PySpark. Internally, we can use DataFrames for serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6465) GenericRowWithSchema: KryoException: Class cannot be created (missing no-arg constructor):
[ https://issues.apache.org/jira/browse/SPARK-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377490#comment-14377490 ] Earthson Lu edited comment on SPARK-6465 at 3/25/15 5:26 AM: - I'm confused. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L94 {code} def convertRowToScala(r: Row, schema: StructType): Row = { // TODO: This is very slow!!! new GenericRowWithSchema( //Why we need GenericRowWithSchema? It seems to be the only use of GenericRowWithSchema r.toSeq.zip(schema.fields.map(_.dataType)) .map(r_dt => convertToScala(r_dt._1, r_dt._2)).toArray, schema) } {code} was (Author: earthsonlu): I'm confused. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L94 {code:scala} def convertRowToScala(r: Row, schema: StructType): Row = { // TODO: This is very slow!!! new GenericRowWithSchema( //Why we need GenericRowWithSchema? It seems to be the only use of GenericRowWithSchema r.toSeq.zip(schema.fields.map(_.dataType)) .map(r_dt => convertToScala(r_dt._1, r_dt._2)).toArray, schema) } {code} > GenericRowWithSchema: KryoException: Class cannot be created (missing no-arg > constructor): > -- > > Key: SPARK-6465 > URL: https://issues.apache.org/jira/browse/SPARK-6465 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: Spark 1.3, YARN 2.6.0, CentOS >Reporter: Earthson Lu >Assignee: Michael Armbrust >Priority: Critical > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > I cannot find an issue for this. > The registration for GenericRowWithSchema is missing in > org.apache.spark.sql.execution.SparkSqlSerializer. > Is this the only thing we need to do?
> Here is the log > {code} > 15/03/23 16:21:00 WARN TaskSetManager: Lost task 9.0 in stage 20.0 (TID > 31978, datanode06.site): com.esotericsoftware.kryo.KryoException: Class > cannot be created (missing no-arg constructor): > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema > at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1050) > at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1062) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:228) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:217) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42) > at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:138) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:66) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:217) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:64) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (SPARK-6465) GenericRowWithSchema: KryoException: Class cannot be created (missing no-arg constructor):
[ https://issues.apache.org/jira/browse/SPARK-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377490#comment-14377490 ] Earthson Lu edited comment on SPARK-6465 at 3/25/15 5:25 AM: - I'm confused. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L94 {code:scala} def convertRowToScala(r: Row, schema: StructType): Row = { // TODO: This is very slow!!! new GenericRowWithSchema( //Why we need GenericRowWithSchema? It seems to be the only use of GenericRowWithSchema r.toSeq.zip(schema.fields.map(_.dataType)) .map(r_dt => convertToScala(r_dt._1, r_dt._2)).toArray, schema) } {code} was (Author: earthsonlu): I'm confused. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L94 ```scala def convertRowToScala(r: Row, schema: StructType): Row = { // TODO: This is very slow!!! new GenericRowWithSchema( //Why we need GenericRowWithSchema? It seems to be the only use of GenericRowWithSchema r.toSeq.zip(schema.fields.map(_.dataType)) .map(r_dt => convertToScala(r_dt._1, r_dt._2)).toArray, schema) } ``` > GenericRowWithSchema: KryoException: Class cannot be created (missing no-arg > constructor): > -- > > Key: SPARK-6465 > URL: https://issues.apache.org/jira/browse/SPARK-6465 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: Spark 1.3, YARN 2.6.0, CentOS >Reporter: Earthson Lu >Assignee: Michael Armbrust >Priority: Critical > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > I cannot find an issue for this. > The registration for GenericRowWithSchema is missing in > org.apache.spark.sql.execution.SparkSqlSerializer. > Is this the only thing we need to do?
> Here is the log > {code} > 15/03/23 16:21:00 WARN TaskSetManager: Lost task 9.0 in stage 20.0 (TID > 31978, datanode06.site): com.esotericsoftware.kryo.KryoException: Class > cannot be created (missing no-arg constructor): > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema > at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1050) > at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1062) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:228) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:217) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42) > at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:138) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:66) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:217) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:64) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) -
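The reporter's question ("Is this the only thing we need to do?") points at registering the class. A hedged user-side sketch, explicitly an assumption rather than the official fix: register GenericRowWithSchema through a custom KryoRegistrator, and also install an objenesis-backed instantiator strategy, since the failure above is specifically the missing no-arg constructor:

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.objenesis.strategy.StdInstantiatorStrategy

// Hypothetical registrator; enabled via
// --conf spark.kryo.registrator=RowKryoRegistrator
class RowKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[GenericRowWithSchema])
    // Lets Kryo construct classes that lack a no-arg constructor.
    kryo.setInstantiatorStrategy(new StdInstantiatorStrategy())
  }
}
{code}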
[jira] [Created] (SPARK-6524) Problem connecting JAVA API to Spark Yarn Cluster or yarn Client
milan.b created SPARK-6524: -- Summary: Problem connecting JAVA API to Spark Yarn Cluster or yarn Client Key: SPARK-6524 URL: https://issues.apache.org/jira/browse/SPARK-6524 Project: Spark Issue Type: Question Affects Versions: 1.2.0 Environment: Ubuntu 14.10 java JDK 1.8 Reporter: milan.b Hi Team, I am trying to submit a spark job to yarn-cluster or yarn client using Java API but I was unable to do so. Following is the configuration code System.setProperty("SPARK_YARN_MODE", "true"); SparkConf sparkYarnConf = new SparkConf().setAppName("Spark yarn") .setMaster("yarn-cluster") .set("spark.executor.memory", "258m") .set("spark.driver.memory", "2588m") .set("spark.yarn.app.id", "append"); ClientArguments clientArgs =new ClientArguments(args, sparkYarnConf); Error : org.apache.spark.util.Utils] (MSC service thread 1-2) Service 'SparkUI' could not bind on port 4043. Attempting port 4044. 20:57:41,808 ERROR [stderr] (MSC service thread 1-2) 15/03/24 20:57:41 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044. 20:57:41,842 INFO [org.apache.spark.util.Utils] (MSC service thread 1-2) Successfully started service 'SparkUI' on port 4044. 20:57:41,846 ERROR [stderr] (MSC service thread 1-2) 15/03/24 20:57:41 INFO Utils: Successfully started service 'SparkUI' on port 4044. 20:57:41,850 INFO [org.apache.spark.ui.SparkUI] (MSC service thread 1-2) Started SparkUI at http://D7271:4044 20:57:41,850 ERROR [stderr] (MSC service thread 1-2) 15/03/24 20:57:41 INFO SparkUI: Started SparkUI at http://D7271:4044 20:57:41,968 INFO [org.apache.spark.scheduler.cluster.YarnClusterScheduler] (MSC service thread 1-2) Created YarnClusterScheduler 20:57:41,968 ERROR [stderr] (MSC service thread 1-2) 15/03/24 20:57:41 INFO YarnClusterScheduler: Created YarnClusterScheduler 20:57:42,243 INFO [org.apache.spark.network.netty.NettyBlockTransferService] (MSC service thread 1-2) Server created on 45902 20:57:42,243 ERROR [stderr] (MSC service thread 1-2) 15/03/24 20:57:42 INFO NettyBlockTransferService: Server created on 45902 20:57:42,247 INFO [org.apache.spark.storage.BlockManagerMaster] (MSC service thread 1-2) Trying to register BlockManager 20:57:42,247 ERROR [stderr] (MSC service thread 1-2) 15/03/24 20:57:42 INFO BlockManagerMaster: Trying to register BlockManager 20:57:42,264 INFO [org.apache.spark.storage.BlockManagerMasterActor] (sparkDriver-akka.actor.default-dispatcher-2) Registering block manager D7271:45902 with 246.0 MB RAM, BlockManagerId(, D7271, 45902) 20:57:42,265 ERROR [stderr] (sparkDriver-akka.actor.default-dispatcher-2) 15/03/24 20:57:42 INFO BlockManagerMasterActor: Registering block manager D7271:45902 with 246.0 MB RAM, BlockManagerId(, D7271, 45902) 20:57:42,272 INFO [org.apache.spark.storage.BlockManagerMaster] (MSC service thread 1-2) Registered BlockManager 20:57:42,276 ERROR [stderr] (MSC service thread 1-2) 15/03/24 20:57:42 INFO BlockManagerMaster: Registered BlockManager 20:57:42,330 ERROR [org.springframework.web.context.ContextLoader] (MSC service thread 1-2) Context initialization failed: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'exampleInitBean' defined in ServletContext resource [/WEB-INF/mvc-dispatcher-servlet.xml]: Invocation of init method failed; nested exception is java.lang.NullPointerException at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1553) [spring-beans-4.0.6.RELEASE.jar:4.0.6.RELEASE] at 
org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:539) [spring-beans-4.0.6.RELEASE.jar:4.0.6.RELEASE] at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:475) [spring-beans-4.0.6.RELEASE.jar:4.0.6.RELEASE] at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:302) [spring-beans-4.0.6.RELEASE.jar:4.0.6.RELEASE] at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:228) [spring-beans-4.0.6.RELEASE.jar:4.0.6.RELEASE] at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:298) [spring-beans-4.0.6.RELEASE.jar:4.0.6.RELEASE] at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:193) [spring-beans-4.0.6.RELEASE.jar:4.0.6.RELEASE] at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:703) [spring-beans-4.0.6.RELEA
[jira] [Created] (SPARK-6523) Error when getting attribute of StandardScalerModel when using Python API
lee.xiaobo.2006 created SPARK-6523: -- Summary: Error when getting attribute of StandardScalerModel when using Python API Key: SPARK-6523 URL: https://issues.apache.org/jira/browse/SPARK-6523 Project: Spark Issue Type: Bug Components: Examples, MLlib, PySpark Affects Versions: 1.3.0 Reporter: lee.xiaobo.2006 test code === from pyspark.mllib.util import MLUtils from pyspark.mllib.linalg import Vectors from pyspark.mllib.feature import StandardScaler conf = SparkConf().setAppName('Test') sc = SparkContext(conf=conf) data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt") label = data.map(lambda x: x.label) features = data.map(lambda x: x.features) scaler1 = StandardScaler().fit(features) print scaler1.std # error sc.stop() --- error: Traceback (most recent call last): File "/data1/s/apps/spark-app/app/test_ssm.py", line 22, in print scaler1.std AttributeError: 'StandardScalerModel' object has no attribute 'std' 15/03/25 12:17:28 INFO Utils: path = /data1/s/apps/spark-1.4.0-SNAPSHOT/data/spark-eb1ed7c0-a5ce-4748-a817-3cb0687ee282/blockmgr-5398b477-127d-4259-a71b-608a324e1cd3, already present as root for deletion. = Another question, how to serialize or save the scaler model? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
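For comparison, a sketch of the equivalent Scala API (an assumption based on 1.3-era MLlib, shown for illustration): the fitted Scala model exposes std and mean directly, which is what the Python wrapper appears to be missing. Assumes a spark-shell session where sc is provided.

{code}
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val features = data.map(_.features)
val model = new StandardScaler(withMean = false, withStd = true).fit(features)
println(model.std)  // accessible in Scala; the PySpark model lacks this attribute
{code}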
[jira] [Comment Edited] (SPARK-6495) DataFrame#insertInto method should support insert rows with sub-columns
[ https://issues.apache.org/jira/browse/SPARK-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379243#comment-14379243 ] Chaozhong Yang edited comment on SPARK-6495 at 3/25/15 4:08 AM: Thanks! Maybe what you point at is the resolved issue https://issues.apache.org/jira/browse/SPARK-3851. Reading data from parquet files with different but compatible schemas has been support in Spark 1.3.0. https://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging was (Author: debugger87): Thanks! Maybe what you point at is the resolved issue https://issues.apache.org/jira/browse/SPARK-3851. Reading data from parquet files with different but compatible schemas has been support in Spark 1.3.0. > DataFrame#insertInto method should support insert rows with sub-columns > --- > > Key: SPARK-6495 > URL: https://issues.apache.org/jira/browse/SPARK-6495 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Chaozhong Yang > > The original table's schema is like this: > |-- a: string (nullable = true) > |-- b: string (nullable = true) > |-- c: string (nullable = true) > |-- d: string (nullable = true) > If we want to insert one row (can be transformed into a DataFrame) with this > schema: > |-- a: string (nullable = true) > |-- b: string (nullable = true) > |-- c: string (nullable = true) > Of course, that operation will fail. Actually, in many cases, people need to > insert new rows with columns which are a subset of the original table columns. > If we can support and fix those issues, Spark SQL's insertion can be more > valuable to users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6495) DataFrame#insertInto method should support insert rows with sub-columns
[ https://issues.apache.org/jira/browse/SPARK-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379243#comment-14379243 ] Chaozhong Yang commented on SPARK-6495: --- Thanks! Maybe what you point at is the resolved issue https://issues.apache.org/jira/browse/SPARK-3851. Reading data from parquet files with different but compatible schemas has been support in Spark 1.3.0. > DataFrame#insertInto method should support insert rows with sub-columns > --- > > Key: SPARK-6495 > URL: https://issues.apache.org/jira/browse/SPARK-6495 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Chaozhong Yang > > The original table's schema is like this: > |-- a: string (nullable = true) > |-- b: string (nullable = true) > |-- c: string (nullable = true) > |-- d: string (nullable = true) > If we want to insert one row (can be transformed into a DataFrame) with this > schema: > |-- a: string (nullable = true) > |-- b: string (nullable = true) > |-- c: string (nullable = true) > Of course, that operation will fail. Actually, in many cases, people need to > insert new rows with columns which are a subset of the original table columns. > If we can support and fix those issues, Spark SQL's insertion can be more > valuable to users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6522) Standardize Random Number Generation
RJ Nowling created SPARK-6522: - Summary: Standardize Random Number Generation Key: SPARK-6522 URL: https://issues.apache.org/jira/browse/SPARK-6522 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: RJ Nowling Priority: Minor Generation of random numbers in Spark has to be handled carefully since references to RNGs copy the state to the workers. As such, a separate RNG needs to be seeded for each partition. Each time random numbers are used in Spark's libraries, the RNG seeding is re-implemented, leaving open the possibility of mistakes. It would be useful if RNG seeding was standardized through utility functions or random number generation functions that can be called in Spark pipelines. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
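A sketch of the per-partition seeding pattern the issue wants standardized (the function name and seed-derivation scheme are illustrative, not a proposed API):

{code}
import scala.util.Random
import org.apache.spark.SparkContext

// Each partition seeds its own RNG from a base seed plus the partition
// index, instead of shipping a single RNG's state to every task.
def seededRandomDoubles(sc: SparkContext, n: Int, seed: Long) =
  sc.parallelize(0 until n).mapPartitionsWithIndex { (partitionId, iter) =>
    val rng = new Random(seed + partitionId)
    iter.map(_ => rng.nextDouble())
  }
{code}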
[jira] [Commented] (SPARK-6521) executors in the same node read local shuffle file
[ https://issues.apache.org/jira/browse/SPARK-6521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379207#comment-14379207 ] Apache Spark commented on SPARK-6521: - User 'viper-kun' has created a pull request for this issue: https://github.com/apache/spark/pull/5178 > executors in the same node read local shuffle file > -- > > Key: SPARK-6521 > URL: https://issues.apache.org/jira/browse/SPARK-6521 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Reporter: xukun > > In the past, an executor read another executor's shuffle files on the same node over > the network. This PR makes executors on the same node read shuffle files locally in > sort-based shuffle. It will reduce network transport. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6521) executors in the same node read local shuffle file
[ https://issues.apache.org/jira/browse/SPARK-6521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xukun updated SPARK-6521: - Summary: executors in the same node read local shuffle file (was: executor in the same node read local shuffle file) > executors in the same node read local shuffle file > -- > > Key: SPARK-6521 > URL: https://issues.apache.org/jira/browse/SPARK-6521 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Reporter: xukun > > In the past, an executor read another executor's shuffle files on the same node over > the network. This PR makes executors on the same node read shuffle files locally in > sort-based shuffle. It will reduce network transport. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6521) executor in the same node read local shuffle file
xukun created SPARK-6521: Summary: executor in the same node read local shuffle file Key: SPARK-6521 URL: https://issues.apache.org/jira/browse/SPARK-6521 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: xukun In the past, an executor read another executor's shuffle files on the same node over the network. This PR makes executors on the same node read shuffle files locally in sort-based shuffle. It will reduce network transport. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6450) Native Parquet reader does not assign table name as qualifier
[ https://issues.apache.org/jira/browse/SPARK-6450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379125#comment-14379125 ] Cheng Lian edited comment on SPARK-6450 at 3/25/15 1:56 AM: [~chinnitv], never mind, reproduced this issue with the 1.3.0 release and the following Spark shell snippet: {noformat} sqlContext.sql("""create table if not exists Orders (Country string, ProductCategory string, PlacedDate string) stored as parquet""") sqlContext.sql(""" select Orders.Country, Orders.ProductCategory, count(1) from Orders join ( select Orders.Country, count(1) CountryOrderCount from Orders where to_date(Orders.PlacedDate) > '2015-01-01' group by Orders.Country order by CountryOrderCount DESC LIMIT 5 ) Top5Countries on Top5Countries.Country = Orders.Country where to_date(Orders.PlacedDate) > '2015-01-01' group by Orders.Country, Orders.ProductCategory """).queryExecution.analyzed {noformat} was (Author: lian cheng): [~chinnitv], never mind, reproduced this issue with the 1.3.0 release and the following Spark shell snippet: {noformat} sqlContext.sql("""create table if not exists Orders (Country string, ProductCategory string, PlacedDate string) stored as parquet""") sqlContext.sql(""" select Orders.Country, Orders.ProductCategory, count(1) from Orders join ( select Orders.Country, count(1) CountryOrderCount from Orders where to_date(Orders.PlacedDate) > '2015-01-01' group by Orders.Country order by CountryOrderCount DESC LIMIT 5 ) Top5Countries on Top5Countries.Country = Orders.Country where to_date(Orders.PlacedDate) > '2015-01-01' group by Orders.Country, Orders.ProductCategory """) {noformat} > Native Parquet reader does not assign table name as qualifier > - > > Key: SPARK-6450 > URL: https://issues.apache.org/jira/browse/SPARK-6450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Anand Mohan Tumuluri >Assignee: Michael Armbrust >Priority: Blocker > > The below query was working fine until 1.3 commit > 9a151ce58b3e756f205c9f3ebbbf3ab0ba5b33fd. (Yes it definitely works at this > commit although this commit is completely unrelated) > It got broken in the 1.3.0 release with an AnalysisException: resolved attributes > ...
missing from (although this list contains the fields which it > reports missing) > {code} > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:189) > at > org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) > at > org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) > at > org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) > at com.sun.proxy.$Proxy17.executeStatementAsync(Unknown Source) > at > org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.t
[jira] [Updated] (SPARK-6520) Kryo serialization broken in the shell
[ https://issues.apache.org/jira/browse/SPARK-6520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Defazio updated SPARK-6520: - Description: If I start spark as follows: {quote} ~/spark-1.3.0-bin-hadoop2.4/bin/spark-shell --master local[1] --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" {quote} Then using :paste, run {quote} case class Example(foo : String, bar : String) val ex = sc.parallelize(List(Example("foo1", "bar1"), Example("foo2", "bar2"))).collect() {quote} I get the error: {quote} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.io.IOException: com.esotericsoftware.kryo.KryoException: Error constructing instance of class: $line3.$read Serialization trace: $VAL10 ($iwC) $outer ($iwC$$iwC) $outer ($iwC$$iwC$Example) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1140) at org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:979) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1873) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) {quote} As far as I can tell, when using :paste, Kryo serialization doesn't work for classes defined within the same paste. It does work when the statements are entered without paste. This issue seems serious to me, since Kryo serialization is virtually mandatory for performance (20x slower with default serialization on my problem), and I'm assuming feature parity between spark-shell and spark-submit is a goal. Note that this is different from SPARK-6497, which covers the case when Kryo is set to require registration.
was: If I start spark as follows: {quote} ~/spark-1.3.0-bin-hadoop2.4/bin/spark-shell --master local[1] --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" {quote} Then using :paste, run {quote} case class Example(foo : String, bar : String) val ex = sc.parallelize(List(Example("foo1", "bar1"), Example("foo2", "bar2"))).collect() {quote} I get the error: {quote} $VAL10 ($iwC) $outer ($iwC$$iwC) $outer ($iwC$$iwC$Example) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1140) at org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:979) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1873) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) {quote} As far as I can tell, when using :paste,
[jira] [Created] (SPARK-6520) Kryo serialization broken in the shell
Aaron Defazio created SPARK-6520: Summary: Kryo serialization broken in the shell Key: SPARK-6520 URL: https://issues.apache.org/jira/browse/SPARK-6520 Project: Spark Issue Type: Bug Affects Versions: 1.3.0 Reporter: Aaron Defazio If I start spark as follows: {quote} ~/spark-1.3.0-bin-hadoop2.4/bin/spark-shell --master local[1] --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" {quote} Then using :paste, run {quote} case class Example(foo : String, bar : String) val ex = sc.parallelize(List(Example("foo1", "bar1"), Example("foo2", "bar2"))).collect() {quote} I get the error: {quote} $VAL10 ($iwC) $outer ($iwC$$iwC) $outer ($iwC$$iwC$Example) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1140) at org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:979) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1873) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) {quote} As far as I can tell, when using :paste, Kryo serialization doesn't work for classes defined within the same paste. It does work when the statements are entered without paste. This issue seems serious to me, since Kryo serialization is virtually mandatory for performance (20x slower with default serialization on my problem), and I'm assuming feature parity between spark-shell and spark-submit is a goal. Note that this is different from SPARK-6497, which covers the case when Kryo is set to require registration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6450) Native Parquet reader does not assign table name as qualifier
[ https://issues.apache.org/jira/browse/SPARK-6450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379125#comment-14379125 ] Cheng Lian commented on SPARK-6450: --- [~chinnitv], never mind, reproduced this issue with the 1.3.0 release and the following Spark shell snippet: {noformat} sqlContext.sql("""create table if not exists Orders (Country string, ProductCategory string, PlacedDate string) stored as parquet""") sqlContext.sql(""" select Orders.Country, Orders.ProductCategory, count(1) from Orders join ( select Orders.Country, count(1) CountryOrderCount from Orders where to_date(Orders.PlacedDate) > '2015-01-01' group by Orders.Country order by CountryOrderCount DESC LIMIT 5 ) Top5Countries on Top5Countries.Country = Orders.Country where to_date(Orders.PlacedDate) > '2015-01-01' group by Orders.Country, Orders.ProductCategory """) {noformat} > Native Parquet reader does not assign table name as qualifier > - > > Key: SPARK-6450 > URL: https://issues.apache.org/jira/browse/SPARK-6450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Anand Mohan Tumuluri >Assignee: Michael Armbrust >Priority: Blocker > > The below query was working fine until 1.3 commit > 9a151ce58b3e756f205c9f3ebbbf3ab0ba5b33fd. (Yes it definitely works at this > commit although this commit is completely unrelated) > It got broken in the 1.3.0 release with an AnalysisException: resolved attributes > ... missing from (although this list contains the fields which it > reports missing) > {code} > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:189) > at > org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) > at > org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) > at > org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) > at com.sun.proxy.$Proxy17.executeStatementAsync(Unknown Source) > at > org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at >
org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > {code} > select Orders.Country, Orders.ProductCategory,count(1) from Orders join > (select Orders.Country, count(1) CountryOrderCount from Orders where > to_date(Orders.PlacedDate) > '2015-01-01' group by Orders.Country order by > CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = > Orders.Country where to_date(Orders.PlacedDate) > '2015-01-01' group by > Orders.Country,Orders.ProductCategory; > {code} > The temporary workaround is to add an explicit alias for the table
[jira] [Commented] (SPARK-6450) Native Parquet reader does not assign table name as qualifier
[ https://issues.apache.org/jira/browse/SPARK-6450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379114#comment-14379114 ] Cheng Lian commented on SPARK-6450: --- [~chinnitv], could you please provide the DDL of the Orders table to help reproduce this issue? > Native Parquet reader does not assign table name as qualifier > - > > Key: SPARK-6450 > URL: https://issues.apache.org/jira/browse/SPARK-6450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Anand Mohan Tumuluri >Assignee: Michael Armbrust >Priority: Blocker > > The below query was working fine until 1.3 commit > 9a151ce58b3e756f205c9f3ebbbf3ab0ba5b33fd. (Yes it definitely works at this > commit although this commit is completely unrelated) > It got broken in the 1.3.0 release with an AnalysisException: resolved attributes > ... missing from (although this list contains the fields which it > reports missing) > {code} > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:189) > at > org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) > at > org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) > at > org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) > at com.sun.proxy.$Proxy17.executeStatementAsync(Unknown Source) > at > org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > {code} > select Orders.Country, Orders.ProductCategory,count(1) from Orders join > (select Orders.Country, count(1) CountryOrderCount from Orders where > to_date(Orders.PlacedDate) > '2015-01-01' group by Orders.Country order by > CountryOrderCount
DESC LIMIT 5) Top5Countries on Top5Countries.Country = > Orders.Country where to_date(Orders.PlacedDate) > '2015-01-01' group by > Orders.Country,Orders.ProductCategory; > {code} > The temporary workaround is to add explicit alias for the table Orders > {code} > select o.Country, o.ProductCategory,count(1) from Orders o join (select > r.Country, count(1) CountryOrderCount from Orders r where > to_date(r.PlacedDate) > '2015-01-01' group by r.Country order by > CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = > o.Country where to_date(o.PlacedDate) > '2015-01-01' group by > o.Country,o.ProductCategory; > {code} > However this change not only affects self joins, it also seems to affect > union queries as well, like the below query which was again working > before(commit 9a151ce) got broken > {code} > select Orders.Country,null,count(1) OrderCount from Orders group by > Orders.Country,null > union all > select null,Orders.ProductCategory,count(1) OrderCount from Orders group by > null, Orders.ProductCategory
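Since the Orders DDL requested above has not yet been posted, the following is a purely hypothetical sketch of a Parquet-backed table with the three columns the failing queries reference (Country, ProductCategory, PlacedDate); the reporter's actual schema and types may differ.

{code}
// Hypothetical repro setup; column types and storage are guesses based solely
// on the queries quoted above, not on the reporter's actual DDL.
hiveContext.sql("""
  CREATE TABLE Orders (
    Country STRING,
    ProductCategory STRING,
    PlacedDate STRING
  )
  STORED AS PARQUET
""")
{code}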
[jira] [Updated] (SPARK-6413) For data source tables, we should provide better output for described formatted.
[ https://issues.apache.org/jira/browse/SPARK-6413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6413: Priority: Major (was: Critical) > For data source tables, we should provide better output for described > formatted. > > > Key: SPARK-6413 > URL: https://issues.apache.org/jira/browse/SPARK-6413 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > > Right now, we will show Hive's stuff like SerDe. Users will be confused when > they see the output of "DESCRIBE FORMATTED" (it is a Hive native command for > now) and think the table is not stored in the "right" format. Actually, the > table is indeed stored in the right format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
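For readers unfamiliar with the command in question, this is roughly how the confusing output is produced; the table name below is a placeholder.

{code}
// DESCRIBE FORMATTED is passed through to Hive as a native command, so for a
// data source table it prints Hive SerDe details that do not reflect the
// table's actual storage format. "my_table" is a placeholder name.
sqlContext.sql("DESCRIBE FORMATTED my_table").collect().foreach(println)
{code}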
[jira] [Updated] (SPARK-6413) For data source tables, we should provide better output for described formatted.
[ https://issues.apache.org/jira/browse/SPARK-6413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6413: Target Version/s: 1.4.0 (was: 1.3.1) > For data source tables, we should provide better output for described > formatted. > > > Key: SPARK-6413 > URL: https://issues.apache.org/jira/browse/SPARK-6413 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > Right now, we will show Hive's stuff like SerDe. Users will be confused when > they see the output of "DESCRIBE FORMATTED" (it is a Hive native command for > now) and think the table is not stored in the "right" format. Actually, the > table is indeed stored in the right format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6413) For data source tables, we should provide better output for described formatted.
[ https://issues.apache.org/jira/browse/SPARK-6413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6413: Description: Right now, we will show Hive's stuff like SerDe. Users will be confused when they see the output of "DESCRIBE FORMATTED" (it is a Hive native command for now) and think the table is not stored in the "right" format. Actually, the table is indeed stored in the right format. (was: Right now, we will show Hive's stuff like SerDe. Users will be confused when they see the output of "DESCRIBE EXTENDED/FORMATTED" and think the table is not stored in the "right" format. Actually, the table is indeed stored in the right format.) > For data source tables, we should provide better output for described > formatted. > > > Key: SPARK-6413 > URL: https://issues.apache.org/jira/browse/SPARK-6413 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > Right now, we will show Hive's stuff like SerDe. Users will be confused when > they see the output of "DESCRIBE FORMATTED" (it is a Hive native command for > now) and think the table is not stored in the "right" format. Actually, the > table is indeed stored in the right format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6413) For data source tables, we should provide better output for described formatted.
[ https://issues.apache.org/jira/browse/SPARK-6413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6413: Summary: For data source tables, we should provide better output for described formatted. (was: For data source tables, we should provide better output for described extended/formatted.) > For data source tables, we should provide better output for described > formatted. > > > Key: SPARK-6413 > URL: https://issues.apache.org/jira/browse/SPARK-6413 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > Right now, we will show Hive's stuff like SerDe. Users will be confused when > they see the output of "DESCRIBE EXTENDED/FORMATTED" and think the table is > not stored in the "right" format. Actually, the table is indeed stored in the > right format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6450) Native Parquet reader does not assign table name as qualifier
[ https://issues.apache.org/jira/browse/SPARK-6450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379086#comment-14379086 ] Michael Armbrust commented on SPARK-6450: - I am having trouble reproducing this issue. Can you explain more how you are creating the table in question? > Native Parquet reader does not assign table name as qualifier > - > > Key: SPARK-6450 > URL: https://issues.apache.org/jira/browse/SPARK-6450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Anand Mohan Tumuluri >Assignee: Michael Armbrust >Priority: Blocker > > The below query was working fine till 1.3 commit > 9a151ce58b3e756f205c9f3ebbbf3ab0ba5b33fd.(Yes it definitely works at this > commit although this commit is completely unrelated) > It got broken in 1.3.0 release with an AnalysisException: resolved attributes > ... missing from (although this list contains the fields which it > reports missing) > {code} > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:189) > at > org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) > at > org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) > at > org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) > at com.sun.proxy.$Proxy17.executeStatementAsync(Unknown Source) > at > org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > {code} > select Orders.Country, Orders.ProductCategory,count(1) from Orders join > (select Orders.Country, count(1) CountryOrderCount from Orders where > to_date(Orders.PlacedDate) > '2015-01-01' group by Orders.Country order by > 
CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = > Orders.Country where to_date(Orders.PlacedDate) > '2015-01-01' group by > Orders.Country,Orders.ProductCategory; > {code} > The temporary workaround is to add explicit alias for the table Orders > {code} > select o.Country, o.ProductCategory,count(1) from Orders o join (select > r.Country, count(1) CountryOrderCount from Orders r where > to_date(r.PlacedDate) > '2015-01-01' group by r.Country order by > CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = > o.Country where to_date(o.PlacedDate) > '2015-01-01' group by > o.Country,o.ProductCategory; > {code} > However this change not only affects self joins, it also seems to affect > union queries as well, like the below query which was again working > before(commit 9a151ce) got broken > {code} > select Orders.Country,null,count(1) OrderCount from Orders group by > Orders.Country,null > union all > select null,Orders.ProductCategory,count(1) OrderCount from Orders group by > null
[jira] [Updated] (SPARK-6519) Add spark.ml API for Hierarchical KMeans
[ https://issues.apache.org/jira/browse/SPARK-6519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6519: - Summary: Add spark.ml API for Hierarchical KMeans (was: Add wrapper classes of DataFrame in spark.ml) > Add spark.ml API for Hierarchical KMeans > > > Key: SPARK-6519 > URL: https://issues.apache.org/jira/browse/SPARK-6519 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yu Ishikawa > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5771) Number of Cores in Completed Applications of Standalone Master Web Page always be 0 if sc.stop() is called
[ https://issues.apache.org/jira/browse/SPARK-5771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379000#comment-14379000 ] Apache Spark commented on SPARK-5771: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/5177 > Number of Cores in Completed Applications of Standalone Master Web Page > always be 0 if sc.stop() is called > -- > > Key: SPARK-5771 > URL: https://issues.apache.org/jira/browse/SPARK-5771 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.1 >Reporter: Liangliang Gu >Assignee: Liangliang Gu >Priority: Minor > > In Standalone mode, the number of cores shown under Completed Applications on the > Master Web Page will always be zero if sc.stop() is called, > but the number will be correct if sc.stop() is not called. > The likely reason: > after sc.stop() is called, the removeExecutor method of class > ApplicationInfo is called for each executor, reducing the variable coresGranted to > zero. The variable coresGranted is used to display the number of Cores on > the Web Page. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
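A minimal sketch of the mechanism described in the report; the field and method shapes below are assumptions for illustration, not the actual ApplicationInfo source.

{code}
// Toy model of the suspected bug: tearing down executors on sc.stop()
// drives coresGranted back to zero, and the UI later reads that field.
class ApplicationInfo {
  var coresGranted: Int = 0

  def addExecutor(cores: Int): Unit = coresGranted += cores

  // Invoked for every executor during shutdown, so a completed application
  // ends up displaying zero cores on the Master Web Page.
  def removeExecutor(cores: Int): Unit = coresGranted -= cores
}
{code}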
[jira] [Updated] (SPARK-6517) Implement the Algorithm of Hierarchical Clustering
[ https://issues.apache.org/jira/browse/SPARK-6517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Ishikawa updated SPARK-6517: --- Summary: Implement the Algorithm of Hierarchical Clustering (was: Implementing the Algorithm) > Implement the Algorithm of Hierarchical Clustering > -- > > Key: SPARK-6517 > URL: https://issues.apache.org/jira/browse/SPARK-6517 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Yu Ishikawa > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6519) Add wrapper classes of DataFrame in spark.ml
Yu Ishikawa created SPARK-6519: -- Summary: Add wrapper classes of DataFrame in spark.ml Key: SPARK-6519 URL: https://issues.apache.org/jira/browse/SPARK-6519 Project: Spark Issue Type: Sub-task Components: ML Reporter: Yu Ishikawa -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6518) Add an example and document for Hierarchical Clustering
Yu Ishikawa created SPARK-6518: -- Summary: Add an example and document for Hierarchical Clustering Key: SPARK-6518 URL: https://issues.apache.org/jira/browse/SPARK-6518 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Yu Ishikawa -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6517) Implementing the Algorithm
Yu Ishikawa created SPARK-6517: -- Summary: Implementing the Algorithm Key: SPARK-6517 URL: https://issues.apache.org/jira/browse/SPARK-6517 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Yu Ishikawa -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6081) DriverRunner doesn't support pulling HTTP/HTTPS URIs
[ https://issues.apache.org/jira/browse/SPARK-6081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6081: - Affects Version/s: 1.0.0 > DriverRunner doesn't support pulling HTTP/HTTPS URIs > > > Key: SPARK-6081 > URL: https://issues.apache.org/jira/browse/SPARK-6081 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 1.0.0 >Reporter: Timothy Chen >Priority: Minor > > According to the docs, standalone cluster mode supports specifying http|https > jar URLs, but the DriverRunner cannot actually pull HTTP URIs passed to it, > because it fetches everything through Hadoop FS get. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
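One plausible shape for a fix, dispatching on the URI scheme so HTTP(S) jars are streamed directly instead of going through Hadoop FS get; everything here is an illustrative assumption, not the actual DriverRunner code.

{code}
import java.net.{URI, URL}
import java.nio.file.{Files, Paths, StandardCopyOption}

// Sketch only: fetch http/https jars over plain HTTP; other schemes would
// still go through the existing Hadoop FileSystem path (elided here).
def fetchJar(uri: String, destDir: String): Unit = {
  val fileName = Paths.get(new URI(uri).getPath).getFileName.toString
  new URI(uri).getScheme match {
    case "http" | "https" =>
      val in = new URL(uri).openStream()
      try Files.copy(in, Paths.get(destDir, fileName), StandardCopyOption.REPLACE_EXISTING)
      finally in.close()
    case _ =>
      ??? // hdfs://, file://, etc.: delegate to the current Hadoop FS logic
  }
}
{code}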
[jira] [Updated] (SPARK-6469) Improving documentation on YARN local directories usage
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6469: - Affects Version/s: 1.0.0 > Improving documentation on YARN local directories usage > --- > > Key: SPARK-6469 > URL: https://issues.apache.org/jira/browse/SPARK-6469 > Project: Spark > Issue Type: Documentation > Components: Documentation, YARN >Affects Versions: 1.0.0 >Reporter: Christophe Préaud >Assignee: Christophe Préaud >Priority: Minor > Fix For: 1.3.1, 1.4.0 > > Attachments: TestYarnVars.scala > > > According to the [Spark YARN doc > page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], > Spark executors will use the local directories configured for YARN, not > {{spark.local.dir}} which should be ignored. > However it should be noted that in yarn-client mode, though the executors > will indeed use the local directories configured for YARN, the driver will > not, because it is not running on the YARN cluster; the driver in yarn-client > will use the local directories defined in {{spark.local.dir}} > Can this please be clarified in the Spark YARN documentation above? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
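To make the distinction concrete, here is an illustrative yarn-client configuration; the directory path and app name are placeholders. Only the driver's scratch space is governed by spark.local.dir here, while executors use whatever local directories YARN was configured with.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// yarn-client mode: the driver runs outside the YARN cluster, so this
// setting affects the driver's scratch directory only; executors ignore it
// in favor of YARN's configured local directories.
val conf = new SparkConf()
  .setMaster("yarn-client")                   // deploy mode under discussion
  .setAppName("local-dirs-demo")              // placeholder
  .set("spark.local.dir", "/mnt/driver-tmp")  // placeholder path
val sc = new SparkContext(conf)
{code}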
[jira] [Updated] (SPARK-6469) Improving documentation on YARN local directories usage
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6469: - Component/s: Documentation > Improving documentation on YARN local directories usage > --- > > Key: SPARK-6469 > URL: https://issues.apache.org/jira/browse/SPARK-6469 > Project: Spark > Issue Type: Documentation > Components: Documentation, YARN >Affects Versions: 1.0.0 >Reporter: Christophe Préaud >Assignee: Christophe Préaud >Priority: Minor > Fix For: 1.3.1, 1.4.0 > > Attachments: TestYarnVars.scala > > > According to the [Spark YARN doc > page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], > Spark executors will use the local directories configured for YARN, not > {{spark.local.dir}} which should be ignored. > However it should be noted that in yarn-client mode, though the executors > will indeed use the local directories configured for YARN, the driver will > not, because it is not running on the YARN cluster; the driver in yarn-client > will use the local directories defined in {{spark.local.dir}} > Can this please be clarified in the Spark YARN documentation above? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6469) Improving documentation on YARN local directories usage
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-6469. Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Assignee: Christophe Préaud Target Version/s: 1.3.1, 1.4.0 > Improving documentation on YARN local directories usage > --- > > Key: SPARK-6469 > URL: https://issues.apache.org/jira/browse/SPARK-6469 > Project: Spark > Issue Type: Documentation > Components: Documentation, YARN >Affects Versions: 1.0.0 >Reporter: Christophe Préaud >Assignee: Christophe Préaud >Priority: Minor > Fix For: 1.3.1, 1.4.0 > > Attachments: TestYarnVars.scala > > > According to the [Spark YARN doc > page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], > Spark executors will use the local directories configured for YARN, not > {{spark.local.dir}} which should be ignored. > However it should be noted that in yarn-client mode, though the executors > will indeed use the local directories configured for YARN, the driver will > not, because it is not running on the YARN cluster; the driver in yarn-client > will use the local directories defined in {{spark.local.dir}} > Can this please be clarified in the Spark YARN documentation above? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6512) Add contains to OpenHashMap
[ https://issues.apache.org/jira/browse/SPARK-6512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6512. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5171 [https://github.com/apache/spark/pull/5171] > Add contains to OpenHashMap > --- > > Key: SPARK-6512 > URL: https://issues.apache.org/jira/browse/SPARK-6512 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Minor > Fix For: 1.4.0 > > > Add `contains` to test whether a key exists in an OpenHashMap. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
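To illustrate the semantics being added (a key-presence test that avoids materializing the value), here is a toy open-addressing map; the real OpenHashMap in Spark core differs in layout and performance details, so treat this purely as a sketch, not the merged patch from PR 5171.

{code}
// Toy open-addressing map with linear probing, just to show what `contains`
// on such a structure does.
class ToyOpenHashMap[K, V](capacity: Int = 64) {
  private val keys   = new Array[Any](capacity)
  private val values = new Array[Any](capacity)
  private val used   = new Array[Boolean](capacity)

  private def pos(k: K): Int = {
    var i = ((k.hashCode % capacity) + capacity) % capacity
    while (used(i) && keys(i) != k) i = (i + 1) % capacity // linear probing
    i
  }

  def update(k: K, v: V): Unit = {
    val i = pos(k); keys(i) = k; values(i) = v; used(i) = true
  }

  // The operation SPARK-6512 adds: test for a key without fetching its value.
  def contains(k: K): Boolean = used(pos(k))
}
{code}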
[jira] [Updated] (SPARK-6515) Use while(true) in OpenHashSet.getPos
[ https://issues.apache.org/jira/browse/SPARK-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6515: - Priority: Minor (was: Major) > Use while(true) in OpenHashSet.getPos > - > > Key: SPARK-6515 > URL: https://issues.apache.org/jira/browse/SPARK-6515 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6515) Use while(true) in OpenHashSet.getPos
[ https://issues.apache.org/jira/browse/SPARK-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6515: - Description: Though I don't see any bug in the existing code, using `while (true)` makes the code read better. > Use while(true) in OpenHashSet.getPos > - > > Key: SPARK-6515 > URL: https://issues.apache.org/jira/browse/SPARK-6515 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Minor > > Though I don't see any bug in the existing code, using `while (true)` makes > the code read better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
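A simplified stand-in for the loop style the ticket proposes; the real OpenHashSet.getPos uses quadratic probing and bitmask arithmetic, so this linear-probing version is an illustration only.

{code}
// Probing loop written in the `while (true)` style: each branch either
// returns or advances, which reads more directly than a bounded for-loop.
def getPos(keys: Array[Int], used: Array[Boolean], key: Int): Int = {
  var pos = ((key % keys.length) + keys.length) % keys.length
  while (true) {
    if (!used(pos)) return -1          // key absent (an "INVALID_POS")
    if (keys(pos) == key) return pos   // key found
    pos = (pos + 1) % keys.length      // keep probing
  }
  -1 // unreachable; satisfies the type checker
}
{code}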
[jira] [Updated] (SPARK-6515) Use while(true) in OpenHashSet.getPos
[ https://issues.apache.org/jira/browse/SPARK-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6515: - Summary: Use while(true) in OpenHashSet.getPos (was: OpenHashSet returns invalid position when the data size is 1) > Use while(true) in OpenHashSet.getPos > - > > Key: SPARK-6515 > URL: https://issues.apache.org/jira/browse/SPARK-6515 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6515) OpenHashSet returns invalid position when the data size is 1
[ https://issues.apache.org/jira/browse/SPARK-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6515: - Target Version/s: 1.4.0 (was: 1.1.2, 1.2.2, 1.3.1, 1.4.0) > OpenHashSet returns invalid position when the data size is 1 > > > Key: SPARK-6515 > URL: https://issues.apache.org/jira/browse/SPARK-6515 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6515) OpenHashSet returns invalid position when the data size is 1
[ https://issues.apache.org/jira/browse/SPARK-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378967#comment-14378967 ] Apache Spark commented on SPARK-6515: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/5176 > OpenHashSet returns invalid position when the data size is 1 > > > Key: SPARK-6515 > URL: https://issues.apache.org/jira/browse/SPARK-6515 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6515) OpenHashSet returns invalid position when the data size is 1
[ https://issues.apache.org/jira/browse/SPARK-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6515: - Issue Type: Improvement (was: Bug) > OpenHashSet returns invalid position when the data size is 1 > > > Key: SPARK-6515 > URL: https://issues.apache.org/jira/browse/SPARK-6515 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6515) OpenHashSet returns invalid position when the data size is 1
[ https://issues.apache.org/jira/browse/SPARK-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6515: - Affects Version/s: (was: 1.2.1) (was: 1.3.0) (was: 1.1.1) > OpenHashSet returns invalid position when the data size is 1 > > > Key: SPARK-6515 > URL: https://issues.apache.org/jira/browse/SPARK-6515 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-5771) Number of Cores in Completed Applications of Standalone Master Web Page always be 0 if sc.stop() is called
[ https://issues.apache.org/jira/browse/SPARK-5771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reopened SPARK-5771: -- > Number of Cores in Completed Applications of Standalone Master Web Page > always be 0 if sc.stop() is called > -- > > Key: SPARK-5771 > URL: https://issues.apache.org/jira/browse/SPARK-5771 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.1 >Reporter: Liangliang Gu >Assignee: Liangliang Gu >Priority: Minor > > In Standalone mode, the number of cores shown under Completed Applications on the > Master Web Page will always be zero if sc.stop() is called, > but the number will be correct if sc.stop() is not called. > The likely reason: > after sc.stop() is called, the removeExecutor method of class > ApplicationInfo is called for each executor, reducing the variable coresGranted to > zero. The variable coresGranted is used to display the number of Cores on > the Web Page. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5771) Number of Cores in Completed Applications of Standalone Master Web Page always be 0 if sc.stop() is called
[ https://issues.apache.org/jira/browse/SPARK-5771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5771: - Fix Version/s: (was: 1.4.0) > Number of Cores in Completed Applications of Standalone Master Web Page > always be 0 if sc.stop() is called > -- > > Key: SPARK-5771 > URL: https://issues.apache.org/jira/browse/SPARK-5771 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.1 >Reporter: Liangliang Gu >Assignee: Liangliang Gu >Priority: Minor > > In Standalone mode, the number of cores shown under Completed Applications on the > Master Web Page will always be zero if sc.stop() is called, > but the number will be correct if sc.stop() is not called. > The likely reason: > after sc.stop() is called, the removeExecutor method of class > ApplicationInfo is called for each executor, reducing the variable coresGranted to > zero. The variable coresGranted is used to display the number of Cores on > the Web Page. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6502) HiveThriftServer2 fails to inspect underlying Hive version when compiled against Hive 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-6502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6502: -- Shepherd: Cheng Lian > HiveThriftServer2 fails to inspect underlying Hive version when compiled > against Hive 0.12.0 > > > Key: SPARK-6502 > URL: https://issues.apache.org/jira/browse/SPARK-6502 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.3.0 >Reporter: Cheng Lian > > While initializing the {{SparkContext}} in {{HiveThriftServer2}}, underlying > Hive version is set into the {{SparkConf}} as {{spark.sql.hive.version}}, so > that users can query the Hive version via {{SET spark.sql.hive.version;}}. > When compiled against Hive 0.12.0, this server replies {{}} when > users query this property. > Hive 0.13.1 is fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
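For context, this is roughly how a user would observe the bug from a JDBC client; the connection URL and credentials below are placeholders, and the exact result-set layout of a SET command may vary.

{code}
import java.sql.DriverManager

// Querying the property through HiveThriftServer2 (placeholder URL/user).
// A server built against Hive 0.12.0 currently reports an empty value for
// spark.sql.hive.version instead of "0.12.0"; Hive 0.13.1 builds are fine.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
val rs = conn.createStatement().executeQuery("SET spark.sql.hive.version")
while (rs.next()) println(rs.getString(1))
conn.close()
{code}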
[jira] [Updated] (SPARK-6505) Remove the reflection call in HiveFunctionWrapper
[ https://issues.apache.org/jira/browse/SPARK-6505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6505: -- Shepherd: Cheng Lian > Remove the reflection call in HiveFunctionWrapper > - > > Key: SPARK-6505 > URL: https://issues.apache.org/jira/browse/SPARK-6505 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.2.1, 1.3.0 >Reporter: Cheng Lian >Assignee: Cheng Hao >Priority: Minor > > While trying to fix SPARK-4785, we introduced {{HiveFunctionWrapper}}, and > added two not so necessary reflection calls there. These calls had caused > some dependency hell problems for MapR distribution of Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6503) Create Jenkins builder for testing Spark SQL with Hive 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-6503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6503: -- Shepherd: Cheng Lian > Create Jenkins builder for testing Spark SQL with Hive 0.12.0 > - > > Key: SPARK-6503 > URL: https://issues.apache.org/jira/browse/SPARK-6503 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.2.0, 1.3.0 >Reporter: Cheng Lian > > Currently, from the perspective of Spark SQL, the {{dev/run-tests}} script > does the following steps to build and test Spark SQL: > # Builds Spark SQL against Hive 0.12.0 to check for compilation errors > # Cleans the build > # Builds Spark SQL against Hive 0.13.1 to check for compilation errors. > # Runs unit tests against Hive 0.13.1 > Apparently, Spark SQL with Hive 0.12.0 is not tested. > Two improvements could be done here: > # When executed, {{dev/run-tests}} should always build and test a single > version of Hive. The version could be passed in as environment variable. > # Separate Jenkins builders should be set up to test Hive 0.12.0 code paths. > We probably only want the PR builder run against Hive 0.13.1 to minimize > build time, and make the master builder take care of both Hive versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6507) Create separate Hive Driver instance for each SQL query in HiveThriftServer2
[ https://issues.apache.org/jira/browse/SPARK-6507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6507: -- Shepherd: Cheng Lian > Create separate Hive Driver instance for each SQL query in HiveThriftServer2 > > > Key: SPARK-6507 > URL: https://issues.apache.org/jira/browse/SPARK-6507 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 >Reporter: Cheng Lian > > In the current implementation of HiveThriftServer2, Hive {{Driver}} instances > are cached and reused among queries. However, {{Driver}} is not thread-safe, > which may lead to race conditions. In SPARK-4908, we synchronized > {{HiveContext.runHive}} to avoid this issue, but this affects concurrency > negatively, because no two native commands can be executed concurrently. This > is pretty bad for heavy commands like ANALYZE. > Please refer to [this > comment|https://issues.apache.org/jira/browse/SPARK-4908?focusedCommentId=14264469&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14264469] > in SPARK-4908 for details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
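The trade-off described above, sketched side by side; {{Driver}} below is a placeholder standing in for Hive's non-thread-safe org.apache.hadoop.hive.ql.Driver.

{code}
// Placeholder for Hive's Driver; the real class is not thread-safe.
class Driver { def run(sql: String): Unit = () }

// Current approach (per SPARK-4908): one shared instance, every native
// command serialized behind a lock, so heavy commands like ANALYZE block
// all other queries.
object SharedDriver {
  private val driver = new Driver
  def runHive(sql: String): Unit = synchronized { driver.run(sql) }
}

// Proposed approach: a fresh Driver per query, trading a little allocation
// for genuine concurrency.
object PerQueryDriver {
  def runHive(sql: String): Unit = new Driver().run(sql)
}
{code}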
[jira] [Updated] (SPARK-6387) HTTP mode of HiveThriftServer2 doesn't work when built with Hive 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-6387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6387: -- Shepherd: Cheng Lian > HTTP mode of HiveThriftServer2 doesn't work when built with Hive 0.12.0 > --- > > Key: SPARK-6387 > URL: https://issues.apache.org/jira/browse/SPARK-6387 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.2.1, 1.3.0 >Reporter: Cheng Lian > > Reproduction steps: > # Compile Spark against Hive 0.12.0 > {noformat}$ ./build/sbt > -Pyarn,hadoop-2.4,hive,hive-thriftserver,hive-0.12.0,scala-2.10 > -Dhadoop.version=2.4.1 clean assembly/assembly{noformat} > # Start the Thrift server in HTTP mode > Add the following stanza in {{hive-site.xml}}: > {noformat} > hive.server2.transport.mode > http > {noformat} > and > {noformat}$ ./bin/start-thriftserver.sh{noformat} > # Connect to the Thrift server via Beeline > {noformat}$ ./bin/beeline -u > "jdbc:hive2://localhost:10001/default?hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice"{noformat} > # Execute any query and check the server log > We can see that no query execution related logs are output. > The reason is that, when running under HTTP mode, although we pass in a > {{SparkSQLCLIService}} instance > ([here|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L102]) > to {{ThriftHttpCLIService}}, Hive 0.12.0 just ignores it, and instantiate a > new {{CLIService}} > ([here|https://github.com/apache/hive/blob/release-0.12.0/service/src/java/org/apache/hive/service/cli/thrift/ThriftHttpCLIService.java#L91-L92] > and > [here|https://github.com/apache/hive/blob/release-0.12.0/service/src/java/org/apache/hive/service/cli/thrift/EmbeddedThriftBinaryCLIService.java#L32]). > Notice that while compiling against Hive 0.13.1, Spark SQL doesn't suffer > from this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6501) Blacklist Hive 0.13.1 specific tests when compiled against Hive 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-6501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6501: -- Shepherd: Cheng Lian > Blacklist Hive 0.13.1 specific tests when compiled against Hive 0.12.0 > -- > > Key: SPARK-6501 > URL: https://issues.apache.org/jira/browse/SPARK-6501 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.2.0, 1.2.1, 1.3.0 >Reporter: Cheng Lian > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6109) Unit tests fail when compiled against Hive 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-6109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6109: -- Shepherd: Cheng Lian Assignee: (was: Cheng Lian) > Unit tests fail when compiled against Hive 0.12.0 > - > > Key: SPARK-6109 > URL: https://issues.apache.org/jira/browse/SPARK-6109 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Cheng Lian > > Currently, Jenkins doesn't run unit tests against Hive 0.12.0, and several > Hive 0.13.1 specific test cases always fail against Hive 0.12.0. Need to > blacklist them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6516) Coupling between default Hadoop versions in Spark build vs. ec2 scripts
Joseph K. Bradley created SPARK-6516: Summary: Coupling between default Hadoop versions in Spark build vs. ec2 scripts Key: SPARK-6516 URL: https://issues.apache.org/jira/browse/SPARK-6516 Project: Spark Issue Type: Improvement Components: Build, EC2 Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor When we change default Hadoop versions in the Spark build conf and/or in the EC2 scripts, we should keep the two in sync. (When they are out of sync, users may be surprised if they create an EC2 cluster, compile Spark on it, and try to run that version of Spark.) Making sure this is set in a single place would be great. An even better fix might be for the Spark build to check what is available and adjust the default based on that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6413) For data source tables, we should provide better output for described extended/formatted.
[ https://issues.apache.org/jira/browse/SPARK-6413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6413: Assignee: Yin Huai > For data source tables, we should provide better output for described > extended/formatted. > - > > Key: SPARK-6413 > URL: https://issues.apache.org/jira/browse/SPARK-6413 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > Right now, we will show Hive's stuff like SerDe. Users will be confused when > they see the output of "DESCRIBE EXTENDED/FORMATTED" and think the table is > not stored in the "right" format. Actually, the table is indeed stored in the > right format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6450) Native Parquet reader does not assign table name as qualifier
[ https://issues.apache.org/jira/browse/SPARK-6450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-6450: --- Assignee: Michael Armbrust (was: Cheng Lian) > Native Parquet reader does not assign table name as qualifier > - > > Key: SPARK-6450 > URL: https://issues.apache.org/jira/browse/SPARK-6450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Anand Mohan Tumuluri >Assignee: Michael Armbrust >Priority: Blocker > > The below query was working fine till 1.3 commit > 9a151ce58b3e756f205c9f3ebbbf3ab0ba5b33fd.(Yes it definitely works at this > commit although this commit is completely unrelated) > It got broken in 1.3.0 release with an AnalysisException: resolved attributes > ... missing from (although this list contains the fields which it > reports missing) > {code} > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:189) > at > org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) > at > org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) > at > org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) > at com.sun.proxy.$Proxy17.executeStatementAsync(Unknown Source) > at > org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > {code} > select Orders.Country, Orders.ProductCategory,count(1) from Orders join > (select Orders.Country, count(1) CountryOrderCount from Orders where > to_date(Orders.PlacedDate) > '2015-01-01' group by Orders.Country order by > CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = > Orders.Country where 
to_date(Orders.PlacedDate) > '2015-01-01' group by > Orders.Country,Orders.ProductCategory; > {code} > The temporary workaround is to add explicit alias for the table Orders > {code} > select o.Country, o.ProductCategory,count(1) from Orders o join (select > r.Country, count(1) CountryOrderCount from Orders r where > to_date(r.PlacedDate) > '2015-01-01' group by r.Country order by > CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = > o.Country where to_date(o.PlacedDate) > '2015-01-01' group by > o.Country,o.ProductCategory; > {code} > However this change not only affects self joins, it also seems to affect > union queries as well, like the below query which was again working > before(commit 9a151ce) got broken > {code} > select Orders.Country,null,count(1) OrderCount from Orders group by > Orders.Country,null > union all > select null,Orders.ProductCategory,count(1) OrderCount from Orders group by > null, Orders.ProductCategory > {code} > also fails with a Analysis exception. > The workaround is to add different a
[jira] [Closed] (SPARK-3570) Shuffle write time does not include time to open shuffle files
[ https://issues.apache.org/jira/browse/SPARK-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3570. Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Target Version/s: 1.3.1, 1.4.0 > Shuffle write time does not include time to open shuffle files > -- > > Key: SPARK-3570 > URL: https://issues.apache.org/jira/browse/SPARK-3570 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.9.2, 1.0.2, 1.1.0 >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout > Fix For: 1.3.1, 1.4.0 > > Attachments: 3a_1410854905_0_job_log_waterfall.pdf, > 3a_1410957857_0_job_log_waterfall.pdf > > > Currently, the reported shuffle write time does not include time to open the > shuffle files. This time can be very significant when the disk is highly > utilized and many shuffle files exist on the machine (I'm not sure how severe > this is in 1.0 onward -- since shuffle files are automatically deleted, this > may be less of an issue because there are fewer old files sitting around). > In experiments I did, in extreme cases, adding the time to open files can > increase the shuffle write time from 5ms (of a 2 second task) to 1 second. > We should fix this for better performance debugging. > Thanks [~shivaram] for helping to diagnose this problem. cc [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
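The gist of the fix, reduced to a toy example: the timing span has to start before the output file is opened, because opening can dominate when the disk is heavily utilized. This is illustrative user-level code, not the actual shuffle writer.

{code}
import java.io.{File, FileOutputStream}

// Start the clock before opening the file so the (potentially large) open
// cost is counted inside the measured write time.
def timedWrite(path: File, bytes: Array[Byte]): Long = {
  val start = System.nanoTime()
  val out = new FileOutputStream(path) // open cost now inside the span
  try out.write(bytes) finally out.close()
  System.nanoTime() - start
}
{code}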
[jira] [Updated] (SPARK-6088) UI is malformed when tasks fetch remote results
[ https://issues.apache.org/jira/browse/SPARK-6088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6088: - Affects Version/s: 1.3.0 > UI is malformed when tasks fetch remote results > --- > > Key: SPARK-6088 > URL: https://issues.apache.org/jira/browse/SPARK-6088 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout > Fix For: 1.3.1, 1.4.0 > > Attachments: Screenshot 2015-02-28 18.24.42.png > > > There are three issues when tasks get remote results: > (1) The status never changes from GET_RESULT to SUCCEEDED > (2) The time to get the result is shown as the absolute time (resulting in a > non-sensical output that says getting the result took >1 million hours) > rather than the elapsed time > (3) The getting result time is included as part of the scheduler delay > cc [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6088) UI is malformed when tasks fetch remote results
[ https://issues.apache.org/jira/browse/SPARK-6088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-6088. Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Target Version/s: 1.3.1, 1.4.0 (was: 1.3.0) > UI is malformed when tasks fetch remote results > --- > > Key: SPARK-6088 > URL: https://issues.apache.org/jira/browse/SPARK-6088 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout > Fix For: 1.3.1, 1.4.0 > > Attachments: Screenshot 2015-02-28 18.24.42.png > > > There are three issues when tasks get remote results: > (1) The status never changes from GET_RESULT to SUCCEEDED > (2) The time to get the result is shown as the absolute time (resulting in a > non-sensical output that says getting the result took >1 million hours) > rather than the elapsed time > (3) The getting result time is included as part of the scheduler delay > cc [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6430) Cannot resolve column correctly when using left semi join
[ https://issues.apache.org/jira/browse/SPARK-6430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-6430: --- Assignee: Michael Armbrust > Cannot resolve column correctlly when using left semi join > -- > > Key: SPARK-6430 > URL: https://issues.apache.org/jira/browse/SPARK-6430 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: Spark 1.3.0 on yarn mode >Reporter: zzc >Assignee: Michael Armbrust > > My code: > {quote} > case class TestData(key: Int, value: String) > case class TestData2(a: Int, b: Int) > import org.apache.spark.sql.execution.joins._ > import sqlContext.implicits._ > val testData = sc.parallelize( > (1 to 100).map(i => TestData(i, i.toString))).toDF() > testData.registerTempTable("testData") > val testData2 = sc.parallelize( > TestData2(1, 1) :: > TestData2(1, 2) :: > TestData2(2, 1) :: > TestData2(2, 2) :: > TestData2(3, 1) :: > TestData2(3, 2) :: Nil, 2).toDF() > testData2.registerTempTable("testData2") > //val tmp = sqlContext.sql("SELECT * FROM testData *LEFT SEMI JOIN* testData2 > ON key = a ") > val tmp = sqlContext.sql("SELECT testData2.b, count(testData2.b) FROM > testData *LEFT SEMI JOIN* testData2 ON key = testData2.a group by > testData2.b") > tmp.explain() > {quote} > Error log: > {quote} > org.apache.spark.sql.AnalysisException: cannot resolve 'testData2.b' given > input columns key, value; line 1 pos 108 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > {quote} > {quote}SELECT * FROM testData LEFT SEMI JOIN testData2 ON key = a{quote} is > correct, > {quote} > SELECT a FROM testData LEFT SEMI JOIN testData2 ON key = a > SELECT max(value) FROM testData LEFT SEMI JOIN testData2 ON key = a group by b > SELECT max(value) FROM testData LEFT SEMI JOIN testData2 ON key = testData2.a > group by testData2.b > SELECT testData2.b, count(testData2.b) FROM testData LEFT SEMI JOIN testData2 > ON key = testData2.a group by testData2.b > {quote} are incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6152) Spark does not support Java 8 compiled Scala classes
[ https://issues.apache.org/jira/browse/SPARK-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378918#comment-14378918 ] Martin Grotzke commented on SPARK-6152: --- Btw, we just released kryo 3.0.1: https://github.com/EsotericSoftware/kryo/blob/master/CHANGES.md#2240---300-2014-0-04 > Spark does not support Java 8 compiled Scala classes > > > Key: SPARK-6152 > URL: https://issues.apache.org/jira/browse/SPARK-6152 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.1 > Environment: Java 8+ > Scala 2.11 >Reporter: Ronald Chen >Priority: Minor > > Spark uses reflectasm to check Scala closures, which fails if the *user > defined Scala closures* are compiled to the Java 8 class file version. > The cause is that reflectasm does not support Java 8: > https://github.com/EsotericSoftware/reflectasm/issues/35 > Workaround: > Don't compile Scala classes to the Java 8 class file version; Scala 2.11 neither supports nor > requires any Java 8 features > Stack trace: > {code} > java.lang.IllegalArgumentException > at > com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.(Unknown > Source) > at > com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.(Unknown > Source) > at > com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.(Unknown > Source) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$getClassReader(ClosureCleaner.scala:41) > at > org.apache.spark.util.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:84) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:107) > at org.apache.spark.SparkContext.clean(SparkContext.scala:1478) > at org.apache.spark.rdd.RDD.map(RDD.scala:288) > at ...my Scala 2.11 compiled to Java 8 code calling into spark > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
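The workaround above, expressed as a one-line build.sbt sketch; Scala 2.11 emits pre-Java-8 class files by default, so this only matters for builds that explicitly opted into jvm-1.8.

{code}
// build.sbt sketch: keep emitting Java 7 class files so reflectasm can still
// parse the compiled closures.
scalacOptions += "-target:jvm-1.7"
{code}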
[jira] [Commented] (SPARK-6510) Add Graph#minus method to act as Set#difference
[ https://issues.apache.org/jira/browse/SPARK-6510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378913#comment-14378913 ] Apache Spark commented on SPARK-6510: - User 'brennonyork' has created a pull request for this issue: https://github.com/apache/spark/pull/5175 > Add Graph#minus method to act as Set#difference > --- > > Key: SPARK-6510 > URL: https://issues.apache.org/jira/browse/SPARK-6510 > Project: Spark > Issue Type: Improvement >Reporter: Brennon York > > Right now GraphX does not have a Set#difference method to operate on > VertexIds. We do however have a {{diff}} method although that works on > values. Given the optimizations of tombstoning already present this method > can be implemented in a very efficient manner. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378877#comment-14378877 ] Yu Ishikawa commented on SPARK-2429: I got it. Thanks! > Hierarchical Implementation of KMeans > - > > Key: SPARK-2429 > URL: https://issues.apache.org/jira/browse/SPARK-2429 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: Yu Ishikawa >Priority: Minor > Labels: clustering > Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The > Result of Benchmarking a Hierarchical Clustering.pdf, > benchmark-result.2014-10-29.html, benchmark2.html > > > Hierarchical clustering algorithms are widely used and would make a nice > addition to MLlib. Clustering algorithms are useful for determining > relationships between clusters as well as offering faster assignment. > Discussion on the dev list suggested the following possible approaches: > * Top down, recursive application of KMeans > * Reuse DecisionTree implementation with different objective function > * Hierarchical SVD > It was also suggested that support for distance metrics other than Euclidean > such as negative dot or cosine are necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5508) Arrays and Maps stored with Hive Parquet Serde may not be readable by the Parquet support in the Data Source API
[ https://issues.apache.org/jira/browse/SPARK-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378848#comment-14378848 ] Ryan Blue commented on SPARK-5508: -- [~yhuai], we've been working to standardize nested types lately so we now have a committed spec for how lists and maps should be written. In addition, we've identified all of the ways they have been represented in Parquet and set some backward-compatibility rules. The Parquet issue is PARQUET-113 and you can look at the rules in [the spec|https://github.com/apache/incubator-parquet-format/blob/master/LogicalTypes.md]. I implemented those rules in Hive, parquet-avro, and parquet-thrift, so feel free to ping me if you have questions. > Arrays and Maps stored with Hive Parquet Serde may not be able to read by the > Parquet support in the Data Souce API > --- > > Key: SPARK-5508 > URL: https://issues.apache.org/jira/browse/SPARK-5508 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.1 > Environment: mesos, cdh >Reporter: Ayoub Benali >Assignee: Cheng Lian >Priority: Critical > Labels: hivecontext, parquet > > *The root cause of this bug is explained below ([see > here|https://issues.apache.org/jira/secure/EditComment!default.jspa?id=12771559&commentId=14368505]).* > > *The workaround of this issue is to set the following confs* > {code} > sql("set spark.sql.parquet.useDataSourceApi=false") > sql("set spark.sql.hive.convertMetastoreParquet=false") > {code} > *Below is the original description.* > When the table is saved as parquet, we cannot query a field which is an array > of struct after an INSERT statement, like show bellow: > {noformat} > scala> val data1="""{ > | "timestamp": 1422435598, > | "data_array": [ > | { > | "field1": 1, > | "field2": 2 > | } > | ] > | }""" > scala> val data2="""{ > | "timestamp": 1422435598, > | "data_array": [ > | { > | "field1": 3, > | "field2": 4 > | } > | ] > scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil) > scala> val rdd = hiveContext.jsonRDD(jsonRDD) > scala> rdd.printSchema > root > |-- data_array: array (nullable = true) > ||-- element: struct (containsNull = false) > |||-- field1: integer (nullable = true) > |||-- field2: integer (nullable = true) > |-- timestamp: integer (nullable = true) > scala> rdd.registerTempTable("tmp_table") > scala> hiveContext.sql("select data.field1 from tmp_table LATERAL VIEW > explode(data_array) nestedStuff AS data").collect > res3: Array[org.apache.spark.sql.Row] = Array([1], [3]) > scala> hiveContext.sql("SET hive.exec.dynamic.partition = true") > scala> hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict") > scala> hiveContext.sql("set parquet.compression=GZIP") > scala> hiveContext.setConf("spark.sql.parquet.binaryAsString", "true") > scala> hiveContext.sql("create external table if not exists > persisted_table(data_array ARRAY >, > timestamp INT) STORED AS PARQUET Location 'hdfs:///test_table'") > scala> hiveContext.sql("insert into table persisted_table select * from > tmp_table").collect > scala> hiveContext.sql("select data.field1 from persisted_table LATERAL VIEW > explode(data_array) nestedStuff AS data").collect > parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in > file hdfs://*/test_table/part-1 > at > parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) > at > parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) > at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.c
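For reference, the committed layout that the spec linked above settles on represents a list with a three-level structure. Quoting the shape from the Parquet LogicalTypes document (reproduced from memory, so treat the exact identifiers as approximate):
{noformat}
<list-repetition> group <name> (LIST) {
  repeated group list {
    <element-repetition> <element-type> element;
  }
}
{noformat}
Older writers, including the Hive serde, used one- and two-level layouts instead; those are the representations the backward-compatibility rules in the spec are meant to keep readable.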
[jira] [Commented] (SPARK-6481) Set "In Progress" when a PR is opened for an issue
[ https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378838#comment-14378838 ] Nicholas Chammas commented on SPARK-6481: - Since there is no guaranteed way to map GitHub usernames to JIRA usernames, what should we do about the JIRA assignee? A JIRA issue needs an assignee in order to be marked "In Progress". We can have the script: # always assign the issue to the Apache Spark user # keep it assigned to whoever has it assigned, if any (this may be different from the PR user) # in the case of no current assignee, assign to Apache Spark just to mark the JIRA in progress, then remove assignee Any preferences [~marmbrus] / [~pwendell]? > Set "In Progress" when a PR is opened for an issue > -- > > Key: SPARK-6481 > URL: https://issues.apache.org/jira/browse/SPARK-6481 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Michael Armbrust >Assignee: Nicholas Chammas > > [~pwendell] and I are not sure if this is possible, but it would be really > helpful if the JIRA status was updated to "In Progress" when we do the > linking to an open pull request. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6476) Spark fileserver not started on the same IP as spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rares Vernica resolved SPARK-6476. -- Resolution: Not a Problem After investigating more and trying the suggested change, I think this is not a problem. The server is not listening on a specific IP, so any of the IPs can be used to access the server. The reported spark.fileserver.uri lists only one of the IPs. The original problem reported on the mailing list is not caused by this; it was due to a misconfiguration of the input file used in the Spark job. > Spark fileserver not started on the same IP as spark.driver.host > -- > > Key: SPARK-6476 > URL: https://issues.apache.org/jira/browse/SPARK-6476 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.1 >Reporter: Rares Vernica > > I initially inquired about this here: > http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E > If the Spark driver host has multiple IPs and spark.driver.host is set to one > of them, I would expect the fileserver to start on the same IP. I checked > HttpServer, and the jetty Server is started on the default IP of the machine: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75 > Something like this might work instead: > {code:title=HttpServer.scala#L75} > val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), > 0)) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
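For what it's worth, a variant of the change suggested in the description that keeps the current wildcard behavior when spark.driver.host is unset could look like this (a sketch only, not tested against HttpServer):
{code}
import java.net.InetSocketAddress
import org.eclipse.jetty.server.Server

// Bind to spark.driver.host when it is set; otherwise fall back to the
// existing behavior of binding the wildcard address on an ephemeral port.
val address = conf.getOption("spark.driver.host")
  .map(host => new InetSocketAddress(host, 0))
  .getOrElse(new InetSocketAddress(0))
val server = new Server(address)
{code}
Here conf is assumed to be the SparkConf already available inside HttpServer.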
[jira] [Created] (SPARK-6515) OpenHashSet returns invalid position when the data size is 1
Xiangrui Meng created SPARK-6515: Summary: OpenHashSet returns invalid position when the data size is 1 Key: SPARK-6515 URL: https://issues.apache.org/jira/browse/SPARK-6515 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0, 1.2.1, 1.1.1 Reporter: Xiangrui Meng Assignee: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
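The report has no description yet; a guess at the shape of a failing check, based only on the summary (OpenHashSet is private[spark], so this would have to live in Spark's own test tree, and the exact call that misbehaves is an assumption):
{code}
import org.apache.spark.util.collection.OpenHashSet

// Hypothetical check: a set holding exactly one element should still
// report a valid position for that element.
val set = new OpenHashSet[Long](1)
set.add(999L)
assert(set.contains(999L))
assert(set.getPos(999L) != OpenHashSet.INVALID_POS)
{code}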
[jira] [Commented] (SPARK-6385) ISO 8601 timestamp parsing does not support arbitrary precision second fractions
[ https://issues.apache.org/jira/browse/SPARK-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378836#comment-14378836 ] Nick Bruun commented on SPARK-6385: --- Strictly speaking, the ISO 8601 standard does not define a fixed precision for decimal fractions of seconds (or minutes or hours, for that matter). Many sources of JSON data will output second fractions with greater than millisecond precision (whether that precision is actually meaningful is a different matter), so in my opinion, Spark should at least support this (and also shorter notations, where trailing zeros have been trimmed), if not the entire ISO 8601 date/time standard, although that *is* probably erring on the side of pedantry. Alternatively, this could be implemented as a standalone library, but that raises the question of library dependencies in Spark. > ISO 8601 timestamp parsing does not support arbitrary precision second > fractions > > > Key: SPARK-6385 > URL: https://issues.apache.org/jira/browse/SPARK-6385 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.1 >Reporter: Nick Bruun >Priority: Minor > > The ISO 8601 timestamp parsing implemented as a resolution to SPARK-4149 does > not support arbitrary precision fractions of seconds, only millisecond > precision. Parsing {{2015-02-02T00:00:07.900GMT-00:00}} will succeed, while > {{2015-02-02T00:00:07.9000GMT-00:00}} will fail. > The issue is caused by the fixed precision of the parsed format in > [DataTypeConversions.scala#L66|https://github.com/apache/spark/blob/84acd08e0886aa23195f35837c15c09aa7804aff/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataTypeConversions.scala#L66]. > I'm willing to implement a fix, but pointers on the direction would be > appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
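One low-tech direction (my assumption, not a committed design) would be to normalize the fraction to exactly three digits before handing the string to the existing millisecond-precision parser:
{code}
// Truncate or right-pad the fractional-second part to three digits so that
// "07.9", "07.900" and "07.9000" all parse identically (sub-millisecond
// digits are dropped).
val fraction = """\.(\d+)""".r

def normalizeFraction(timestamp: String): String =
  fraction.replaceAllIn(timestamp, m => "." + (m.group(1) + "00").take(3))

// normalizeFraction("2015-02-02T00:00:07.9000GMT-00:00")
// => "2015-02-02T00:00:07.900GMT-00:00"
{code}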
[jira] [Created] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself
Chris Fregly created SPARK-6514: --- Summary: For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself Key: SPARK-6514 URL: https://issues.apache.org/jira/browse/SPARK-6514 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Chris Fregly This was not supported when I originally wrote this receiver; it is now supported. Also, upgrade to the latest Kinesis Client Library (KCL), which is 1.2, I believe. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
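For context, a sketch of what pinning the KCL to a single region might look like on the receiver side (the constructor is from the public KCL API; withRegionName is how I recall KCL 1.2 exposing the region, and the application/stream/worker names are placeholders):
{code}
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration

// The region set here should apply to the Kinesis stream and to the
// DynamoDB checkpoint table alike, which is what this issue asks for.
val kclConfig = new KinesisClientLibConfiguration(
    "myKinesisApp", "myStream", new DefaultAWSCredentialsProviderChain(), "worker-1")
  .withRegionName("us-west-2")
{code}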
[jira] [Updated] (SPARK-6513) Add zipWithUniqueId (and other RDD APIs) to RDDApi
[ https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eran Medan updated SPARK-6513: -- Description: It would be nice if we could treat a DataFrame just like an RDD (wherever it makes sense) *Worked in 1.2.1* {code} val sqlContext = new HiveContext(sc) import sqlContext._ val jsonRDD = sqlContext.jsonFile(jsonFilePath) jsonRDD.registerTempTable("jsonTable") val jsonResult = sql(s"select * from jsonTable") val foo = jsonResult.zipWithUniqueId().map { case (Row(...), uniqueId) => // do something useful ... } foo.registerTempTable("...") {code} *Stopped working in 1.3.0* {code} jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method {code} **Not working workaround:** although this might give me an {{RDD\[Row\]}}: {code} jsonResult.rdd.zipWithUniqueId() {code} Now this obviously won't work, since {{RDD\[Row\]}} does not have a {{registerTempTable}} method {code} foo.registerTempTable("...") {code} (see related SO question: http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3) EDIT: changed from issue to enhancement request was: It would be nice if we could treat a DataFrame just like an RDD (wherever it makes sense) *Worked in 1.2.1* {code} val sqlContext = new HiveContext(sc) import sqlContext._ val jsonRDD = sqlContext.jsonFile(jsonFilePath) jsonRDD.registerTempTable("jsonTable") val jsonResult = sql(s"select * from jsonTable") val foo = jsonResult.zipWithUniqueId().map { case (Row(...), uniqueId) => // do something useful ... } foo.registerTempTable("...") {code} *Stopped working in 1.3.0* {code} jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method {code} **Not working workaround:** although this might give me an {{RDD\[Row\]}}: {code} jsonResult.map(identity).zipWithUniqueId() {code} Now this obviously won't work, since {{RDD\[Row\]}} does not have a {{registerTempTable}} method {code} foo.registerTempTable("...") {code} (see related SO question: http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3) EDIT: changed from issue to enhancement request > Add zipWithUniqueId (and other RDD APIs) to RDDApi > -- > > Key: SPARK-6513 > URL: https://issues.apache.org/jira/browse/SPARK-6513 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 > Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I > don't think it's relevant) >Reporter: Eran Medan >Priority: Minor > > It would be nice if we could treat a DataFrame just like an RDD (wherever it > makes sense) > *Worked in 1.2.1* > {code} > val sqlContext = new HiveContext(sc) > import sqlContext._ > val jsonRDD = sqlContext.jsonFile(jsonFilePath) > jsonRDD.registerTempTable("jsonTable") > val jsonResult = sql(s"select * from jsonTable") > val foo = jsonResult.zipWithUniqueId().map { >case (Row(...), uniqueId) => // do something useful >... 
> } > foo.registerTempTable("...") > {code} > *Stopped working in 1.3.0* > {code} > jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method > {code} > **Not working workaround:** > although this might give me an {{RDD\[Row\]}}: > {code} > jsonResult.rdd.zipWithUniqueId() > {code} > Now this obviously won't work, since {{RDD\[Row\]}} does not have a > {{registerTempTable}} method > {code} > foo.registerTempTable("...") > {code} > (see related SO question: > http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3) > EDIT: changed from issue to enhancement request -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
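A workaround for the zipWithUniqueId case above that does appear to hold together in 1.3 (a sketch; it assumes the schema survives the round trip and that appending a Long id column is acceptable) is to zip on the underlying RDD and rebuild the DataFrame:
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Zip on the RDD[Row] underneath the DataFrame, append the id to each row,
// then rebuild a DataFrame so that registerTempTable is available again.
val zipped = jsonResult.rdd.zipWithUniqueId().map {
  case (row, uniqueId) => Row.fromSeq(row.toSeq :+ uniqueId)
}
val schema = StructType(jsonResult.schema.fields :+
  StructField("uniqueId", LongType, nullable = false))
val withIds = sqlContext.createDataFrame(zipped, schema)
withIds.registerTempTable("jsonWithIds")
{code}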
[jira] [Assigned] (SPARK-6481) Set "In Progress" when a PR is opened for an issue
[ https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6481: --- Assignee: Nicholas Chammas (was: Apache Spark) > Set "In Progress" when a PR is opened for an issue > -- > > Key: SPARK-6481 > URL: https://issues.apache.org/jira/browse/SPARK-6481 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Michael Armbrust >Assignee: Nicholas Chammas > > [~pwendell] and I are not sure if this is possible, but it would be really > helpful if the JIRA status was updated to "In Progress" when we do the > linking to an open pull request. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6385) ISO 8601 timestamp parsing does not support arbitrary precision second fractions
[ https://issues.apache.org/jira/browse/SPARK-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378816#comment-14378816 ] Michael Armbrust commented on SPARK-6385: - Oh, I see. Looks like this is actually correct, but not what you want: http://stackoverflow.com/questions/12000673/string-date-conversion-with-nanoseconds. Is the format you are describing part of the standard? I'm not opposed to us doing something custom (assuming it's well tested) if we have to, but I'd like to avoid adding too many non-standard semantics. > ISO 8601 timestamp parsing does not support arbitrary precision second > fractions > > > Key: SPARK-6385 > URL: https://issues.apache.org/jira/browse/SPARK-6385 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.1 >Reporter: Nick Bruun >Priority: Minor > > The ISO 8601 timestamp parsing implemented as a resolution to SPARK-4149 does > not support arbitrary precision fractions of seconds, only millisecond > precision. Parsing {{2015-02-02T00:00:07.900GMT-00:00}} will succeed, while > {{2015-02-02T00:00:07.9000GMT-00:00}} will fail. > The issue is caused by the fixed precision of the parsed format in > [DataTypeConversions.scala#L66|https://github.com/apache/spark/blob/84acd08e0886aa23195f35837c15c09aa7804aff/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataTypeConversions.scala#L66]. > I'm willing to implement a fix, but pointers on the direction would be > appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6513) Add zipWithUniqueId (and other RDD APIs) to RDDApi
[ https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eran Medan updated SPARK-6513: -- Summary: Add zipWithUniqueId (and other RDD APIs) to RDDApi (was: Add zipWithUniqueId (and other RDD APIs) to RDDApi.scala) > Add zipWithUniqueId (and other RDD APIs) to RDDApi > -- > > Key: SPARK-6513 > URL: https://issues.apache.org/jira/browse/SPARK-6513 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 > Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I > don't think it's relevant) >Reporter: Eran Medan >Priority: Minor > > It would be nice if we could treat a DataFrame just like an RDD (wherever it > makes sense) > *Worked in 1.2.1* > {code} > val sqlContext = new HiveContext(sc) > import sqlContext._ > val jsonRDD = sqlContext.jsonFile(jsonFilePath) > jsonRDD.registerTempTable("jsonTable") > val jsonResult = sql(s"select * from jsonTable") > val foo = jsonResult.zipWithUniqueId().map { >case (Row(...), uniqueId) => // do something useful >... > } > foo.registerTempTable("...") > {code} > *Stopped working in 1.3.0* > {code} > jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method > {code} > **Not working workaround:** > although this might give me an {{RDD\[Row\]}}: > {code} > jsonResult.map(identity).zipWithUniqueId() > {code} > Now this obviously won't work, since {{RDD\[Row\]}} does not have a > {{registerTempTable}} method > {code} > foo.registerTempTable("...") > {code} > (see related SO question: > http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3) > EDIT: changed from issue to enhancement request -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6513) Add zipWithUniqueId (and other RDD APIs) to RDDApi.scala
[ https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eran Medan updated SPARK-6513: -- Description: It would be nice if we could treat a DataFrame just like an RDD (wherever it makes sense) *Worked in 1.2.1* {code} val sqlContext = new HiveContext(sc) import sqlContext._ val jsonRDD = sqlContext.jsonFile(jsonFilePath) jsonRDD.registerTempTable("jsonTable") val jsonResult = sql(s"select * from jsonTable") val foo = jsonResult.zipWithUniqueId().map { case (Row(...), uniqueId) => // do something useful ... } foo.registerTempTable("...") {code} *Stopped working in 1.3.0* {code} jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method {code} **Not working workaround:** although this might give me an {{RDD\[Row\]}}: {code} jsonResult.map(identity).zipWithUniqueId() {code} Now this obviously won't work, since {{RDD\[Row\]}} does not have a {{registerTempTable}} method {code} foo.registerTempTable("...") {code} (see related SO question: http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3) EDIT: changed from issue to enhancement request was: I'm sure this has an Issue somewhere but I can't find it. I see this is not a regression bug, since it compiled in 1.2.1 but stopped in 1.3 without any earlier deprecation warnings, but I am sure the authors are well aware, so please change it to an enhancement request if you disagree this is a regression. It's such an obvious and blunt regression that I doubt it was done without a lot of thought and I'm sure there was a good reason, but still it breaks my code and I don't have a workaround :) Here are the details / steps to reproduce *Worked in 1.2.1* (without any deprecation warnings) {code} val sqlContext = new HiveContext(sc) import sqlContext._ val jsonRDD = sqlContext.jsonFile(jsonFilePath) jsonRDD.registerTempTable("jsonTable") val jsonResult = sql(s"select * from jsonTable") val foo = jsonResult.zipWithUniqueId().map { case (Row(...), uniqueId) => // do something useful ... } foo.registerTempTable("...") {code} *Stopped working in 1.3.0* (simply does not compile, and all I did was change to 1.3) {code} jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method {code} **Not working workaround:** although this might give me an {{RDD\[Row\]}}: {code} jsonResult.map(identity).zipWithUniqueId() {code} Now this obviously won't work, since {{RDD\[Row\]}} does not have a {{registerTempTable}} method {code} foo.registerTempTable("...") {code} (see related SO question: http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3) > Add zipWithUniqueId (and other RDD APIs) to RDDApi.scala > > > Key: SPARK-6513 > URL: https://issues.apache.org/jira/browse/SPARK-6513 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 > Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I > don't think it's relevant) >Reporter: Eran Medan >Priority: Minor > > It would be nice if we could treat a DataFrame just like an RDD (wherever it > makes sense) > *Worked in 1.2.1* > {code} > val sqlContext = new HiveContext(sc) > import sqlContext._ > val jsonRDD = sqlContext.jsonFile(jsonFilePath) > jsonRDD.registerTempTable("jsonTable") > val jsonResult = sql(s"select * from jsonTable") > val foo = jsonResult.zipWithUniqueId().map { >case (Row(...), uniqueId) => // do something useful >... 
> } > foo.registerTempTable("...") > {code} > *Stopped working in 1.3.0* > {code} > jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method > {code} > **Not working workaround:** > although this might give me an {{RDD\[Row\]}}: > {code} > jsonResult.map(identity).zipWithUniqueId() > {code} > Now this obviously won't work, since {{RDD\[Row\]}} does not have a > {{registerTempTable}} method > {code} > foo.registerTempTable("...") > {code} > (see related SO question: > http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3) > EDIT: changed from issue to enhancement request -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6513) Regression - missing zipWithUniqueId (and other RDD APIs) in RDDApi.scala
[ https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eran Medan updated SPARK-6513: -- Description: I'm sure this has an Issue somewhere but I can't find it. I see this is not a regression bug, since it compiled in 1.2.1 but stopped in 1.3 without any earlier deprecation warnings, but I am sure the authors are well aware, so please change it to an enhancement request if you disagree this is a regression. It's such an obvious and blunt regression that I doubt it was done without a lot of thought and I'm sure there was a good reason, but still it breaks my code and I don't have a workaround :) Here are the details / steps to reproduce *Worked in 1.2.1* (without any deprecation warnings) {code} val sqlContext = new HiveContext(sc) import sqlContext._ val jsonRDD = sqlContext.jsonFile(jsonFilePath) jsonRDD.registerTempTable("jsonTable") val jsonResult = sql(s"select * from jsonTable") val foo = jsonResult.zipWithUniqueId().map { case (Row(...), uniqueId) => // do something useful ... } foo.registerTempTable("...") {code} *Stopped working in 1.3.0* (simply does not compile, and all I did was change to 1.3) {code} jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method {code} **Not working workaround:** although this might give me an {{RDD\[Row\]}}: {code} jsonResult.map(identity).zipWithUniqueId() {code} Now this obviously won't work, since {{RDD\[Row\]}} does not have a {{registerTempTable}} method {code} foo.registerTempTable("...") {code} (see related SO question: http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3) was: I'm sure this has an Issue somewhere but I can't find it. I see this as a regression bug, since it compiled in 1.2.1 but stopped in 1.3 without any earlier deprecation warnings, but I am sure the authors are well aware, so please change it to an enhancement request if you disagree this is a regression. It's such an obvious and blunt regression that I doubt it was done without a lot of thought and I'm sure there was a good reason, but still it breaks my code and I don't have a workaround :) Here are the details / steps to reproduce *Worked in 1.2.1* (without any deprecation warnings) {code} val sqlContext = new HiveContext(sc) import sqlContext._ val jsonRDD = sqlContext.jsonFile(jsonFilePath) jsonRDD.registerTempTable("jsonTable") val jsonResult = sql(s"select * from jsonTable") val foo = jsonResult.zipWithUniqueId().map { case (Row(...), uniqueId) => // do something useful ... 
} foo.registerTempTable("...") {code} *Stopped working in 1.3.0* (simply does not compile, and all I did was change to 1.3) {code} jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method {code} **Not working workaround:** although this might give me an {{RDD\[Row\]}}: {code} jsonResult.map(identity).zipWithUniqueId() {code} Now this obviously won't work, since {{RDD\[Row\]}} does not have a {{registerTempTable}} method {code} foo.registerTempTable("...") {code} (see related SO question: http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3) > Regression - missing zipWithUniqueId (and other RDD APIs) in RDDApi.scala > - > > Key: SPARK-6513 > URL: https://issues.apache.org/jira/browse/SPARK-6513 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 > Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I > don't think it's relevant) >Reporter: Eran Medan >Priority: Minor > > I'm sure this has an Issue somewhere but I can't find it. > I see this is not a regression bug, since it compiled in 1.2.1 but > stopped in 1.3 without any earlier deprecation warnings, but I am sure the > authors are well aware, so please change it to an enhancement request if you > disagree this is a regression. It's such an obvious and blunt regression that > I doubt it was done without a lot of thought and I'm sure there was a good > reason, but still it breaks my code and I don't have a workaround :) > Here are the details / steps to reproduce > *Worked in 1.2.1* (without any deprecation warnings) > {code} > val sqlContext = new HiveContext(sc) > import sqlContext._ > val jsonRDD = sqlContext.jsonFile(jsonFilePath) > jsonRDD.registerTempTable("jsonTable") > val jsonResult = sql(s"select * from jsonTable") > val foo = jsonResult.zipWithUniqueId().map { >case (Row(...), uniqueId) => // do something useful >... > } > foo.registerTempTable("...
[jira] [Updated] (SPARK-6513) Regression - missing zipWithUniqueId (and other RDD APIs) in RDDApi.scala
[ https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eran Medan updated SPARK-6513: -- Issue Type: Improvement (was: Bug) > Regression - missing zipWithUniqueId (and other RDD APIs) in RDDApi.scala > - > > Key: SPARK-6513 > URL: https://issues.apache.org/jira/browse/SPARK-6513 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 > Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I > don't think it's relevant) >Reporter: Eran Medan >Priority: Blocker > > I'm sure this has an Issue somewhere but I can't find it. > I see this as a regression bug, since it compiled in 1.2.1 but stopped in 1.3 > without any earlier deprecation warnings, but I am sure the authors are well > aware, so please change it to an enhancement request if you disagree this is > a regression. It's such an obvious and blunt regression that I doubt it was > done without a lot of thought and I'm sure there was a good reason, but still > it breaks my code and I don't have a workaround :) > Here are the details / steps to reproduce > *Worked in 1.2.1* (without any deprecation warnings) > {code} > val sqlContext = new HiveContext(sc) > import sqlContext._ > val jsonRDD = sqlContext.jsonFile(jsonFilePath) > jsonRDD.registerTempTable("jsonTable") > val jsonResult = sql(s"select * from jsonTable") > val foo = jsonResult.zipWithUniqueId().map { >case (Row(...), uniqueId) => // do something useful >... > } > foo.registerTempTable("...") > {code} > *Stopped working in 1.3.0* (simply does not compile, and all I did was change > to 1.3) > {code} > jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method > {code} > **Not working workaround:** > although this might give me an {{RDD\[Row\]}}: > {code} > jsonResult.map(identity).zipWithUniqueId() > {code} > Now this obviously won't work, since {{RDD\[Row\]}} does not have a > {{registerTempTable}} method > {code} > foo.registerTempTable("...") > {code} > (see related SO question: > http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6513) Add zipWithUniqueId (and other RDD APIs) to RDDApi.scala
[ https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eran Medan updated SPARK-6513: -- Summary: Add zipWithUniqueId (and other RDD APIs) to RDDApi.scala (was: Regression - missing zipWithUniqueId (and other RDD APIs) in RDDApi.scala) > Add zipWithUniqueId (and other RDD APIs) to RDDApi.scala > > > Key: SPARK-6513 > URL: https://issues.apache.org/jira/browse/SPARK-6513 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 > Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I > don't think it's relevant) >Reporter: Eran Medan >Priority: Minor > > I'm sure this has an Issue somewhere but I can't find it. > I see this is not a regression bug, since it compiled in 1.2.1 but > stopped in 1.3 without any earlier deprecation warnings, but I am sure the > authors are well aware, so please change it to an enhancement request if you > disagree this is a regression. It's such an obvious and blunt regression that > I doubt it was > done without a lot of thought and I'm sure there was a good > reason, but still it breaks my code and I don't have a workaround :) > Here are the details / steps to reproduce > *Worked in 1.2.1* (without any deprecation warnings) > {code} > val sqlContext = new HiveContext(sc) > import sqlContext._ > val jsonRDD = sqlContext.jsonFile(jsonFilePath) > jsonRDD.registerTempTable("jsonTable") > val jsonResult = sql(s"select * from jsonTable") > val foo = jsonResult.zipWithUniqueId().map { >case (Row(...), uniqueId) => // do something useful >... > } > foo.registerTempTable("...") > {code} > *Stopped working in 1.3.0* (simply does not compile, and all I did was change > to 1.3) > {code} > jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method > {code} > **Not working workaround:** > although this might give me an {{RDD\[Row\]}}: > {code} > jsonResult.map(identity).zipWithUniqueId() > {code} > Now this obviously won't work, since {{RDD\[Row\]}} does not have a > {{registerTempTable}} method > {code} > foo.registerTempTable("...") > {code} > (see related SO question: > http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6513) Regression - missing zipWithUniqueId (and other RDD APIs) in RDDApi.scala
[ https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eran Medan updated SPARK-6513: -- Priority: Minor (was: Blocker) > Regression - missing zipWithUniqueId (and other RDD APIs) in RDDApi.scala > - > > Key: SPARK-6513 > URL: https://issues.apache.org/jira/browse/SPARK-6513 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 > Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I > don't think it's relevant) >Reporter: Eran Medan >Priority: Minor > > I'm sure this has an Issue somewhere but I can't find it. > I see this as a regression bug, since it compiled in 1.2.1 but stopped in 1.3 > without any earlier deprecation warnings, but I am sure the authors are well > aware, so please change it to an enhancement request if you disagree this is > a regression. It's such an obvious and blunt regression that I doubt it was > done without a lot of thought and I'm sure there was a good reason, but still > it breaks my code and I don't have a workaround :) > Here are the details / steps to reproduce > *Worked in 1.2.1* (without any deprecation warnings) > {code} > val sqlContext = new HiveContext(sc) > import sqlContext._ > val jsonRDD = sqlContext.jsonFile(jsonFilePath) > jsonRDD.registerTempTable("jsonTable") > val jsonResult = sql(s"select * from jsonTable") > val foo = jsonResult.zipWithUniqueId().map { >case (Row(...), uniqueId) => // do something useful >... > } > foo.registerTempTable("...") > {code} > *Stopped working in 1.3.0* (simply does not compile, and all I did was change > to 1.3) > {code} > jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method > {code} > **Not working workaround:** > although this might give me an {{RDD\[Row\]}}: > {code} > jsonResult.map(identity).zipWithUniqueId() > {code} > Now this obviously won't work, since {{RDD\[Row\]}} does not have a > {{registerTempTable}} method > {code} > foo.registerTempTable("...") > {code} > (see related SO question: > http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6385) ISO 8601 timestamp parsing does not support arbitrary precision second fractions
[ https://issues.apache.org/jira/browse/SPARK-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378801#comment-14378801 ] Nick Bruun commented on SPARK-6385: --- An extra {{S}} does not seem to do the trick, as the resulting date ({{res2}}) is incorrect ({{:16}} rather than {{:07}}, since {{SimpleDateFormat}} treats the parsed fraction as a raw millisecond count, so a fraction of 9000 adds 9 seconds). I've looked through a series of libraries, and all seem to be doing it in the same way ({{SSS}} and that's it), so I'm considering writing a proper parser instead. What is the position on having this level of complexity in Spark? > ISO 8601 timestamp parsing does not support arbitrary precision second > fractions > > > Key: SPARK-6385 > URL: https://issues.apache.org/jira/browse/SPARK-6385 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.1 >Reporter: Nick Bruun >Priority: Minor > > The ISO 8601 timestamp parsing implemented as a resolution to SPARK-4149 does > not support arbitrary precision fractions of seconds, only millisecond > precision. Parsing {{2015-02-02T00:00:07.900GMT-00:00}} will succeed, while > {{2015-02-02T00:00:07.9000GMT-00:00}} will fail. > The issue is caused by the fixed precision of the parsed format in > [DataTypeConversions.scala#L66|https://github.com/apache/spark/blob/84acd08e0886aa23195f35837c15c09aa7804aff/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataTypeConversions.scala#L66]. > I'm willing to implement a fix, but pointers on the direction would be > appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6481) Set "In Progress" when a PR is opened for an issue
[ https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6481: --- Assignee: Apache Spark (was: Nicholas Chammas) > Set "In Progress" when a PR is opened for an issue > -- > > Key: SPARK-6481 > URL: https://issues.apache.org/jira/browse/SPARK-6481 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Michael Armbrust >Assignee: Apache Spark > > [~pwendell] and I are not sure if this is possible, but it would be really > helpful if the JIRA status was updated to "In Progress" when we do the > linking to an open pull request. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6209) ExecutorClassLoader can leak connections after failing to load classes from the REPL class server
[ https://issues.apache.org/jira/browse/SPARK-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378788#comment-14378788 ] Apache Spark commented on SPARK-6209: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/5174 > ExecutorClassLoader can leak connections after failing to load classes from > the REPL class server > - > > Key: SPARK-6209 > URL: https://issues.apache.org/jira/browse/SPARK-6209 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0, 1.0.3, 1.1.2, 1.2.1, 1.3.0, 1.4.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.3.1, 1.4.0 > > > ExecutorClassLoader does not ensure proper cleanup of network connections > that it opens. If it fails to load a class, it may leak partially-consumed > InputStreams that are connected to the REPL's HTTP class server, causing that > server to exhaust its thread pool, which can cause the entire job to hang. > Here is a simple reproduction: > With > {code} > ./bin/spark-shell --master local-cluster[8,8,512] > {code} > run the following command: > {code} > sc.parallelize(1 to 1000, 1000).map { x => > try { > Class.forName("some.class.that.does.not.Exist") > } catch { > case e: Exception => // do nothing > } > x > }.count() > {code} > This job will run 253 tasks, then will completely freeze without any errors > or failed tasks. > It looks like the driver has 253 threads blocked in socketRead0() calls: > {code} > [joshrosen ~]$ jstack 16765 | grep socketRead0 | wc > 253 759 14674 > {code} > e.g. > {code} > "qtp1287429402-13" daemon prio=5 tid=0x7f868a1c nid=0x5b03 runnable > [0x0001159bd000] >java.lang.Thread.State: RUNNABLE > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.read(SocketInputStream.java:152) > at java.net.SocketInputStream.read(SocketInputStream.java:122) > at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391) > at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141) > at > org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227) > at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1044) > at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280) > at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) > at > org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) > at > org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) > at java.lang.Thread.run(Thread.java:745) > {code} > Jstack on the executors shows blocking in loadClass / findClass, where a > single thread is RUNNABLE and waiting to hear back from the driver and other > executor threads are BLOCKED on object monitor synchronization at > Class.forName0(). > Remotely triggering a GC on a hanging executor allows the job to progress and > complete more tasks before hanging again. 
If I repeatedly trigger GC on all > of the executors, then the job runs to completion: > {code} > jps | grep CoarseGra | cut -d ' ' -f 1 | xargs -I {} -n 1 -P100 jcmd {} GC.run > {code} > The culprit is a {{catch}} block that ignores all exceptions and performs no > cleanup: > https://github.com/apache/spark/blob/v1.2.0/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L94 > This bug has been present since Spark 1.0.0, but I suspect that we haven't > seen it before because it's pretty hard to reproduce. Triggering this error > requires a job with tasks that trigger ClassNotFoundExceptions yet are still > able to run to completion. It also requires that executors are able to leak > enough open connections to exhaust the class server's Jetty thread pool > limit, which requires that there are a large number of tasks (253+) and > either a large number of executors or a very low amount of GC pressure on > those executors (since GC will cause the leaked connections to be closed). > The fix here is pretty simple: add proper resource cleanup to this class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
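The shape of the fix that the last paragraph calls for, reduced to a generic pattern (a sketch of the idea only, not the contents of the linked pull request):
{code}
import java.io.InputStream

// Close the possibly partially-consumed stream on every exit path instead
// of swallowing the exception and leaking the connection.
def readClassBytes(open: () => InputStream,
                   read: InputStream => Array[Byte]): Option[Array[Byte]] = {
  var in: InputStream = null
  try {
    in = open()
    Some(read(in))
  } catch {
    case _: Exception => None  // let the parent class loader try instead
  } finally {
    if (in != null) {
      try in.close() catch { case _: Exception => }  // best-effort close
    }
  }
}
{code}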
[jira] [Updated] (SPARK-6513) Regression - missing zipWithUniqueId (and other RDD APIs) in RDDApi.scala
[ https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eran Medan updated SPARK-6513: -- Summary: Regression - missing zipWithUniqueId (and other RDD APIs) in RDDApi.scala (was: Regression - Adding zipWithUniqueId (and other missing RDD APIs) to RDDApi.scala) > Regression - missing zipWithUniqueId (and other RDD APIs) in RDDApi.scala > - > > Key: SPARK-6513 > URL: https://issues.apache.org/jira/browse/SPARK-6513 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I > don't think it's relevant) >Reporter: Eran Medan >Priority: Blocker > > I'm sure this has an Issue somewhere but I can't find it. > I see this as a regression bug, since it compiled in 1.2.1 but stopped in 1.3 > without any earlier deprecation warnings, but I am sure the authors are well > aware, so please change it to an enhancement request if you disagree this is > a regression. It's such an obvious and blunt regression that I doubt it was > done without a lot of thought and I'm sure there was a good reason, but still > it breaks my code and I don't have a workaround :) > Here are the details / steps to reproduce > *Worked in 1.2.1* (without any deprecation warnings) > {code} > val sqlContext = new HiveContext(sc) > import sqlContext._ > val jsonRDD = sqlContext.jsonFile(jsonFilePath) > jsonRDD.registerTempTable("jsonTable") > val jsonResult = sql(s"select * from jsonTable") > val foo = jsonResult.zipWithUniqueId().map { >case (Row(...), uniqueId) => // do something useful >... > } > foo.registerTempTable("...") > {code} > *Stopped working in 1.3.0* (simply does not compile, and all I did was change > to 1.3) > {code} > jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method > {code} > **Not working workaround:** > although this might give me an {{RDD\[Row\]}}: > {code} > jsonResult.map(identity).zipWithUniqueId() > {code} > Now this obviously won't work, since {{RDD\[Row\]}} does not have a > {{registerTempTable}} method > {code} > foo.registerTempTable("...") > {code} > (see related SO question: > http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6513) Regression - Adding zipWithUniqueId (and other missing RDD APIs) to RDDApi.scala
[ https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eran Medan updated SPARK-6513: -- Description: I'm sure this has an Issue somewhere but I can't find it. I see this as a regression bug, since it compiled in 1.2.1 but stopped in 1.3 without any earlier deprecation warnings, but I am sure the authors are well aware, so please change it to an enhancement request if you disagree this is a regression. It's such an obvious and blunt regression that I doubt it was done without a lot of thought and I'm sure there was a good reason, but still it breaks my code and I don't have a workaround :) Here are the details / steps to reproduce *Worked in 1.2.1* (without any deprecation warnings) {code} val sqlContext = new HiveContext(sc) import sqlContext._ val jsonRDD = sqlContext.jsonFile(jsonFilePath) jsonRDD.registerTempTable("jsonTable") val jsonResult = sql(s"select * from jsonTable") val foo = jsonResult.zipWithUniqueId().map { case (Row(...), uniqueId) => // do something useful ... } foo.registerTempTable("...") {code} *Stopped working in 1.3.0* (simply does not compile, and all I did was change to 1.3) {code} jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method {code} **Not working workaround:** although this might give me an {{RDD\[Row\]}}: {code} jsonResult.map(identity).zipWithUniqueId() {code} Now this obviously won't work, since {{RDD\[Row\]}} does not have a {{registerTempTable}} method {code} foo.registerTempTable("...") {code} (see related SO question: http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3) was: I'm sure this has an Issue somewhere but I can't find it. I see this as a regression bug, since it compiled in 1.2.1 but stopped in 1.3 without any earlier deprecation warnings, but I am sure the authors are well aware, so please change it to an enhancement request if you disagree this is a regression. It's such an obvious and blunt regression that I doubt it was done without a lot of thought and I'm sure there was a good reason, but still it breaks my code and I don't have a workaround :) Here are the details / steps to reproduce **Worked in 1.2.1** (without any deprecation warnings) val sqlContext = new HiveContext(sc) import sqlContext._ val jsonRDD = sqlContext.jsonFile(jsonFilePath) jsonRDD.registerTempTable("jsonTable") val jsonResult = sql(s"select * from jsonTable") val foo = jsonResult.zipWithUniqueId().map { case (Row(...), uniqueId) => // do something useful ... } foo.registerTempTable("...") **Stopped working in 1.3.0** (simply does not compile, and all I did was change to 1.3) jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method **Not working workaround:** although this might give me an RDD[Row]: jsonResult.map(identity).zipWithUniqueId() Now this won't work, as `RDD[Row]` does not have a `registerTempTable` method foo.registerTempTable("...") (see related SO question: http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3) > Regression - Adding zipWithUniqueId (and other missing RDD APIs) to > RDDApi.scala > > > Key: SPARK-6513 > URL: https://issues.apache.org/jira/browse/SPARK-6513 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I > don't think it's relevant) >Reporter: Eran Medan >Priority: Blocker > > I'm sure this has an Issue somewhere but I can't find it. 
> I see this as a regression bug, since it compiled in 1.2.1 but stopped in 1.3 > without any earlier deprecation warnings, but I am sure the authors are well > aware, so please change it to an enhancement request if you disagree this is > a regression. It's such an obvious and blunt regression that I doubt it was > done without a lot of thought and I'm sure there was a good reason, but still > it breaks my code and I don't have a workaround :) > Here are the details / steps to reproduce > *Worked in 1.2.1* (without any deprecation warnings) > {code} > val sqlContext = new HiveContext(sc) > import sqlContext._ > val jsonRDD = sqlContext.jsonFile(jsonFilePath) > jsonRDD.registerTempTable("jsonTable") > val jsonResult = sql(s"select * from jsonTable") > val foo = jsonResult.zipWithUniqueId().map { >case (Row(...), uniqueId) => // do something useful >... > } > foo.registerTempTable("...") > {code} > *Stopped working in 1.3.0* (simply does not compile, and all I did was chan
[jira] [Updated] (SPARK-6513) Regression - Adding zipWithUniqueId (and other missing RDD APIs) to RDDApi.scala
[ https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eran Medan updated SPARK-6513: -- Priority: Blocker (was: Major) > Regression - Adding zipWithUniqueId (and other missing RDD APIs) to > RDDApi.scala > > > Key: SPARK-6513 > URL: https://issues.apache.org/jira/browse/SPARK-6513 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I > don't think it's relevant) >Reporter: Eran Medan >Priority: Blocker > > I'm sure this has an Issue somewhere but I can't find it. > I see this as a regression bug, since it compiled in 1.2.1 but stopped in 1.3 > without any earlier deprecation warnings, but I am sure the authors are well > aware, so please change it to an enhancement request if you disagree this is > a regression. It's such an obvious and blunt regression that I doubt it was > done without a lot of thought and I'm sure there was a good reason, but still > it breaks my code and I don't have a workaround :) > Here are the details / steps to reproduce > **Worked in 1.2.1** (without any deprecation warnings) > val sqlContext = new HiveContext(sc) > import sqlContext._ > val jsonRDD = sqlContext.jsonFile(jsonFilePath) > jsonRDD.registerTempTable("jsonTable") > val jsonResult = sql(s"select * from jsonTable") > val foo = jsonResult.zipWithUniqueId().map { >case (Row(...), uniqueId) => // do something useful >... > } > foo.registerTempTable("...") > **Stopped working in 1.3.0** (simply does not compile, and all I did was > change to 1.3) > > jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method > **Not working workaround:** > although this might give me an RDD[Row]: > jsonResult.map(identity).zipWithUniqueId() > Now this won't work, as `RDD[Row]` does not have a `registerTempTable` > method > foo.registerTempTable("...") > (see related SO question: > http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6079) Use index to speed up StatusTracker.getJobIdsForGroup()
[ https://issues.apache.org/jira/browse/SPARK-6079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6079: - Affects Version/s: 1.3.0 > Use index to speed up StatusTracker.getJobIdsForGroup() > --- > > Key: SPARK-6079 > URL: https://issues.apache.org/jira/browse/SPARK-6079 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > Fix For: 1.4.0 > > > {{StatusTracker.getJobIdsForGroup()}} is implemented via a linear scan over a > HashMap rather than using an index. This might be an expensive operation if > there are many (e.g. thousands) of retained jobs. We can add a new index to > speed this up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
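The index the description proposes could look roughly like this (field and method names are illustrative, not the actual JobProgressListener members):
{code}
import scala.collection.mutable

// Maintain a group -> job-ids index incrementally instead of scanning
// every retained job on each getJobIdsForGroup() call.
val jobGroupToJobIds = new mutable.HashMap[String, mutable.HashSet[Int]]

def indexJobStart(jobGroup: String, jobId: Int): Unit =
  jobGroupToJobIds.getOrElseUpdate(jobGroup, new mutable.HashSet[Int]) += jobId

def getJobIdsForGroup(jobGroup: String): Array[Int] =
  jobGroupToJobIds.get(jobGroup).map(_.toArray).getOrElse(Array.empty)
{code}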
[jira] [Updated] (SPARK-6513) Regression - Adding zipWithUniqueId (and other missing RDD APIs) to RDDApi.scala
[ https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eran Medan updated SPARK-6513: -- Summary: Regression - Adding zipWithUniqueId (and other missing RDD APIs) to RDDApi.scala (was: Regression Adding zipWithUniqueId (and other missing RDD APIs) to RDDApi.scala) > Regression - Adding zipWithUniqueId (and other missing RDD APIs) to > RDDApi.scala > > > Key: SPARK-6513 > URL: https://issues.apache.org/jira/browse/SPARK-6513 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I > don't think it's relevant) >Reporter: Eran Medan > > I'm sure this has an Issue somewhere but I can't find it. > I see this as a regression bug, since it compiled in 1.2.1 but stopped in 1.3 > without any earlier deprecation warnings, but I am sure the authors are well > aware, so please change it to an enhancement request if you disagree this is > a regression. It's such an obvious and blunt regression that I doubt it was > done without a lot of thought and I'm sure there was a good reason, but still > it breaks my code and I don't have a workaround :) > Here are the details / steps to reproduce > **Worked in 1.2.1** (without any deprecation warnings) > val sqlContext = new HiveContext(sc) > import sqlContext._ > val jsonRDD = sqlContext.jsonFile(jsonFilePath) > jsonRDD.registerTempTable("jsonTable") > val jsonResult = sql(s"select * from jsonTable") > val foo = jsonResult.zipWithUniqueId().map { >case (Row(...), uniqueId) => // do something useful >... > } > foo.registerTempTable("...") > **Stopped working in 1.3.0** (simply does not compile, and all I did was > change to 1.3) > > jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method > **Not working workaround:** > although this might give me an RDD[Row]: > jsonResult.map(identity).zipWithUniqueId() > Now this won't work, as `RDD[Row]` does not have a `registerTempTable` > method > foo.registerTempTable("...") > (see related SO question: > http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6513) Regression Adding zipWithUniqueId (and other missing RDD APIs) to RDDApi.scala
Eran Medan created SPARK-6513: - Summary: Regression Adding zipWithUniqueId (and other missing RDD APIs) to RDDApi.scala Key: SPARK-6513 URL: https://issues.apache.org/jira/browse/SPARK-6513 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I don't think it's relevant) Reporter: Eran Medan I'm sure this has an Issue somewhere but I can't find it. I see this as a regression bug, since it compiled in 1.2.1 but stopped in 1.3 without any earlier deprecation warnings, but I am sure the authors are well aware, so please change it to an enhancement request if you disagree this is a regression. It's such an obvious and blunt regression that I doubt it was done without a lot of thought and I'm sure there was a good reason, but still it breaks my code and I don't have a workaround :) Here are the details / steps to reproduce **Worked in 1.2.1** (without any deprecation warnings) val sqlContext = new HiveContext(sc) import sqlContext._ val jsonRDD = sqlContext.jsonFile(jsonFilePath) jsonRDD.registerTempTable("jsonTable") val jsonResult = sql(s"select * from jsonTable") val foo = jsonResult.zipWithUniqueId().map { case (Row(...), uniqueId) => // do something useful ... } foo.registerTempTable("...") **Stopped working in 1.3.0** (simply does not compile, and all I did was change to 1.3) jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method **Not working workaround:** although this might give me an RDD[Row]: jsonResult.map(identity).zipWithUniqueId() Now this won't work, as `RDD[Row]` does not have a `registerTempTable` method foo.registerTempTable("...") (see related SO question: http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6209) ExecutorClassLoader can leak connections after failing to load classes from the REPL class server
[ https://issues.apache.org/jira/browse/SPARK-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6209: - Fix Version/s: 1.4.0 1.3.1 > ExecutorClassLoader can leak connections after failing to load classes from > the REPL class server > - > > Key: SPARK-6209 > URL: https://issues.apache.org/jira/browse/SPARK-6209 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0, 1.0.3, 1.1.2, 1.2.1, 1.3.0, 1.4.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.3.1, 1.4.0 > > > ExecutorClassLoader does not ensure proper cleanup of network connections > that it opens. If it fails to load a class, it may leak partially-consumed > InputStreams that are connected to the REPL's HTTP class server, causing that > server to exhaust its thread pool, which can cause the entire job to hang. > Here is a simple reproduction: > With > {code} > ./bin/spark-shell --master local-cluster[8,8,512] > {code} > run the following command: > {code} > sc.parallelize(1 to 1000, 1000).map { x => > try { > Class.forName("some.class.that.does.not.Exist") > } catch { > case e: Exception => // do nothing > } > x > }.count() > {code} > This job will run 253 tasks, then will completely freeze without any errors > or failed tasks. > It looks like the driver has 253 threads blocked in socketRead0() calls: > {code} > [joshrosen ~]$ jstack 16765 | grep socketRead0 | wc > 253 759 14674 > {code} > e.g. > {code} > "qtp1287429402-13" daemon prio=5 tid=0x7f868a1c nid=0x5b03 runnable > [0x0001159bd000] >java.lang.Thread.State: RUNNABLE > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.read(SocketInputStream.java:152) > at java.net.SocketInputStream.read(SocketInputStream.java:122) > at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391) > at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141) > at > org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227) > at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1044) > at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280) > at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) > at > org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) > at > org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) > at java.lang.Thread.run(Thread.java:745) > {code} > Jstack on the executors shows blocking in loadClass / findClass, where a > single thread is RUNNABLE and waiting to hear back from the driver and other > executor threads are BLOCKED on object monitor synchronization at > Class.forName0(). > Remotely triggering a GC on a hanging executor allows the job to progress and > complete more tasks before hanging again. 
If I repeatedly trigger GC on all > of the executors, then the job runs to completion: > {code} > jps | grep CoarseGra | cut -d ' ' -f 1 | xargs -I {} -n 1 -P100 jcmd {} GC.run > {code} > The culprit is a {{catch}} block that ignores all exceptions and performs no > cleanup: > https://github.com/apache/spark/blob/v1.2.0/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L94 > This bug has been present since Spark 1.0.0, but I suspect that we haven't > seen it before because it's pretty hard to reproduce. Triggering this error > requires a job with tasks that trigger ClassNotFoundExceptions yet are still > able to run to completion. It also requires that executors are able to leak > enough open connections to exhaust the class server's Jetty thread pool > limit, which requires that there are a large number of tasks (253+) and > either a large number of executors or a very low amount of GC pressure on > those executors (since GC will cause the leaked connections to be closed). > The fix here is pretty simple: add proper resource cleanup to this class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
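The cleanup the description calls for amounts to closing the stream in a {{finally}} block; a minimal sketch of the pattern (method and variable names are illustrative, not the actual patch):
{code}
import java.io.{ByteArrayOutputStream, InputStream}

// Read the class bytes fully, then release the connection to the class
// server even if reading fails partway through.
def readAndClose(in: InputStream): Array[Byte] = {
  try {
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](4096)
    var n = in.read(buf)
    while (n != -1) {
      out.write(buf, 0, n)
      n = in.read(buf)
    }
    out.toByteArray
  } finally {
    in.close() // otherwise a failed load pins a Jetty worker thread
  }
}
{code}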
[jira] [Commented] (SPARK-6385) ISO 8601 timestamp parsing does not support arbitrary precision second fractions
[ https://issues.apache.org/jira/browse/SPARK-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378689#comment-14378689 ] Michael Armbrust commented on SPARK-6385:
-
[~bruun] I think all you need to do is add another S to the format spec.
{code}
scala> val ISO8601GMT: SimpleDateFormat = new SimpleDateFormat( "yyyy-MM-dd'T'HH:mm:ss.SSSz" )
ISO8601GMT: java.text.SimpleDateFormat = java.text.SimpleDateFormat@8a9df61b

scala> ISO8601GMT.parse("2015-02-02T00:00:07.900GMT-00:00")
res0: java.util.Date = Sun Feb 01 16:00:07 PST 2015

scala> ISO8601GMT.parse("2015-02-02T00:00:07.9000GMT-00:00")
java.text.ParseException: Unparseable date: "2015-02-02T00:00:07.9000GMT-00:00"
  at java.text.DateFormat.parse(DateFormat.java:357)
  at .<init>(<console>:10)
  at .<clinit>(<console>)
  at .<init>(<console>:7)
  at .<clinit>(<console>)
  at $print(<console>)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734)
  at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983)
  at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
  at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604)
  at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568)
  at scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:756)
  at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:801)
  at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:713)
  at scala.tools.nsc.interpreter.ILoop.processLine$1(ILoop.scala:577)
  at scala.tools.nsc.interpreter.ILoop.innerLoop$1(ILoop.scala:584)
  at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:587)
  at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:878)
  at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:833)
  at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:833)
  at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
  at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:833)
  at scala.tools.nsc.MainGenericRunner.runTarget$1(MainGenericRunner.scala:83)
  at scala.tools.nsc.MainGenericRunner.process(MainGenericRunner.scala:96)
  at scala.tools.nsc.MainGenericRunner$.main(MainGenericRunner.scala:105)
  at scala.tools.nsc.MainGenericRunner.main(MainGenericRunner.scala)

scala> val ISO8601GMT: SimpleDateFormat = new SimpleDateFormat( "yyyy-MM-dd'T'HH:mm:ss.SSSSz" )
ISO8601GMT: java.text.SimpleDateFormat = java.text.SimpleDateFormat@c920c906

scala> ISO8601GMT.parse("2015-02-02T00:00:07.9000GMT-00:00")
res2: java.util.Date = Sun Feb 01 16:00:16 PST 2015

scala> ISO8601GMT.parse("2015-02-02T00:00:07.900GMT-00:00")
res3: java.util.Date = Sun Feb 01 16:00:07 PST 2015
{code}
> ISO 8601 timestamp parsing does not support arbitrary precision second
> fractions
>
> Key: SPARK-6385
> URL: https://issues.apache.org/jira/browse/SPARK-6385
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 1.2.1
>Reporter: Nick Bruun
>Priority: Minor
>
> The ISO 8601 timestamp parsing implemented as a resolution to SPARK-4149 does
> not support arbitrary precision fractions of seconds, only millisecond
> precision. Parsing {{2015-02-02T00:00:07.900GMT-00:00}} will succeed, while
> {{2015-02-02T00:00:07.9000GMT-00:00}} will fail.
> The issue is caused by the fixed precision of the parsed format in > [DataTypeConversions.scala#L66|https://github.com/apache/spark/blob/84acd08e0886aa23195f35837c15c09aa7804aff/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataTypeConversions.scala#L66]. > I'm willing to implement a fix, but pointers on the direction would be > appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
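Besides widening the pattern as suggested above, another possible direction (purely a sketch, not the committed fix) is to normalize the fractional seconds to exactly three digits before parsing, since {{SSS}} reads milliseconds and a lenient {{SimpleDateFormat}} misreads longer fractions as whole milliseconds:
{code}
// Truncate fractions longer than three digits (dropping, not rounding,
// the extra precision) and right-pad shorter ones with zeros, so both
// "...07.9GMT..." and "...07.9000GMT..." become "...07.900GMT...".
def normalizeFraction(ts: String): String =
  """\.(\d+)""".r.replaceAllIn(ts, m => "." + m.group(1).padTo(3, '0').take(3))
{code}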
[jira] [Commented] (SPARK-6385) ISO 8601 timestamp parsing does not support arbitrary precision second fractions
[ https://issues.apache.org/jira/browse/SPARK-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378690#comment-14378690 ] Michael Armbrust commented on SPARK-6385: - A PR would be great. Let me know if you have questions. > ISO 8601 timestamp parsing does not support arbitrary precision second > fractions > > > Key: SPARK-6385 > URL: https://issues.apache.org/jira/browse/SPARK-6385 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.1 >Reporter: Nick Bruun >Priority: Minor > > The ISO 8601 timestamp parsing implemented as a resolution to SPARK-4149 does > not support arbitrary precision fractions of seconds, only millisecond > precision. Parsing {{2015-02-02T00:00:07.900GMT-00:00}} will succeed, while > {{2015-02-02T00:00:07.9000GMT-00:00}} will fail. > The issue is caused by the fixed precision of the parsed format in > [DataTypeConversions.scala#L66|https://github.com/apache/spark/blob/84acd08e0886aa23195f35837c15c09aa7804aff/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataTypeConversions.scala#L66]. > I'm willing to implement a fix, but pointers on the direction would be > appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6380) Resolution of equi-join key in post-join projection
[ https://issues.apache.org/jira/browse/SPARK-6380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6380: Target Version/s: 1.4.0 > Resolution of equi-join key in post-join projection > --- > > Key: SPARK-6380 > URL: https://issues.apache.org/jira/browse/SPARK-6380 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin > > {code} > df1.join(df2, df1("key") === df2("key")).select("key") > {code} > It would be great to just resolve key to df1("key") in the case of inner > joins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
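Until bare names resolve automatically, the caller can qualify the column explicitly; an illustrative sketch built on the issue's own frames:
{code}
// The bare string "key" is ambiguous after the equi-join, but a column
// qualified through one side of the join is not.
val joined = df1.join(df2, df1("key") === df2("key"))
val keys = joined.select(df1("key")) // resolves against df1 explicitly
{code}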