[jira] [Updated] (SPARK-6520) Kyro serialization broken in the shell

2015-03-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6520:
---
Component/s: Spark Shell

> Kyro serialization broken in the shell
> --
>
> Key: SPARK-6520
> URL: https://issues.apache.org/jira/browse/SPARK-6520
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.3.0
>Reporter: Aaron Defazio
>
> If I start spark as follows:
> {quote}
> ~/spark-1.3.0-bin-hadoop2.4/bin/spark-shell --master local[1] --conf 
> "spark.serializer=org.apache.spark.serializer.KryoSerializer"
> {quote}
> Then using :paste, run 
> {quote}
> case class Example(foo : String, bar : String)
> val ex = sc.parallelize(List(Example("foo1", "bar1"), Example("foo2", 
> "bar2"))).collect()
> {quote}
> I get the error:
> {quote}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.io.IOException: 
> com.esotericsoftware.kryo.KryoException: Error constructing instance of 
> class: $line3.$read
> Serialization trace:
> $VAL10 ($iwC)
> $outer ($iwC$$iwC)
> $outer ($iwC$$iwC$Example)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1140)
>   at 
> org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:979)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1873)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> {quote}
> As far as I can tell, when using :paste, Kryo serialization doesn't work for 
> classes defined within the same paste. It does work when the statements 
> are entered without paste.
> This issue seems serious to me, since Kryo serialization is virtually 
> mandatory for performance (20x slower with default serialization on my 
> problem), and I'm assuming feature parity between spark-shell and 
> spark-submit is a goal.
> Note that this is different from SPARK-6497, which covers the case when Kryo 
> is set to require registration.
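For comparison, a minimal sketch of the same reproduction compiled into a standalone application and run via spark-submit, where the case class is a top-level compiled class rather than a REPL-generated inner class; the object name and the explicit registerKryoClasses call are illustrative assumptions, not part of the original report:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Top-level case class compiled into the application jar, not defined inside the REPL.
case class Example(foo: String, bar: String)

object KryoExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("KryoExample")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Example])) // optional explicit registration
    val sc = new SparkContext(conf)
    val ex = sc.parallelize(Seq(Example("foo1", "bar1"), Example("foo2", "bar2"))).collect()
    ex.foreach(println)
    sc.stop()
  }
}
{code}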






[jira] [Updated] (SPARK-6499) pyspark: printSchema command on a dataframe hangs

2015-03-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6499:
---
Component/s: PySpark

> pyspark: printSchema command on a dataframe hangs
> -
>
> Key: SPARK-6499
> URL: https://issues.apache.org/jira/browse/SPARK-6499
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: cynepia
> Attachments: airports.json, pyspark.txt
>
>
> A printSchema() call on a dataframe fails to respond even after a long time.
> Console logs will be attached.






[jira] [Updated] (SPARK-6504) Cannot read Parquet files generated from different versions at once

2015-03-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6504:
---
Component/s: SQL

> Cannot read Parquet files generated from different versions at once
> ---
>
> Key: SPARK-6504
> URL: https://issues.apache.org/jira/browse/SPARK-6504
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Marius Soutier
>
> When trying to read Parquet files generated by Spark 1.1.1 and 1.2.1 at the 
> same time via 
> `sqlContext.parquetFile("fileFrom1.1.parquet,fileFrom1.2.parquet")`, an 
> exception occurs:
> could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has 
> conflicting values: 
> [{"type":"struct","fields":[{"name":"date","type":"string","nullable":true,"metadata":{}},{"name":"account","type":"string","nullable":true,"metadata":{}},{"name":"impressions","type":"long","nullable":false,"metadata":{}},{"name":"cost","type":"double","nullable":false,"metadata":{}},{"name":"clicks","type":"long","nullable":false,"metadata":{}},{"name":"conversions","type":"long","nullable":false,"metadata":{}},{"name":"orderValue","type":"double","nullable":false,"metadata":{}}]},
>  StructType(List(StructField(date,StringType,true), 
> StructField(account,StringType,true), 
> StructField(impressions,LongType,false), StructField(cost,DoubleType,false), 
> StructField(clicks,LongType,false), StructField(conversions,LongType,false), 
> StructField(orderValue,DoubleType,false)))]
> The schema is in fact exactly the same in both files.
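A hedged workaround sketch, assuming the two files really do share the same logical schema as stated above: load each file separately so the conflicting footer metadata never has to be merged, then union the results (the file names are the placeholders from the report):

{code}
// Read each file on its own so Parquet footer metadata is not merged across versions.
val df11 = sqlContext.parquetFile("fileFrom1.1.parquet")
val df12 = sqlContext.parquetFile("fileFrom1.2.parquet")

// Combine them; this relies on the logical schemas being identical.
val combined = df11.unionAll(df12)
{code}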






[jira] [Created] (SPARK-6530) ChiSqSelector transformer

2015-03-24 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-6530:


 Summary: ChiSqSelector transformer
 Key: SPARK-6530
 URL: https://issues.apache.org/jira/browse/SPARK-6530
 Project: Spark
  Issue Type: Sub-task
Reporter: Xusen Yin









[jira] [Created] (SPARK-6529) Word2Vec transformer

2015-03-24 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-6529:


 Summary: Word2Vec transformer
 Key: SPARK-6529
 URL: https://issues.apache.org/jira/browse/SPARK-6529
 Project: Spark
  Issue Type: Sub-task
Reporter: Xusen Yin









[jira] [Created] (SPARK-6528) IDF transformer

2015-03-24 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-6528:


 Summary: IDF transformer
 Key: SPARK-6528
 URL: https://issues.apache.org/jira/browse/SPARK-6528
 Project: Spark
  Issue Type: Sub-task
Reporter: Xusen Yin









[jira] [Comment Edited] (SPARK-6495) DataFrame#insertInto method should support insert rows with sub-columns

2015-03-24 Thread Chaozhong Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379243#comment-14379243
 ] 

Chaozhong Yang edited comment on SPARK-6495 at 3/25/15 6:31 AM:


Thanks! Maybe what you are referring to is the resolved issue 
https://issues.apache.org/jira/browse/SPARK-3851. Reading data from Parquet 
files with different but compatible schemas has been supported since Spark 1.3.0.
 

https://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging


was (Author: debugger87):
Thanks! Maybe what you point at is the resolved issue 
https://issues.apache.org/jira/browse/SPARK-3851. Reading data from  parquet 
files with different but compatible schemas  has been support in Spark 1.3.0.  

https://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging

> DataFrame#insertInto method should support insert rows with sub-columns
> ---
>
> Key: SPARK-6495
> URL: https://issues.apache.org/jira/browse/SPARK-6495
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Chaozhong Yang
>
> The original table's schema is like this:
>  |-- a: string (nullable = true)
>  |-- b: string (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: string (nullable = true)
> If we want to insert one row (which can be transformed into a DataFrame) with this 
> schema:
>  |-- a: string (nullable = true)
>  |-- b: string (nullable = true)
>  |-- c: string (nullable = true)
> Of course, that operation will fail. Actually, in many cases, people need to 
> insert new rows whose columns are a subset of the original table's columns. 
> If we can support this case, Spark SQL's insertion will be more 
> valuable to users.
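Until such support exists, a hedged sketch of one way to make a subset-row insert line up, by padding the missing column with nulls before calling insertInto; the names subsetDF and original_table are illustrative only, and this assumes the Spark 1.3 DataFrame API:

{code}
import org.apache.spark.sql.functions.lit

// Add the missing column "d" as a null string so the row schema matches the table schema.
val padded = subsetDF.withColumn("d", lit(null).cast("string"))

// Now the column sets line up and the insert can proceed.
padded.insertInto("original_table")
{code}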






[jira] [Closed] (SPARK-6495) DataFrame#insertInto method should support insert rows with sub-columns

2015-03-24 Thread Chaozhong Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chaozhong Yang closed SPARK-6495.
-
Resolution: Not a Problem

> DataFrame#insertInto method should support insert rows with sub-columns
> ---
>
> Key: SPARK-6495
> URL: https://issues.apache.org/jira/browse/SPARK-6495
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Chaozhong Yang
>
> The original table's schema is like this:
>  |-- a: string (nullable = true)
>  |-- b: string (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: string (nullable = true)
> If we want to insert one row (which can be transformed into a DataFrame) with this 
> schema:
>  |-- a: string (nullable = true)
>  |-- b: string (nullable = true)
>  |-- c: string (nullable = true)
> Of course, that operation will fail. Actually, in many cases, people need to 
> insert new rows whose columns are a subset of the original table's columns. 
> If we can support this case, Spark SQL's insertion will be more 
> valuable to users.






[jira] [Commented] (SPARK-6526) Add Normalizer transformer

2015-03-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379371#comment-14379371
 ] 

Apache Spark commented on SPARK-6526:
-

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/5181

> Add Normalizer transformer
> --
>
> Key: SPARK-6526
> URL: https://issues.apache.org/jira/browse/SPARK-6526
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
> Fix For: 1.4.0
>
>







[jira] [Updated] (SPARK-6526) Add Normalizer transformer

2015-03-24 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-6526:
-
Description: https://github.com/apache/spark/pull/5181

> Add Normalizer transformer
> --
>
> Key: SPARK-6526
> URL: https://issues.apache.org/jira/browse/SPARK-6526
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
> Fix For: 1.4.0
>
>
> https://github.com/apache/spark/pull/5181






[jira] [Created] (SPARK-6527) sc.binaryFiles can not access files on s3

2015-03-24 Thread Zhao Zhang (JIRA)
Zhao Zhang created SPARK-6527:
-

 Summary: sc.binaryFiles can not access files on s3
 Key: SPARK-6527
 URL: https://issues.apache.org/jira/browse/SPARK-6527
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.3.0, 1.2.0
 Environment: I am running Spark on EC2
Reporter: Zhao Zhang


sc.binaryFiles() cannot access files stored on S3. It correctly 
lists the number of files, but reports "file does not exist" when processing 
them. I also tried sc.textFile(), which works fine.
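A hedged reproduction sketch of the behaviour described above; the bucket name and s3n path are placeholders:

{code}
// Listing works: the pair RDD reports the expected number of files.
val bins = sc.binaryFiles("s3n://my-bucket/data/*")
println(bins.count())

// Processing the streams is what reportedly fails with "file does not exist".
bins.map { case (path, stream) => (path, stream.toArray().length) }.collect()

// textFile on the same path works fine.
val texts = sc.textFile("s3n://my-bucket/data/*")
println(texts.count())
{code}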






[jira] [Commented] (SPARK-6525) Add new feature transformers in ML package

2015-03-24 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379358#comment-14379358
 ] 

Xusen Yin commented on SPARK-6525:
--

[~mengxr] Let's add new feature transformers. I will try to write the 
Normalizer transformer first.
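For reference, a minimal sketch of the existing mllib Normalizer that an ml.feature transformer would presumably wrap (the wrapping itself is an assumption here; only the mllib calls below are existing API):

{code}
import org.apache.spark.mllib.feature.Normalizer
import org.apache.spark.mllib.linalg.Vectors

// L2 normalization of a single vector with the existing mllib API.
val normalizer = new Normalizer(p = 2.0)
println(normalizer.transform(Vectors.dense(3.0, 4.0))) // expected: [0.6, 0.8]
{code}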

> Add new feature transformers in ML package
> --
>
> Key: SPARK-6525
> URL: https://issues.apache.org/jira/browse/SPARK-6525
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Xusen Yin
>  Labels: features
> Fix For: 1.4.0
>
>
> New feature transformers should be added to the ML package to make assembling ML 
> pipelines easier.






[jira] [Created] (SPARK-6526) Add Normalizer transformer

2015-03-24 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-6526:


 Summary: Add Normalizer transformer
 Key: SPARK-6526
 URL: https://issues.apache.org/jira/browse/SPARK-6526
 Project: Spark
  Issue Type: Sub-task
Reporter: Xusen Yin









[jira] [Created] (SPARK-6525) Add new feature transformers in ML package

2015-03-24 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-6525:


 Summary: Add new feature transformers in ML package
 Key: SPARK-6525
 URL: https://issues.apache.org/jira/browse/SPARK-6525
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.3.0
Reporter: Xusen Yin
 Fix For: 1.4.0


New feature transformers should be added to the ML package to make assembling ML 
pipelines easier.






[jira] [Commented] (SPARK-6485) Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark

2015-03-24 Thread Meethu Mathew (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379333#comment-14379333
 ] 

Meethu Mathew commented on SPARK-6485:
--

As you mentioned in https://issues.apache.org/jira/browse/SPARK-6100, 
MatrixUDT has been merged, but MatrixUDT for PySpark seems to be still in 
progress. Does https://issues.apache.org/jira/browse/SPARK-6390 block this task?

> Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark
> --
>
> Key: SPARK-6485
> URL: https://issues.apache.org/jira/browse/SPARK-6485
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>
> We should add APIs for CoordinateMatrix/RowMatrix/IndexedRowMatrix in 
> PySpark. Internally, we can use DataFrames for serialization.
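For context, a minimal sketch of the existing Scala distributed-matrix API that the proposed PySpark wrappers would mirror (the sample data is illustrative):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Build a RowMatrix from an RDD of dense vectors and query its dimensions.
val rows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
val mat = new RowMatrix(rows)
println((mat.numRows(), mat.numCols())) // (2, 2)
{code}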






[jira] [Comment Edited] (SPARK-6465) GenericRowWithSchema: KryoException: Class cannot be created (missing no-arg constructor):

2015-03-24 Thread Earthson Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377490#comment-14377490
 ] 

Earthson Lu edited comment on SPARK-6465 at 3/25/15 5:26 AM:
-

I'm confused.

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L94

{code}
  def convertRowToScala(r: Row, schema: StructType): Row = {
    // TODO: This is very slow!!!
    new GenericRowWithSchema( // Why we need GenericRowWithSchema? It seems to be the only use of GenericRowWithSchema
      r.toSeq.zip(schema.fields.map(_.dataType))
        .map(r_dt => convertToScala(r_dt._1, r_dt._2)).toArray, schema)
  }
{code}


was (Author: earthsonlu):
I'm confused.

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L94

{code:scala}
  def convertRowToScala(r: Row, schema: StructType): Row = {
// TODO: This is very slow!!!
new GenericRowWithSchema( //Why we need GenericRowWithSchema? It seems to 
be the only use of GenericRowWithSchema
  r.toSeq.zip(schema.fields.map(_.dataType))
.map(r_dt => convertToScala(r_dt._1, r_dt._2)).toArray, schema)
  }
{code}

> GenericRowWithSchema: KryoException: Class cannot be created (missing no-arg 
> constructor):
> --
>
> Key: SPARK-6465
> URL: https://issues.apache.org/jira/browse/SPARK-6465
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Spark 1.3, YARN 2.6.0, CentOS
>Reporter: Earthson Lu
>Assignee: Michael Armbrust
>Priority: Critical
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> I cannot find an existing issue for this. 
> The Kryo registration for GenericRowWithSchema is missing in 
> org.apache.spark.sql.execution.SparkSqlSerializer.
> Is registering it there the only thing we need to do?
> Here is the log
> {code}
> 15/03/23 16:21:00 WARN TaskSetManager: Lost task 9.0 in stage 20.0 (TID 
> 31978, datanode06.site): com.esotericsoftware.kryo.KryoException: Class 
> cannot be created (missing no-arg constructor): 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
> at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1050)
> at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1062)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:228)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:217)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
> at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:138)
> at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:66)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:217)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> {code}





[jira] [Comment Edited] (SPARK-6465) GenericRowWithSchema: KryoException: Class cannot be created (missing no-arg constructor):

2015-03-24 Thread Earthson Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377490#comment-14377490
 ] 

Earthson Lu edited comment on SPARK-6465 at 3/25/15 5:25 AM:
-

I'm confused.

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L94

{code:scala}
  def convertRowToScala(r: Row, schema: StructType): Row = {
    // TODO: This is very slow!!!
    new GenericRowWithSchema( // Why we need GenericRowWithSchema? It seems to be the only use of GenericRowWithSchema
      r.toSeq.zip(schema.fields.map(_.dataType))
        .map(r_dt => convertToScala(r_dt._1, r_dt._2)).toArray, schema)
  }
{code}


was (Author: earthsonlu):
I'm confused.

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L94

```scala
  def convertRowToScala(r: Row, schema: StructType): Row = {
// TODO: This is very slow!!!
new GenericRowWithSchema( //Why we need GenericRowWithSchema? It seems to 
be the only use of GenericRowWithSchema
  r.toSeq.zip(schema.fields.map(_.dataType))
.map(r_dt => convertToScala(r_dt._1, r_dt._2)).toArray, schema)
  }
```

> GenericRowWithSchema: KryoException: Class cannot be created (missing no-arg 
> constructor):
> --
>
> Key: SPARK-6465
> URL: https://issues.apache.org/jira/browse/SPARK-6465
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Spark 1.3, YARN 2.6.0, CentOS
>Reporter: Earthson Lu
>Assignee: Michael Armbrust
>Priority: Critical
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> I cannot find an existing issue for this. 
> The Kryo registration for GenericRowWithSchema is missing in 
> org.apache.spark.sql.execution.SparkSqlSerializer.
> Is registering it there the only thing we need to do?
> Here is the log
> {code}
> 15/03/23 16:21:00 WARN TaskSetManager: Lost task 9.0 in stage 20.0 (TID 
> 31978, datanode06.site): com.esotericsoftware.kryo.KryoException: Class 
> cannot be created (missing no-arg constructor): 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
> at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1050)
> at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1062)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:228)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:217)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
> at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:138)
> at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:66)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:217)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> {code}




[jira] [Created] (SPARK-6524) Problem connecting JAVA API to Spark Yarn Cluster or yarn Client

2015-03-24 Thread milan.b (JIRA)
milan.b created SPARK-6524:
--

 Summary: Problem connecting JAVA API to Spark Yarn Cluster or yarn 
Client
 Key: SPARK-6524
 URL: https://issues.apache.org/jira/browse/SPARK-6524
 Project: Spark
  Issue Type: Question
Affects Versions: 1.2.0
 Environment: Ubuntu 14.10 
java JDK  1.8 

Reporter: milan.b


Hi Team,

I am trying to submit a Spark job to yarn-cluster or yarn-client mode using the 
Java API, but I was unable to do so. Following is the configuration code:
   
System.setProperty("SPARK_YARN_MODE", "true");
SparkConf sparkYarnConf = new SparkConf().setAppName("Spark yarn")
    .setMaster("yarn-cluster")
    .set("spark.executor.memory", "258m")
    .set("spark.driver.memory", "2588m")
    .set("spark.yarn.app.id", "append");
ClientArguments clientArgs = new ClientArguments(args, sparkYarnConf);

Error :

org.apache.spark.util.Utils] (MSC service thread 1-2) Service 'SparkUI' could 
not bind on port 4043. Attempting port 4044.
20:57:41,808 ERROR [stderr] (MSC service thread 1-2) 15/03/24 20:57:41 WARN 
Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
20:57:41,842 INFO  [org.apache.spark.util.Utils] (MSC service thread 1-2) 
Successfully started service 'SparkUI' on port 4044.
20:57:41,846 ERROR [stderr] (MSC service thread 1-2) 15/03/24 20:57:41 INFO 
Utils: Successfully started service 'SparkUI' on port 4044.
20:57:41,850 INFO  [org.apache.spark.ui.SparkUI] (MSC service thread 1-2) 
Started SparkUI at http://D7271:4044
20:57:41,850 ERROR [stderr] (MSC service thread 1-2) 15/03/24 20:57:41 INFO 
SparkUI: Started SparkUI at http://D7271:4044
20:57:41,968 INFO  [org.apache.spark.scheduler.cluster.YarnClusterScheduler] 
(MSC service thread 1-2) Created YarnClusterScheduler
20:57:41,968 ERROR [stderr] (MSC service thread 1-2) 15/03/24 20:57:41 INFO 
YarnClusterScheduler: Created YarnClusterScheduler
20:57:42,243 INFO  [org.apache.spark.network.netty.NettyBlockTransferService] 
(MSC service thread 1-2) Server created on 45902
20:57:42,243 ERROR [stderr] (MSC service thread 1-2) 15/03/24 20:57:42 INFO 
NettyBlockTransferService: Server created on 45902
20:57:42,247 INFO  [org.apache.spark.storage.BlockManagerMaster] (MSC service 
thread 1-2) Trying to register BlockManager
20:57:42,247 ERROR [stderr] (MSC service thread 1-2) 15/03/24 20:57:42 INFO 
BlockManagerMaster: Trying to register BlockManager
20:57:42,264 INFO  [org.apache.spark.storage.BlockManagerMasterActor] 
(sparkDriver-akka.actor.default-dispatcher-2) Registering block manager 
D7271:45902 with 246.0 MB RAM, BlockManagerId(, D7271, 45902)
20:57:42,265 ERROR [stderr] (sparkDriver-akka.actor.default-dispatcher-2) 
15/03/24 20:57:42 INFO BlockManagerMasterActor: Registering block manager 
D7271:45902 with 246.0 MB RAM, BlockManagerId(, D7271, 45902)
20:57:42,272 INFO  [org.apache.spark.storage.BlockManagerMaster] (MSC service 
thread 1-2) Registered BlockManager
20:57:42,276 ERROR [stderr] (MSC service thread 1-2) 15/03/24 20:57:42 INFO 
BlockManagerMaster: Registered BlockManager
20:57:42,330 ERROR [org.springframework.web.context.ContextLoader] (MSC service 
thread 1-2) Context initialization failed: 
org.springframework.beans.factory.BeanCreationException: Error creating bean 
with name 'exampleInitBean' defined in ServletContext resource 
[/WEB-INF/mvc-dispatcher-servlet.xml]: Invocation of init method failed; nested 
exception is java.lang.NullPointerException
at 
org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1553)
 [spring-beans-4.0.6.RELEASE.jar:4.0.6.RELEASE]
at 
org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:539)
 [spring-beans-4.0.6.RELEASE.jar:4.0.6.RELEASE]
at 
org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:475)
 [spring-beans-4.0.6.RELEASE.jar:4.0.6.RELEASE]
at 
org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:302)
 [spring-beans-4.0.6.RELEASE.jar:4.0.6.RELEASE]
at 
org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:228)
 [spring-beans-4.0.6.RELEASE.jar:4.0.6.RELEASE]
at 
org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:298)
 [spring-beans-4.0.6.RELEASE.jar:4.0.6.RELEASE]
at 
org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:193)
 [spring-beans-4.0.6.RELEASE.jar:4.0.6.RELEASE]
at 
org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:703)
 [spring-beans-4.0.6.RELEA

[jira] [Created] (SPARK-6523) Error when get attribute of StandardScalerModel, When use python api

2015-03-24 Thread lee.xiaobo.2006 (JIRA)
lee.xiaobo.2006 created SPARK-6523:
--

 Summary: Error when get attribute of StandardScalerModel, When use 
python api
 Key: SPARK-6523
 URL: https://issues.apache.org/jira/browse/SPARK-6523
 Project: Spark
  Issue Type: Bug
  Components: Examples, MLlib, PySpark
Affects Versions: 1.3.0
Reporter: lee.xiaobo.2006


test code
===
from pyspark import SparkConf, SparkContext
from pyspark.mllib.util import MLUtils
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.feature import StandardScaler

conf = SparkConf().setAppName('Test')
sc = SparkContext(conf=conf)
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
label = data.map(lambda x: x.label)
features = data.map(lambda x: x.features)
scaler1 = StandardScaler().fit(features)
print scaler1.std   # error: StandardScalerModel has no attribute 'std' in the Python API
sc.stop()

---
error:

Traceback (most recent call last):
  File "/data1/s/apps/spark-app/app/test_ssm.py", line 22, in 
print scaler1.std
AttributeError: 'StandardScalerModel' object has no attribute 'std'
15/03/25 12:17:28 INFO Utils: path = 
/data1/s/apps/spark-1.4.0-SNAPSHOT/data/spark-eb1ed7c0-a5ce-4748-a817-3cb0687ee282/blockmgr-5398b477-127d-4259-a71b-608a324e1cd3,
 already present as root for deletion.


=
Another question: how can I serialize or save the scaler model?






[jira] [Comment Edited] (SPARK-6495) DataFrame#insertInto method should support insert rows with sub-columns

2015-03-24 Thread Chaozhong Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379243#comment-14379243
 ] 

Chaozhong Yang edited comment on SPARK-6495 at 3/25/15 4:08 AM:


Thanks! Maybe what you are referring to is the resolved issue 
https://issues.apache.org/jira/browse/SPARK-3851. Reading data from Parquet 
files with different but compatible schemas has been supported since Spark 1.3.0.

https://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging


was (Author: debugger87):
Thanks! Maybe what you point at is the resolved issue 
https://issues.apache.org/jira/browse/SPARK-3851. Reading data from  parquet 
files with different but compatible schemas  has been support in Spark 1.3.0. 

> DataFrame#insertInto method should support insert rows with sub-columns
> ---
>
> Key: SPARK-6495
> URL: https://issues.apache.org/jira/browse/SPARK-6495
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Chaozhong Yang
>
> The original table's schema is like this:
>  |-- a: string (nullable = true)
>  |-- b: string (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: string (nullable = true)
> If we want to insert one row (which can be transformed into a DataFrame) with this 
> schema:
>  |-- a: string (nullable = true)
>  |-- b: string (nullable = true)
>  |-- c: string (nullable = true)
> Of course, that operation will fail. Actually, in many cases, people need to 
> insert new rows whose columns are a subset of the original table's columns. 
> If we can support this case, Spark SQL's insertion will be more 
> valuable to users.






[jira] [Commented] (SPARK-6495) DataFrame#insertInto method should support insert rows with sub-columns

2015-03-24 Thread Chaozhong Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379243#comment-14379243
 ] 

Chaozhong Yang commented on SPARK-6495:
---

Thanks! Maybe what you are referring to is the resolved issue 
https://issues.apache.org/jira/browse/SPARK-3851. Reading data from Parquet 
files with different but compatible schemas has been supported since Spark 1.3.0.

> DataFrame#insertInto method should support insert rows with sub-columns
> ---
>
> Key: SPARK-6495
> URL: https://issues.apache.org/jira/browse/SPARK-6495
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Chaozhong Yang
>
> The original table's schema is like this:
>  |-- a: string (nullable = true)
>  |-- b: string (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: string (nullable = true)
> If we want to insert one row (which can be transformed into a DataFrame) with this 
> schema:
>  |-- a: string (nullable = true)
>  |-- b: string (nullable = true)
>  |-- c: string (nullable = true)
> Of course, that operation will fail. Actually, in many cases, people need to 
> insert new rows whose columns are a subset of the original table's columns. 
> If we can support this case, Spark SQL's insertion will be more 
> valuable to users.






[jira] [Created] (SPARK-6522) Standardize Random Number Generation

2015-03-24 Thread RJ Nowling (JIRA)
RJ Nowling created SPARK-6522:
-

 Summary: Standardize Random Number Generation
 Key: SPARK-6522
 URL: https://issues.apache.org/jira/browse/SPARK-6522
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: RJ Nowling
Priority: Minor


Generation of random numbers in Spark has to be handled carefully, since an RNG 
referenced in a closure is copied, together with its state, to the workers. As such, 
a separate RNG needs to be seeded for each partition. Each time random numbers 
are used in Spark's libraries, this RNG seeding is re-implemented, leaving open the 
possibility of mistakes.

It would be useful if RNG seeding were standardized through utility functions or 
random number generation functions that can be called in Spark pipelines.
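A minimal sketch of the per-partition seeding pattern described above; the base seed value and the choice of deriving each partition's seed as baseSeed + partition index are illustrative, not a proposal for the final utility API:

{code}
import scala.util.Random

// Seed a fresh RNG per partition so workers do not share a single copied RNG state.
val baseSeed = 42L
val data = sc.parallelize(1 to 1000000)
val withNoise = data.mapPartitionsWithIndex { (idx, iter) =>
  val rng = new Random(baseSeed + idx) // distinct, reproducible seed per partition
  iter.map(x => (x, rng.nextDouble()))
}
{code}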






[jira] [Commented] (SPARK-6521) executors in the same node read local shuffle file

2015-03-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379207#comment-14379207
 ] 

Apache Spark commented on SPARK-6521:
-

User 'viper-kun' has created a pull request for this issue:
https://github.com/apache/spark/pull/5178

> executors in the same node read local shuffle file
> --
>
> Key: SPARK-6521
> URL: https://issues.apache.org/jira/browse/SPARK-6521
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: xukun
>
> Previously, an executor read another executor's shuffle files on the same node 
> over the network. This PR makes executors on the same node read shuffle files 
> locally in sort-based shuffle, which reduces network transport.






[jira] [Updated] (SPARK-6521) executors in the same node read local shuffle file

2015-03-24 Thread xukun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xukun updated SPARK-6521:
-
Summary: executors in the same node read local shuffle file  (was: executor 
in the same node read local shuffle file)

> executors in the same node read local shuffle file
> --
>
> Key: SPARK-6521
> URL: https://issues.apache.org/jira/browse/SPARK-6521
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: xukun
>
> Previously, an executor read another executor's shuffle files on the same node 
> over the network. This PR makes executors on the same node read shuffle files 
> locally in sort-based shuffle, which reduces network transport.






[jira] [Created] (SPARK-6521) executor in the same node read local shuffle file

2015-03-24 Thread xukun (JIRA)
xukun created SPARK-6521:


 Summary: executor in the same node read local shuffle file
 Key: SPARK-6521
 URL: https://issues.apache.org/jira/browse/SPARK-6521
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: xukun


Previously, an executor read another executor's shuffle files on the same node 
over the network. This PR makes executors on the same node read shuffle files 
locally in sort-based shuffle, which reduces network transport.







[jira] [Comment Edited] (SPARK-6450) Native Parquet reader does not assign table name as qualifier

2015-03-24 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379125#comment-14379125
 ] 

Cheng Lian edited comment on SPARK-6450 at 3/25/15 1:56 AM:


[~chinnitv], never mind, reproduced this issue with 1.3.0 release and the 
following Spark shell snippet:
{noformat}
sqlContext.sql("""create table if not exists Orders (Country string, 
ProductCategory string, PlacedDate string) stored as parquet""")

sqlContext.sql("""
select
Orders.Country,
Orders.ProductCategory,
count(1)
from
Orders
join (
select
Orders.Country,
count(1) CountryOrderCount
from
Orders
where
to_date(Orders.PlacedDate) > '2015-01-01'
group by
Orders.Country
order by
CountryOrderCount DESC
LIMIT 5
) Top5Countries
on
Top5Countries.Country = Orders.Country
where
to_date(Orders.PlacedDate) > '2015-01-01'
group by
Orders.Country,
Orders.ProductCategory
""").queryExecution.analyzed
{noformat}


was (Author: lian cheng):
[~chinnitv], never mind, reproduced this issue with 1.3.0 release and the 
following Spark shell snippet:
{noformat}
sqlContext.sql("""create table if not exists Orders (Country string, 
ProductCategory string, PlacedDate string) stored as parquet""")

sqlContext.sql("""
select
Orders.Country,
Orders.ProductCategory,
count(1)
from
Orders
join (
select
Orders.Country,
count(1) CountryOrderCount
from
Orders
where
to_date(Orders.PlacedDate) > '2015-01-01'
group by
Orders.Country
order by
CountryOrderCount DESC
LIMIT 5
) Top5Countries
on
Top5Countries.Country = Orders.Country
where
to_date(Orders.PlacedDate) > '2015-01-01'
group by
Orders.Country,
Orders.ProductCategory
""")
{noformat}

> Native Parquet reader does not assign table name as qualifier
> -
>
> Key: SPARK-6450
> URL: https://issues.apache.org/jira/browse/SPARK-6450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Anand Mohan Tumuluri
>Assignee: Michael Armbrust
>Priority: Blocker
>
> The query below was working fine up to 1.3 commit 
> 9a151ce58b3e756f205c9f3ebbbf3ab0ba5b33fd. (Yes, it definitely works at this 
> commit, even though the commit is completely unrelated.)
> It got broken in the 1.3.0 release with an AnalysisException: resolved attributes 
> ... missing from  (although this list contains the fields which it 
> reports missing).
> {code}
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:189)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>   at 
> org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
>   at com.sun.proxy.$Proxy17.executeStatementAsync(Unknown Source)
>   at 
> org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>   at org.apache.t

[jira] [Updated] (SPARK-6520) Kyro serialization broken in the shell

2015-03-24 Thread Aaron Defazio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Defazio updated SPARK-6520:
-
Description: 
If I start spark as follows:
{quote}
~/spark-1.3.0-bin-hadoop2.4/bin/spark-shell --master local[1] --conf 
"spark.serializer=org.apache.spark.serializer.KryoSerializer"
{quote}

Then using :paste, run 
{quote}
case class Example(foo : String, bar : String)
val ex = sc.parallelize(List(Example("foo1", "bar1"), Example("foo2", 
"bar2"))).collect()
{quote}

I get the error:
{quote}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 
0, localhost): java.io.IOException: com.esotericsoftware.kryo.KryoException: 
Error constructing instance of class: $line3.$read
Serialization trace:
$VAL10 ($iwC)
$outer ($iwC$$iwC)
$outer ($iwC$$iwC$Example)
  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1140)
  at 
org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:979)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1873)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
  at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
  at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
{quote}

As far as I can tell, when using :paste, Kryo serialization doesn't work for 
classes defined within the same paste. It does work when the statements are 
entered without paste.

This issue seems serious to me, since Kryo serialization is virtually mandatory 
for performance (20x slower with default serialization on my problem), and I'm 
assuming feature parity between spark-shell and spark-submit is a goal.
Note that this is different from SPARK-6497, which covers the case when Kryo is 
set to require registration.

  was:
If I start spark as follows:
{quote}
~/spark-1.3.0-bin-hadoop2.4/bin/spark-shell --master local[1] --conf 
"spark.serializer=org.apache.spark.serializer.KryoSerializer"
{quote}

Then using :paste, run 
{quote}
case class Example(foo : String, bar : String)
val ex = sc.parallelize(List(Example("foo1", "bar1"), Example("foo2", 
"bar2"))).collect()
{quote}

I get the error:
{quote}
$VAL10 ($iwC)
$outer ($iwC$$iwC)
$outer ($iwC$$iwC$Example)
  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1140)
  at 
org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:979)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1873)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
  at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
  at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
{quote}

As far as I can tell, when using :paste,

[jira] [Created] (SPARK-6520) Kyro serialization broken in the shell

2015-03-24 Thread Aaron Defazio (JIRA)
Aaron Defazio created SPARK-6520:


 Summary: Kyro serialization broken in the shell
 Key: SPARK-6520
 URL: https://issues.apache.org/jira/browse/SPARK-6520
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.3.0
Reporter: Aaron Defazio


If I start spark as follows:
{quote}
~/spark-1.3.0-bin-hadoop2.4/bin/spark-shell --master local[1] --conf 
"spark.serializer=org.apache.spark.serializer.KryoSerializer"
{quote}

Then using :paste, run 
{quote}
case class Example(foo : String, bar : String)
val ex = sc.parallelize(List(Example("foo1", "bar1"), Example("foo2", 
"bar2"))).collect()
{quote}

I get the error:
{quote}
$VAL10 ($iwC)
$outer ($iwC$$iwC)
$outer ($iwC$$iwC$Example)
  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1140)
  at 
org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:979)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1873)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
  at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
  at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
{quote}

As far as I can tell, when using :paste, Kryo serialization doesn't work for 
classes defined within the same paste. It does work when the statements are 
entered without paste.

This issue seems serious to me, since Kryo serialization is virtually mandatory 
for performance (20x slower with default serialization on my problem), and I'm 
assuming feature parity between spark-shell and spark-submit is a goal.
Note that this is different from SPARK-6497, which covers the case when Kryo is 
set to require registration.






[jira] [Commented] (SPARK-6450) Native Parquet reader does not assign table name as qualifier

2015-03-24 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379125#comment-14379125
 ] 

Cheng Lian commented on SPARK-6450:
---

[~chinnitv], never mind, reproduced this issue with 1.3.0 release and the 
following Spark shell snippet:
{noformat}
sqlContext.sql("""create table if not exists Orders (Country string, 
ProductCategory string, PlacedDate string) stored as parquet""")

sqlContext.sql("""
select
Orders.Country,
Orders.ProductCategory,
count(1)
from
Orders
join (
select
Orders.Country,
count(1) CountryOrderCount
from
Orders
where
to_date(Orders.PlacedDate) > '2015-01-01'
group by
Orders.Country
order by
CountryOrderCount DESC
LIMIT 5
) Top5Countries
on
Top5Countries.Country = Orders.Country
where
to_date(Orders.PlacedDate) > '2015-01-01'
group by
Orders.Country,
Orders.ProductCategory
""")
{noformat}

> Native Parquet reader does not assign table name as qualifier
> -
>
> Key: SPARK-6450
> URL: https://issues.apache.org/jira/browse/SPARK-6450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Anand Mohan Tumuluri
>Assignee: Michael Armbrust
>Priority: Blocker
>
> The query below was working fine up to 1.3 commit 
> 9a151ce58b3e756f205c9f3ebbbf3ab0ba5b33fd. (Yes, it definitely works at this 
> commit, even though the commit is completely unrelated.)
> It got broken in the 1.3.0 release with an AnalysisException: resolved attributes 
> ... missing from  (although this list contains the fields which it 
> reports missing).
> {code}
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:189)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>   at 
> org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
>   at com.sun.proxy.$Proxy17.executeStatementAsync(Unknown Source)
>   at 
> org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> {code}
> select Orders.Country, Orders.ProductCategory,count(1) from Orders join 
> (select Orders.Country, count(1) CountryOrderCount from Orders where 
> to_date(Orders.PlacedDate) > '2015-01-01' group by Orders.Country order by 
> CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = 
> Orders.Country where to_date(Orders.PlacedDate) > '2015-01-01' group by 
> Orders.Country,Orders.ProductCategory;
> {code}
> The temporary workaround is to add an explicit alias for the table

[jira] [Commented] (SPARK-6450) Native Parquet reader does not assign table name as qualifier

2015-03-24 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379114#comment-14379114
 ] 

Cheng Lian commented on SPARK-6450:
---

[~chinnitv], could you please provide the DDL of the Orders table to help 
reproducing this issue?

> Native Parquet reader does not assign table name as qualifier
> -
>
> Key: SPARK-6450
> URL: https://issues.apache.org/jira/browse/SPARK-6450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Anand Mohan Tumuluri
>Assignee: Michael Armbrust
>Priority: Blocker
>
> The query below was working fine until 1.3 commit 
> 9a151ce58b3e756f205c9f3ebbbf3ab0ba5b33fd. (Yes, it definitely works at this 
> commit, even though the commit is completely unrelated.)
> It got broken in the 1.3.0 release with an AnalysisException: resolved attributes 
> ... missing from  (although this list contains the fields which it 
> reports missing)
> {code}
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:189)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>   at 
> org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
>   at com.sun.proxy.$Proxy17.executeStatementAsync(Unknown Source)
>   at 
> org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> {code}
> select Orders.Country, Orders.ProductCategory,count(1) from Orders join 
> (select Orders.Country, count(1) CountryOrderCount from Orders where 
> to_date(Orders.PlacedDate) > '2015-01-01' group by Orders.Country order by 
> CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = 
> Orders.Country where to_date(Orders.PlacedDate) > '2015-01-01' group by 
> Orders.Country,Orders.ProductCategory;
> {code}
> The temporary workaround is to add an explicit alias for the table Orders:
> {code}
> select o.Country, o.ProductCategory,count(1) from Orders o join (select 
> r.Country, count(1) CountryOrderCount from Orders r where 
> to_date(r.PlacedDate) > '2015-01-01' group by r.Country order by 
> CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = 
> o.Country where to_date(o.PlacedDate) > '2015-01-01' group by 
> o.Country,o.ProductCategory;
> {code}
> However, this change not only affects self joins; it also seems to affect 
> union queries as well. For example, the query below, which was also working 
> before commit 9a151ce, is now broken:
> {code}
> select Orders.Country,null,count(1) OrderCount from Orders group by 
> Orders.Country,null
> union all
> select null,Orders.ProductCategory,count(1) OrderCount from Orders group by 
> null, Orders.ProductCategory

[jira] [Updated] (SPARK-6413) For data source tables, we should provide better output for described formatted.

2015-03-24 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6413:

Priority: Major  (was: Critical)

> For data source tables, we should provide better output for described 
> formatted.
> 
>
> Key: SPARK-6413
> URL: https://issues.apache.org/jira/browse/SPARK-6413
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> Right now, we will show Hive's stuff like SerDe. Users will be confused when 
> they see the output of "DESCRIBE FORMATTED" (it is a Hive native command for 
> now) and think the table is not stored in the "right" format. Actually, the 
> table is indeed stored in the right format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6413) For data source tables, we should provide better output for described formatted.

2015-03-24 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6413:

Target Version/s: 1.4.0  (was: 1.3.1)

> For data source tables, we should provide better output for described 
> formatted.
> 
>
> Key: SPARK-6413
> URL: https://issues.apache.org/jira/browse/SPARK-6413
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> Right now, we will show Hive's stuff like SerDe. Users will be confused when 
> they see the output of "DESCRIBE FORMATTED" (it is a Hive native command for 
> now) and think the table is not stored in the "right" format. Actually, the 
> table is indeed stored in the right format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6413) For data source tables, we should provide better output for described formatted.

2015-03-24 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6413:

Description: Right now, we will show Hive's stuff like SerDe. Users will be 
confused when they see the output of "DESCRIBE FORMATTED" (it is a Hive native 
command for now) and think the table is not stored in the "right" format. 
Actually, the table is indeed stored in the right format.  (was: Right now, we 
will show Hive's stuff like SerDe. Users will be confused when they see the 
output of "DESCRIBE EXTENDED/FORMATTED" and think the table is not stored in 
the "right" format. Actually, the table is indeed stored in the right format.)

> For data source tables, we should provide better output for described 
> formatted.
> 
>
> Key: SPARK-6413
> URL: https://issues.apache.org/jira/browse/SPARK-6413
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> Right now, we will show Hive's stuff like SerDe. Users will be confused when 
> they see the output of "DESCRIBE FORMATTED" (it is a Hive native command for 
> now) and think the table is not stored in the "right" format. Actually, the 
> table is indeed stored in the right format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6413) For data source tables, we should provide better output for described formatted.

2015-03-24 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6413:

Summary: For data source tables, we should provide better output for 
described formatted.  (was: For data source tables, we should provide better 
output for described extended/formatted.)

> For data source tables, we should provide better output for described 
> formatted.
> 
>
> Key: SPARK-6413
> URL: https://issues.apache.org/jira/browse/SPARK-6413
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> Right now, we will show Hive's stuff like SerDe. Users will be confused when 
> they see the output of "DESCRIBE EXTENDED/FORMATTED" and think the table is 
> not stored in the "right" format. Actually, the table is indeed stored in the 
> right format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6450) Native Parquet reader does not assign table name as qualifier

2015-03-24 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379086#comment-14379086
 ] 

Michael Armbrust commented on SPARK-6450:
-

I am having trouble reproducing this issue.  Can you explain more how you are 
creating the table in question?

> Native Parquet reader does not assign table name as qualifier
> -
>
> Key: SPARK-6450
> URL: https://issues.apache.org/jira/browse/SPARK-6450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Anand Mohan Tumuluri
>Assignee: Michael Armbrust
>Priority: Blocker
>
> The query below was working fine until 1.3 commit 
> 9a151ce58b3e756f205c9f3ebbbf3ab0ba5b33fd. (Yes, it definitely works at this 
> commit, even though the commit is completely unrelated.)
> It got broken in the 1.3.0 release with an AnalysisException: resolved attributes 
> ... missing from  (although this list contains the fields which it 
> reports missing)
> {code}
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:189)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>   at 
> org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
>   at com.sun.proxy.$Proxy17.executeStatementAsync(Unknown Source)
>   at 
> org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> {code}
> select Orders.Country, Orders.ProductCategory,count(1) from Orders join 
> (select Orders.Country, count(1) CountryOrderCount from Orders where 
> to_date(Orders.PlacedDate) > '2015-01-01' group by Orders.Country order by 
> CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = 
> Orders.Country where to_date(Orders.PlacedDate) > '2015-01-01' group by 
> Orders.Country,Orders.ProductCategory;
> {code}
> The temporary workaround is to add an explicit alias for the table Orders:
> {code}
> select o.Country, o.ProductCategory,count(1) from Orders o join (select 
> r.Country, count(1) CountryOrderCount from Orders r where 
> to_date(r.PlacedDate) > '2015-01-01' group by r.Country order by 
> CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = 
> o.Country where to_date(o.PlacedDate) > '2015-01-01' group by 
> o.Country,o.ProductCategory;
> {code}
> However, this change not only affects self joins; it also seems to affect 
> union queries as well. For example, the query below, which was also working 
> before commit 9a151ce, is now broken:
> {code}
> select Orders.Country,null,count(1) OrderCount from Orders group by 
> Orders.Country,null
> union all
> select null,Orders.ProductCategory,count(1) OrderCount from Orders group by 
> null

[jira] [Updated] (SPARK-6519) Add spark.ml API for Hierarchical KMeans

2015-03-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6519:
-
Summary: Add spark.ml API for Hierarchical KMeans  (was: Add wrapper 
classes of DataFrame in spark.ml)

> Add spark.ml API for Hierarchical KMeans
> 
>
> Key: SPARK-6519
> URL: https://issues.apache.org/jira/browse/SPARK-6519
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yu Ishikawa
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5771) Number of Cores in Completed Applications of Standalone Master Web Page always be 0 if sc.stop() is called

2015-03-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379000#comment-14379000
 ] 

Apache Spark commented on SPARK-5771:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/5177

> Number of Cores in Completed Applications of Standalone Master Web Page 
> always be 0 if sc.stop() is called
> --
>
> Key: SPARK-5771
> URL: https://issues.apache.org/jira/browse/SPARK-5771
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.1
>Reporter: Liangliang Gu
>Assignee: Liangliang Gu
>Priority: Minor
>
> In Standalone mode, the number of cores shown under Completed Applications on 
> the Master web page will always be zero if sc.stop() is called, but it is 
> correct if sc.stop() is not called.
> The likely reason: after sc.stop() is called, the removeExecutor method of 
> class ApplicationInfo is invoked, which reduces the variable coresGranted to 
> zero. The variable coresGranted is used to display the number of cores on the 
> web page.
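A minimal, self-contained Scala sketch of the behaviour described above (the class and field names are illustrative stand-ins, not the actual Master/ApplicationInfo code): once every executor is removed on sc.stop(), a coresGranted-style counter drops to zero, so a UI that reads it after completion shows 0.

{code}
// Toy model of the reported behaviour; names are illustrative only.
class AppInfo {
  private var coresGranted = 0                 // what the web page displays
  def addExecutor(cores: Int): Unit = { coresGranted += cores }
  def removeExecutor(cores: Int): Unit = { coresGranted -= cores }
  def coresShownOnWebPage: Int = coresGranted
}

object CoresGrantedDemo extends App {
  val app = new AppInfo
  app.addExecutor(4); app.addExecutor(4)       // application runs with 8 cores
  // sc.stop() triggers removeExecutor for every executor ...
  app.removeExecutor(4); app.removeExecutor(4)
  // ... so the "Completed Applications" table later reads 0 instead of 8.
  println(app.coresShownOnWebPage)             // prints 0
}
{code}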



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6517) Implement the Algorithm of Hierarchical Clustering

2015-03-24 Thread Yu Ishikawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Ishikawa updated SPARK-6517:
---
Summary: Implement the Algorithm of Hierarchical Clustering  (was: 
Implementing the Algorithm)

> Implement the Algorithm of Hierarchical Clustering
> --
>
> Key: SPARK-6517
> URL: https://issues.apache.org/jira/browse/SPARK-6517
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Yu Ishikawa
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6519) Add wrapper classes of DataFrame in spark.ml

2015-03-24 Thread Yu Ishikawa (JIRA)
Yu Ishikawa created SPARK-6519:
--

 Summary: Add wrapper classes of DataFrame in spark.ml
 Key: SPARK-6519
 URL: https://issues.apache.org/jira/browse/SPARK-6519
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Yu Ishikawa






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6518) Add an example and document for Hierarchical Clustering

2015-03-24 Thread Yu Ishikawa (JIRA)
Yu Ishikawa created SPARK-6518:
--

 Summary: Add an example and document for Hierarchical Clustering
 Key: SPARK-6518
 URL: https://issues.apache.org/jira/browse/SPARK-6518
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Yu Ishikawa






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6517) Implementing the Algorithm

2015-03-24 Thread Yu Ishikawa (JIRA)
Yu Ishikawa created SPARK-6517:
--

 Summary: Implementing the Algorithm
 Key: SPARK-6517
 URL: https://issues.apache.org/jira/browse/SPARK-6517
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Yu Ishikawa






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6081) DriverRunner doesn't support pulling HTTP/HTTPS URIs

2015-03-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6081:
-
Affects Version/s: 1.0.0

> DriverRunner doesn't support pulling HTTP/HTTPS URIs
> 
>
> Key: SPARK-6081
> URL: https://issues.apache.org/jira/browse/SPARK-6081
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.0.0
>Reporter: Timothy Chen
>Priority: Minor
>
> According to the docs, standalone cluster mode supports specifying http|https 
> jar URLs, but the DriverRunner is actually unable to pull the HTTP URIs passed 
> to it, because it uses a Hadoop FileSystem get.
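A rough sketch of the kind of fallback being asked for, in plain Scala (this is not the actual DriverRunner code; the object name, helper name, and paths are made up): fetch http(s) jar URIs with ordinary Java IO instead of going through a Hadoop FileSystem get.

{code}
import java.net.URL
import java.nio.file.{Files, Paths, StandardCopyOption}

object HttpJarFetch {
  // Illustrative helper only: download an http/https jar to a local path.
  def fetchHttpJar(uri: String, destination: String): Unit = {
    val in = new URL(uri).openStream()
    try Files.copy(in, Paths.get(destination), StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
  }
}

// Hypothetical usage:
// HttpJarFetch.fetchHttpJar("https://example.com/app.jar", "/tmp/app.jar")
{code}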



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6469) Improving documentation on YARN local directories usage

2015-03-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6469:
-
Affects Version/s: 1.0.0

> Improving documentation on YARN local directories usage
> ---
>
> Key: SPARK-6469
> URL: https://issues.apache.org/jira/browse/SPARK-6469
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, YARN
>Affects Versions: 1.0.0
>Reporter: Christophe Préaud
>Assignee: Christophe Préaud
>Priority: Minor
> Fix For: 1.3.1, 1.4.0
>
> Attachments: TestYarnVars.scala
>
>
> According to the [Spark YARN doc 
> page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
>  Spark executors will use the local directories configured for YARN, not 
> {{spark.local.dir}} which should be ignored.
> However it should be noted that in yarn-client mode, though the executors 
> will indeed use the local directories configured for YARN, the driver will 
> not, because it is not running on the YARN cluster; the driver in yarn-client 
> will use the local directories defined in {{spark.local.dir}}
> Can this please be clarified in the Spark YARN documentation above?
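A small sketch of the practical consequence described above, assuming a yarn-client deployment (the directory path is made up): the driver's scratch space comes from spark.local.dir, so it is worth setting explicitly if the default is unsuitable, while executors keep using the YARN-configured directories.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// In yarn-client mode the driver runs outside the YARN cluster, so (as the
// description above notes) its scratch directory comes from spark.local.dir.
val conf = new SparkConf()
  .setAppName("local-dir-example")
  .set("spark.local.dir", "/data/spark-tmp")   // hypothetical driver-side path

val sc = new SparkContext(conf)
// Executors launched on YARN still use the node managers' local directories.
{code}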



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6469) Improving documentation on YARN local directories usage

2015-03-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6469:
-
Component/s: Documentation

> Improving documentation on YARN local directories usage
> ---
>
> Key: SPARK-6469
> URL: https://issues.apache.org/jira/browse/SPARK-6469
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, YARN
>Affects Versions: 1.0.0
>Reporter: Christophe Préaud
>Assignee: Christophe Préaud
>Priority: Minor
> Fix For: 1.3.1, 1.4.0
>
> Attachments: TestYarnVars.scala
>
>
> According to the [Spark YARN doc 
> page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
>  Spark executors will use the local directories configured for YARN, not 
> {{spark.local.dir}} which should be ignored.
> However it should be noted that in yarn-client mode, though the executors 
> will indeed use the local directories configured for YARN, the driver will 
> not, because it is not running on the YARN cluster; the driver in yarn-client 
> will use the local directories defined in {{spark.local.dir}}
> Can this please be clarified in the Spark YARN documentation above?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6469) Improving documentation on YARN local directories usage

2015-03-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-6469.

  Resolution: Fixed
   Fix Version/s: 1.4.0
  1.3.1
Assignee: Christophe Préaud
Target Version/s: 1.3.1, 1.4.0

> Improving documentation on YARN local directories usage
> ---
>
> Key: SPARK-6469
> URL: https://issues.apache.org/jira/browse/SPARK-6469
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, YARN
>Affects Versions: 1.0.0
>Reporter: Christophe Préaud
>Assignee: Christophe Préaud
>Priority: Minor
> Fix For: 1.3.1, 1.4.0
>
> Attachments: TestYarnVars.scala
>
>
> According to the [Spark YARN doc 
> page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
>  Spark executors will use the local directories configured for YARN, not 
> {{spark.local.dir}} which should be ignored.
> However it should be noted that in yarn-client mode, though the executors 
> will indeed use the local directories configured for YARN, the driver will 
> not, because it is not running on the YARN cluster; the driver in yarn-client 
> will use the local directories defined in {{spark.local.dir}}
> Can this please be clarified in the Spark YARN documentation above?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6512) Add contains to OpenHashMap

2015-03-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6512.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5171
[https://github.com/apache/spark/pull/5171]

> Add contains to OpenHashMap
> ---
>
> Key: SPARK-6512
> URL: https://issues.apache.org/jira/browse/SPARK-6512
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
> Fix For: 1.4.0
>
>
> Add `contains` to test whether a key exists in an OpenHashMap.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6515) Use while(true) in OpenHashSet.getPos

2015-03-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6515:
-
Priority: Minor  (was: Major)

> Use while(true) in OpenHashSet.getPos
> -
>
> Key: SPARK-6515
> URL: https://issues.apache.org/jira/browse/SPARK-6515
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6515) Use while(true) in OpenHashSet.getPos

2015-03-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6515:
-
Description: Though I don't see any bug in the existing code, using `while 
(true)` makes the code read better.

> Use while(true) in OpenHashSet.getPos
> -
>
> Key: SPARK-6515
> URL: https://issues.apache.org/jira/browse/SPARK-6515
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
>
> Though I don't see any bug in the existing code, using `while (true)` makes 
> the code read better.
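A generic sketch of the style change being proposed, shown on a simplified linear-probing lookup (this is not the actual OpenHashSet.getPos code, only an illustration of the {{while (true)}} form with explicit exits):

{code}
// Simplified probing over a key array; -1 marks an empty slot.
// With `while (true)` plus explicit returns, the loop condition no longer has
// to encode both termination cases (key found / slot empty).
def getPos(keys: Array[Int], key: Int, mask: Int): Int = {
  var pos = key & mask
  var delta = 1
  while (true) {
    if (keys(pos) == -1) return -1          // empty slot: key is absent
    if (keys(pos) == key) return pos        // found the key
    pos = (pos + delta) & mask              // keep probing
    delta += 1
  }
  -1  // never reached; satisfies the declared return type
}
{code}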



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6515) Use while(true) in OpenHashSet.getPos

2015-03-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6515:
-
Summary: Use while(true) in OpenHashSet.getPos  (was: OpenHashSet returns 
invalid position when the data size is 1)

> Use while(true) in OpenHashSet.getPos
> -
>
> Key: SPARK-6515
> URL: https://issues.apache.org/jira/browse/SPARK-6515
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6515) OpenHashSet returns invalid position when the data size is 1

2015-03-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6515:
-
Target Version/s: 1.4.0  (was: 1.1.2, 1.2.2, 1.3.1, 1.4.0)

> OpenHashSet returns invalid position when the data size is 1
> 
>
> Key: SPARK-6515
> URL: https://issues.apache.org/jira/browse/SPARK-6515
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6515) OpenHashSet returns invalid position when the data size is 1

2015-03-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378967#comment-14378967
 ] 

Apache Spark commented on SPARK-6515:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/5176

> OpenHashSet returns invalid position when the data size is 1
> 
>
> Key: SPARK-6515
> URL: https://issues.apache.org/jira/browse/SPARK-6515
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6515) OpenHashSet returns invalid position when the data size is 1

2015-03-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6515:
-
Issue Type: Improvement  (was: Bug)

> OpenHashSet returns invalid position when the data size is 1
> 
>
> Key: SPARK-6515
> URL: https://issues.apache.org/jira/browse/SPARK-6515
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6515) OpenHashSet returns invalid position when the data size is 1

2015-03-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6515:
-
Affects Version/s: (was: 1.2.1)
   (was: 1.3.0)
   (was: 1.1.1)

> OpenHashSet returns invalid position when the data size is 1
> 
>
> Key: SPARK-6515
> URL: https://issues.apache.org/jira/browse/SPARK-6515
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-5771) Number of Cores in Completed Applications of Standalone Master Web Page always be 0 if sc.stop() is called

2015-03-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reopened SPARK-5771:
--

> Number of Cores in Completed Applications of Standalone Master Web Page 
> always be 0 if sc.stop() is called
> --
>
> Key: SPARK-5771
> URL: https://issues.apache.org/jira/browse/SPARK-5771
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.1
>Reporter: Liangliang Gu
>Assignee: Liangliang Gu
>Priority: Minor
>
> In Standalone mode, the number of cores shown under Completed Applications on 
> the Master web page will always be zero if sc.stop() is called, but it is 
> correct if sc.stop() is not called.
> The likely reason: after sc.stop() is called, the removeExecutor method of 
> class ApplicationInfo is invoked, which reduces the variable coresGranted to 
> zero. The variable coresGranted is used to display the number of cores on the 
> web page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5771) Number of Cores in Completed Applications of Standalone Master Web Page always be 0 if sc.stop() is called

2015-03-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5771:
-
Fix Version/s: (was: 1.4.0)

> Number of Cores in Completed Applications of Standalone Master Web Page 
> always be 0 if sc.stop() is called
> --
>
> Key: SPARK-5771
> URL: https://issues.apache.org/jira/browse/SPARK-5771
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.1
>Reporter: Liangliang Gu
>Assignee: Liangliang Gu
>Priority: Minor
>
> In Standalone mode, the number of cores shown under Completed Applications on 
> the Master web page will always be zero if sc.stop() is called, but it is 
> correct if sc.stop() is not called.
> The likely reason: after sc.stop() is called, the removeExecutor method of 
> class ApplicationInfo is invoked, which reduces the variable coresGranted to 
> zero. The variable coresGranted is used to display the number of cores on the 
> web page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6502) HiveThriftServer2 fails to inspect underlying Hive version when compiled against Hive 0.12.0

2015-03-24 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6502:
--
Shepherd: Cheng Lian

> HiveThriftServer2 fails to inspect underlying Hive version when compiled 
> against Hive 0.12.0
> 
>
> Key: SPARK-6502
> URL: https://issues.apache.org/jira/browse/SPARK-6502
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Cheng Lian
>
> While initializing the {{SparkContext}} in {{HiveThriftServer2}}, the underlying 
> Hive version is set in the {{SparkConf}} as {{spark.sql.hive.version}}, so 
> that users can query the Hive version via {{SET spark.sql.hive.version;}}.
> When compiled against Hive 0.12.0, the server replies {{}} when 
> users query this property.
> Hive 0.13.1 is fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6505) Remove the reflection call in HiveFunctionWrapper

2015-03-24 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6505:
--
Shepherd: Cheng Lian

> Remove the reflection call in HiveFunctionWrapper
> -
>
> Key: SPARK-6505
> URL: https://issues.apache.org/jira/browse/SPARK-6505
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1, 1.3.0
>Reporter: Cheng Lian
>Assignee: Cheng Hao
>Priority: Minor
>
> While trying to fix SPARK-4785, we introduced {{HiveFunctionWrapper}} and 
> added two not-so-necessary reflection calls there. These calls have caused 
> some dependency hell problems for the MapR distribution of Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6503) Create Jenkins builder for testing Spark SQL with Hive 0.12.0

2015-03-24 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6503:
--
Shepherd: Cheng Lian

> Create Jenkins builder for testing Spark SQL with Hive 0.12.0
> -
>
> Key: SPARK-6503
> URL: https://issues.apache.org/jira/browse/SPARK-6503
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Cheng Lian
>
> Currently, from the perspective of Spark SQL, the {{dev/run-tests}} script 
> does the following steps to build and test Spark SQL:
> # Builds Spark SQL against Hive 0.12.0 to check for compilation errors
> # Cleans the build
> # Builds Spark SQL against Hive 0.13.1 to check for compilation errors.
> # Runs unit tests against Hive 0.13.1
> Apparently, Spark SQL with Hive 0.12.0 is not tested.
> Two improvements could be done here:
> # When executed, {{dev/run-tests}} should always build and test a single 
> version of Hive. The version could be passed in as an environment variable.
> # Separate Jenkins builders should be set up to test the Hive 0.12.0 code paths.
> We probably only want the PR builder to run against Hive 0.13.1 to minimize 
> build time, and let the master builder take care of both Hive versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6507) Create separate Hive Driver instance for each SQL query in HiveThriftServer2

2015-03-24 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6507:
--
Shepherd: Cheng Lian

> Create separate Hive Driver instance for each SQL query in HiveThriftServer2
> 
>
> Key: SPARK-6507
> URL: https://issues.apache.org/jira/browse/SPARK-6507
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
>Reporter: Cheng Lian
>
> In the current implementation of HiveThriftServer2, Hive {{Driver}} instances 
> are cached and reused among queries. However, {{Driver}} is not thread-safe, 
> and may cause race conditions. In SPARK-4908, we synchronized 
> {{HiveContext.runHive}} to avoid this issue, but this affects concurrency 
> negatively, because no two native commands can be executed concurrently. This 
> is pretty bad for heavy commands like ANALYZE.
> Please refer to [this 
> comment|https://issues.apache.org/jira/browse/SPARK-4908?focusedCommentId=14264469&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14264469]
>  in SPARK-4908 for details.
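A minimal Scala sketch of the two designs being contrasted above (purely illustrative; {{FakeDriver}} stands in for Hive's non-thread-safe Driver, and none of this is the actual Thrift server code):

{code}
// Stand-in for Hive's Driver, which is not thread-safe.
class FakeDriver {
  def run(sql: String): Unit = { /* compile and execute the statement */ }
}

// Current approach (per the description above): one shared instance with
// serialized access, so a heavy command blocks every other query.
object SharedDriver {
  private val driver = new FakeDriver
  def run(sql: String): Unit = synchronized { driver.run(sql) }
}

// Proposed approach: a fresh instance per query, so queries can run
// concurrently at the cost of re-creating the driver each time.
object PerQueryDriver {
  def run(sql: String): Unit = new FakeDriver().run(sql)
}
{code}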



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6387) HTTP mode of HiveThriftServer2 doesn't work when built with Hive 0.12.0

2015-03-24 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6387:
--
Shepherd: Cheng Lian

> HTTP mode of HiveThriftServer2 doesn't work when built with Hive 0.12.0
> ---
>
> Key: SPARK-6387
> URL: https://issues.apache.org/jira/browse/SPARK-6387
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.2.1, 1.3.0
>Reporter: Cheng Lian
>
> Reproduction steps:
> # Compile Spark against Hive 0.12.0
>   {noformat}$ ./build/sbt 
> -Pyarn,hadoop-2.4,hive,hive-thriftserver,hive-0.12.0,scala-2.10 
> -Dhadoop.version=2.4.1 clean assembly/assembly{noformat}
> # Start the Thrift server in HTTP mode
>   Add the following stanza in {{hive-site.xml}}:
>   {noformat}
> <property>
>   <name>hive.server2.transport.mode</name>
>   <value>http</value>
> </property>
> {noformat}
>   and
>   {noformat}$ ./bin/start-thriftserver.sh{noformat}
> # Connect to the Thrift server via Beeline
>   {noformat}$ ./bin/beeline -u 
> "jdbc:hive2://localhost:10001/default?hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice"{noformat}
> # Execute any query and check the server log
>   We can see that no query execution related logs are output.
> The reason is that, when running under HTTP mode, although we pass in a 
> {{SparkSQLCLIService}} instance 
> ([here|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L102])
>  to {{ThriftHttpCLIService}}, Hive 0.12.0 just ignores it and instantiates a 
> new {{CLIService}} 
> ([here|https://github.com/apache/hive/blob/release-0.12.0/service/src/java/org/apache/hive/service/cli/thrift/ThriftHttpCLIService.java#L91-L92]
>  and 
> [here|https://github.com/apache/hive/blob/release-0.12.0/service/src/java/org/apache/hive/service/cli/thrift/EmbeddedThriftBinaryCLIService.java#L32]).
> Notice that when compiled against Hive 0.13.1, Spark SQL doesn't suffer 
> from this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6501) Blacklist Hive 0.13.1 specific tests when compiled against Hive 0.12.0

2015-03-24 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6501:
--
Shepherd: Cheng Lian

> Blacklist Hive 0.13.1 specific tests when compiled against Hive 0.12.0
> --
>
> Key: SPARK-6501
> URL: https://issues.apache.org/jira/browse/SPARK-6501
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1, 1.3.0
>Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6109) Unit tests fail when compiled against Hive 0.12.0

2015-03-24 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6109:
--
Shepherd: Cheng Lian
Assignee: (was: Cheng Lian)

> Unit tests fail when compiled against Hive 0.12.0
> -
>
> Key: SPARK-6109
> URL: https://issues.apache.org/jira/browse/SPARK-6109
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Cheng Lian
>
> Currently, Jenkins doesn't run unit tests against Hive 0.12.0, and several 
> Hive 0.13.1 specific test cases always fail against Hive 0.12.0. Need to 
> blacklist them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6516) Coupling between default Hadoop versions in Spark build vs. ec2 scripts

2015-03-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6516:


 Summary: Coupling between default Hadoop versions in Spark build 
vs. ec2 scripts
 Key: SPARK-6516
 URL: https://issues.apache.org/jira/browse/SPARK-6516
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor


When we change the default Hadoop version in the Spark build conf and/or in the 
EC2 scripts, we should keep the two in sync.  (When they are out of sync, users 
may be surprised if they create an EC2 cluster, compile Spark on it, and try to 
run that version of Spark.)  Making sure this is set in a single place would help.

An even better fix might be for the Spark build to check what is available and 
adjust the default based on that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6413) For data source tables, we should provide better output for described extended/formatted.

2015-03-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6413:

Assignee: Yin Huai

> For data source tables, we should provide better output for described 
> extended/formatted.
> -
>
> Key: SPARK-6413
> URL: https://issues.apache.org/jira/browse/SPARK-6413
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> Right now, we will show Hive's stuff like SerDe. Users will be confused when 
> they see the output of "DESCRIBE EXTENDED/FORMATTED" and think the table is 
> not stored in the "right" format. Actually, the table is indeed stored in the 
> right format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6450) Native Parquet reader does not assign table name as qualifier

2015-03-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-6450:
---

Assignee: Michael Armbrust  (was: Cheng Lian)

> Native Parquet reader does not assign table name as qualifier
> -
>
> Key: SPARK-6450
> URL: https://issues.apache.org/jira/browse/SPARK-6450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Anand Mohan Tumuluri
>Assignee: Michael Armbrust
>Priority: Blocker
>
> The query below was working fine until 1.3 commit 
> 9a151ce58b3e756f205c9f3ebbbf3ab0ba5b33fd. (Yes, it definitely works at this 
> commit, even though the commit is completely unrelated.)
> It got broken in the 1.3.0 release with an AnalysisException: resolved attributes 
> ... missing from  (although this list contains the fields which it 
> reports missing)
> {code}
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:189)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>   at 
> org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
>   at com.sun.proxy.$Proxy17.executeStatementAsync(Unknown Source)
>   at 
> org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> {code}
> select Orders.Country, Orders.ProductCategory,count(1) from Orders join 
> (select Orders.Country, count(1) CountryOrderCount from Orders where 
> to_date(Orders.PlacedDate) > '2015-01-01' group by Orders.Country order by 
> CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = 
> Orders.Country where to_date(Orders.PlacedDate) > '2015-01-01' group by 
> Orders.Country,Orders.ProductCategory;
> {code}
> The temporary workaround is to add an explicit alias for the table Orders:
> {code}
> select o.Country, o.ProductCategory,count(1) from Orders o join (select 
> r.Country, count(1) CountryOrderCount from Orders r where 
> to_date(r.PlacedDate) > '2015-01-01' group by r.Country order by 
> CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = 
> o.Country where to_date(o.PlacedDate) > '2015-01-01' group by 
> o.Country,o.ProductCategory;
> {code}
> However, this change not only affects self joins; it also seems to affect 
> union queries as well. For example, the query below, which was also working 
> before commit 9a151ce, is now broken:
> {code}
> select Orders.Country,null,count(1) OrderCount from Orders group by 
> Orders.Country,null
> union all
> select null,Orders.ProductCategory,count(1) OrderCount from Orders group by 
> null, Orders.ProductCategory
> {code}
> also fails with an AnalysisException.
> The workaround is to add different a

[jira] [Closed] (SPARK-3570) Shuffle write time does not include time to open shuffle files

2015-03-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3570.

  Resolution: Fixed
   Fix Version/s: 1.4.0
  1.3.1
Target Version/s: 1.3.1, 1.4.0

> Shuffle write time does not include time to open shuffle files
> --
>
> Key: SPARK-3570
> URL: https://issues.apache.org/jira/browse/SPARK-3570
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.2, 1.0.2, 1.1.0
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
> Fix For: 1.3.1, 1.4.0
>
> Attachments: 3a_1410854905_0_job_log_waterfall.pdf, 
> 3a_1410957857_0_job_log_waterfall.pdf
>
>
> Currently, the reported shuffle write time does not include time to open the 
> shuffle files.  This time can be very significant when the disk is highly 
> utilized and many shuffle files exist on the machine (I'm not sure how severe 
> this is in 1.0 onward -- since shuffle files are automatically deleted, this 
> may be less of an issue because there are fewer old files sitting around).  
> In experiments I did, in extreme cases, adding the time to open files can 
> increase the shuffle write time from 5ms (of a 2 second task) to 1 second.  
> We should fix this for better performance debugging.
> Thanks [~shivaram] for helping to diagnose this problem.  cc [~pwendell]
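A back-of-the-envelope Scala sketch of the fix being suggested (names and paths are hypothetical; this is not the actual shuffle writer code): time the file open itself and fold it into the reported write time instead of timing only the writes.

{code}
import java.io.FileOutputStream

// Illustrative only: measure open + write time together.
var shuffleWriteNanos = 0L                                  // hypothetical metric accumulator

val openStart = System.nanoTime()
val out = new FileOutputStream("/tmp/shuffle_0_0_0.data")   // hypothetical shuffle file
shuffleWriteNanos += System.nanoTime() - openStart          // include the open cost

val writeStart = System.nanoTime()
out.write(new Array[Byte](1024))                            // stand-in for serialized records
out.close()
shuffleWriteNanos += System.nanoTime() - writeStart
{code}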



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6088) UI is malformed when tasks fetch remote results

2015-03-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6088:
-
Affects Version/s: 1.3.0

> UI is malformed when tasks fetch remote results
> ---
>
> Key: SPARK-6088
> URL: https://issues.apache.org/jira/browse/SPARK-6088
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
> Fix For: 1.3.1, 1.4.0
>
> Attachments: Screenshot 2015-02-28 18.24.42.png
>
>
> There are three issues when tasks get remote results:
> (1) The status never changes from GET_RESULT to SUCCEEDED
> (2) The time to get the result is shown as the absolute time (resulting in a 
> non-sensical output that says getting the result took >1 million hours) 
> rather than the elapsed time
> (3) The getting result time is included as part of the scheduler delay
> cc [~shivaram]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6088) UI is malformed when tasks fetch remote results

2015-03-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-6088.

  Resolution: Fixed
   Fix Version/s: 1.4.0
  1.3.1
Target Version/s: 1.3.1, 1.4.0  (was: 1.3.0)

> UI is malformed when tasks fetch remote results
> ---
>
> Key: SPARK-6088
> URL: https://issues.apache.org/jira/browse/SPARK-6088
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
> Fix For: 1.3.1, 1.4.0
>
> Attachments: Screenshot 2015-02-28 18.24.42.png
>
>
> There are three issues when tasks get remote results:
> (1) The status never changes from GET_RESULT to SUCCEEDED
> (2) The time to get the result is shown as the absolute time (resulting in a 
> non-sensical output that says getting the result took >1 million hours) 
> rather than the elapsed time
> (3) The getting result time is included as part of the scheduler delay
> cc [~shivaram]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6430) Cannot resolve column correctly when using left semi join

2015-03-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-6430:
---

Assignee: Michael Armbrust

> Cannot resolve column correctly when using left semi join
> --
>
> Key: SPARK-6430
> URL: https://issues.apache.org/jira/browse/SPARK-6430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Spark 1.3.0 on yarn mode
>Reporter: zzc
>Assignee: Michael Armbrust
>
> My code:
> {quote}
> case class TestData(key: Int, value: String)
> case class TestData2(a: Int, b: Int)
> import org.apache.spark.sql.execution.joins._
> import sqlContext.implicits._
> val testData = sc.parallelize(
> (1 to 100).map(i => TestData(i, i.toString))).toDF()
> testData.registerTempTable("testData")
> val testData2 = sc.parallelize(
>   TestData2(1, 1) ::
>   TestData2(1, 2) ::
>   TestData2(2, 1) ::
>   TestData2(2, 2) ::
>   TestData2(3, 1) ::
>   TestData2(3, 2) :: Nil, 2).toDF()
> testData2.registerTempTable("testData2")
> //val tmp = sqlContext.sql("SELECT * FROM testData *LEFT SEMI JOIN* testData2 
> ON key = a ")
> val tmp = sqlContext.sql("SELECT testData2.b, count(testData2.b) FROM 
> testData *LEFT SEMI JOIN* testData2 ON key = testData2.a group by 
> testData2.b")
> tmp.explain()
> {quote}
> Error log:
> {quote}
> org.apache.spark.sql.AnalysisException: cannot resolve 'testData2.b' given 
> input columns key, value; line 1 pos 108
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
> {quote}
> {quote}SELECT * FROM testData LEFT SEMI JOIN testData2 ON key = a{quote} resolves 
> correctly, while
> {quote}
> SELECT a FROM testData LEFT SEMI JOIN testData2 ON key = a
> SELECT max(value) FROM testData LEFT SEMI JOIN testData2 ON key = a group by b
> SELECT max(value) FROM testData LEFT SEMI JOIN testData2 ON key = testData2.a 
> group by testData2.b
> SELECT testData2.b, count(testData2.b) FROM testData LEFT SEMI JOIN testData2 
> ON key = testData2.a group by testData2.b
> {quote} all fail to resolve.
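
A hedged sketch (not from this ticket): a LEFT SEMI JOIN only exposes the left table's columns in its output, so one rewrite of the failing aggregation that does resolve is to use a plain join instead:

{code}
// Hedged sketch: testData2's columns are only visible when testData2 takes part
// in a regular join, so the grouping on testData2.b can be expressed as
val tmp2 = sqlContext.sql(
  "SELECT testData2.b, count(testData2.b) " +
  "FROM testData JOIN testData2 ON key = testData2.a " +
  "GROUP BY testData2.b")
tmp2.explain()
{code}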



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6152) Spark does not support Java 8 compiled Scala classes

2015-03-24 Thread Martin Grotzke (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378918#comment-14378918
 ] 

Martin Grotzke commented on SPARK-6152:
---

Btw, we just released kryo 3.0.1: 
https://github.com/EsotericSoftware/kryo/blob/master/CHANGES.md#2240---300-2014-0-04

> Spark does not support Java 8 compiled Scala classes
> 
>
> Key: SPARK-6152
> URL: https://issues.apache.org/jira/browse/SPARK-6152
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: Java 8+
> Scala 2.11
>Reporter: Ronald Chen
>Priority: Minor
>
> Spark uses reflectasm to inspect Scala closures, which fails if the *user 
> defined Scala closures* are compiled to the Java 8 class-file version.
> The cause is that reflectasm does not support Java 8:
> https://github.com/EsotericSoftware/reflectasm/issues/35
> Workaround:
> Don't compile Scala classes to the Java 8 class-file version; Scala 2.11 
> neither supports nor requires any Java 8 features (a build-setting sketch 
> follows the stack trace below).
> Stack trace:
> {code}
> java.lang.IllegalArgumentException
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.(Unknown
>  Source)
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.(Unknown
>  Source)
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.(Unknown
>  Source)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$getClassReader(ClosureCleaner.scala:41)
>   at 
> org.apache.spark.util.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:84)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:107)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:1478)
>   at org.apache.spark.rdd.RDD.map(RDD.scala:288)
>   at ...my Scala 2.11 compiled to Java 8 code calling into spark
> {code}
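
A minimal sketch of the workaround above, assuming an sbt build (the standard scalac/javac flags for targeting Java 7 bytecode; other build tools have equivalent settings):

{code}
// Hedged sketch for build.sbt: keep emitted class files at the Java 7
// class-file version so reflectasm can still read the user-defined closures.
scalacOptions += "-target:jvm-1.7"
javacOptions ++= Seq("-source", "1.7", "-target", "1.7")
{code}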



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6510) Add Graph#minus method to act as Set#difference

2015-03-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378913#comment-14378913
 ] 

Apache Spark commented on SPARK-6510:
-

User 'brennonyork' has created a pull request for this issue:
https://github.com/apache/spark/pull/5175

> Add Graph#minus method to act as Set#difference
> ---
>
> Key: SPARK-6510
> URL: https://issues.apache.org/jira/browse/SPARK-6510
> Project: Spark
>  Issue Type: Improvement
>Reporter: Brennon York
>
> Right now GraphX does not have a Set#difference method that operates on 
> VertexIds. We do, however, have a {{diff}} method, although that works on 
> values. Given the tombstoning optimizations already present, this method 
> can be implemented very efficiently.
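
A hedged usage sketch of the proposed operation; the {{minus}} signature here is an assumption tied to this proposal, not an API that exists in 1.3:

{code}
import org.apache.spark.graphx._

// Two small vertex sets in the shell (sc assumed); the proposed minus would keep
// ids present in setA but absent from setB, ignoring values (unlike diff, which
// compares values). The minus call itself is the proposed API, so this is a
// sketch of intended usage rather than something that compiles today.
val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(Seq((1L, 1), (2L, 1), (3L, 1))))
val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(Seq((2L, 9), (3L, 9))))
val onlyInA = setA.minus(setB)   // expected to contain only vertex id 1
{code}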



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-24 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378877#comment-14378877
 ] 

Yu Ishikawa commented on SPARK-2429:


I got it. Thanks!

> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: clustering
> Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean, 
> such as negative dot product or cosine, is necessary.
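
A hedged sketch of the first approach in the list above (top-down, recursive application of KMeans) using the existing MLlib API; an illustration of the idea, not the implementation attached to this ticket:

{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Repeatedly bisect the most populous cluster with k=2 KMeans until the
// requested number of leaf clusters exists (a simple divisive strategy).
def bisectingKMeans(data: RDD[Vector], numLeaves: Int): Seq[RDD[Vector]] = {
  var clusters: Seq[RDD[Vector]] = Seq(data.cache())
  while (clusters.size < numLeaves) {
    val target = clusters.maxBy(_.count())               // pick the biggest cluster
    val model = KMeans.train(target, 2, 20)              // k = 2, 20 iterations
    val halves = (0 until 2).map(i => target.filter(v => model.predict(v) == i))
    clusters = clusters.filterNot(_ eq target) ++ halves
  }
  clusters
}
{code}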



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5508) Arrays and Maps stored with Hive Parquet Serde may not be readable by the Parquet support in the Data Source API

2015-03-24 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378848#comment-14378848
 ] 

Ryan Blue commented on SPARK-5508:
--

[~yhuai], we've been working to standardize nested types lately so we now have 
a committed spec for how lists and maps should be written. In addition, we've 
identified all of the ways they have been represented in Parquet and set some 
backward-compatibility rules. The Parquet issue is PARQUET-113 and you can look 
at the rules in [the 
spec|https://github.com/apache/incubator-parquet-format/blob/master/LogicalTypes.md].
 I implemented those rules in Hive, parquet-avro, and parquet-thrift, so feel 
free to ping me if you have questions.

> Arrays and Maps stored with Hive Parquet Serde may not be readable by the 
> Parquet support in the Data Source API
> ---
>
> Key: SPARK-5508
> URL: https://issues.apache.org/jira/browse/SPARK-5508
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
> Environment: mesos, cdh
>Reporter: Ayoub Benali
>Assignee: Cheng Lian
>Priority: Critical
>  Labels: hivecontext, parquet
>
> *The root cause of this bug is explained below ([see 
> here|https://issues.apache.org/jira/secure/EditComment!default.jspa?id=12771559&commentId=14368505]).*
>  
> *The workaround of this issue is to set the following confs*
> {code}
> sql("set spark.sql.parquet.useDataSourceApi=false")
> sql("set spark.sql.hive.convertMetastoreParquet=false")
> {code}
> *Below is the original description.*
> When the table is saved as Parquet, we cannot query a field which is an array 
> of structs after an INSERT statement, as shown below:  
> {noformat}
> scala> val data1="""{
>  | "timestamp": 1422435598,
>  | "data_array": [
>  | {
>  | "field1": 1,
>  | "field2": 2
>  | }
>  | ]
>  | }"""
> scala> val data2="""{
>  | "timestamp": 1422435598,
>  | "data_array": [
>  | {
>  | "field1": 3,
>  | "field2": 4
>  | }
>  | ]
>  | }"""
> scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
> scala> val rdd = hiveContext.jsonRDD(jsonRDD)
> scala> rdd.printSchema
> root
>  |-- data_array: array (nullable = true)
>  ||-- element: struct (containsNull = false)
>  |||-- field1: integer (nullable = true)
>  |||-- field2: integer (nullable = true)
>  |-- timestamp: integer (nullable = true)
> scala> rdd.registerTempTable("tmp_table")
> scala> hiveContext.sql("select data.field1 from tmp_table LATERAL VIEW 
> explode(data_array) nestedStuff AS data").collect
> res3: Array[org.apache.spark.sql.Row] = Array([1], [3])
> scala> hiveContext.sql("SET hive.exec.dynamic.partition = true")
> scala> hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
> scala> hiveContext.sql("set parquet.compression=GZIP")
> scala> hiveContext.setConf("spark.sql.parquet.binaryAsString", "true")
> scala> hiveContext.sql("create external table if not exists 
> persisted_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>, 
> timestamp INT) STORED AS PARQUET Location 'hdfs:///test_table'")
> scala> hiveContext.sql("insert into table persisted_table select * from 
> tmp_table").collect
> scala> hiveContext.sql("select data.field1 from persisted_table LATERAL VIEW 
> explode(data_array) nestedStuff AS data").collect
> parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in 
> file hdfs://*/test_table/part-1
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
>   at 
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
>   at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.c

[jira] [Commented] (SPARK-6481) Set "In Progress" when a PR is opened for an issue

2015-03-24 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378838#comment-14378838
 ] 

Nicholas Chammas commented on SPARK-6481:
-

Since there is no guaranteed way to map GitHub usernames to JIRA usernames, 
what should we do about the JIRA assignee?

A JIRA issue needs an assignee in order to be marked "In Progress". We can have 
the script:
# always assign the issue to the Apache Spark user
# keep it assigned to whoever has it assigned, if any (this may be different 
from the PR user)
# in the case of no current assignee, assign to Apache Spark just to mark the 
JIRA in progress, then remove assignee

Any preferences [~marmbrus] / [~pwendell]?

> Set "In Progress" when a PR is opened for an issue
> --
>
> Key: SPARK-6481
> URL: https://issues.apache.org/jira/browse/SPARK-6481
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Michael Armbrust
>Assignee: Nicholas Chammas
>
> [~pwendell] and I are not sure if this is possible, but it would be really 
> helpful if the JIRA status was updated to "In Progress" when we do the 
> linking to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6476) Spark fileserver not started on same IP as using spark.driver.host

2015-03-24 Thread Rares Vernica (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rares Vernica resolved SPARK-6476.
--
Resolution: Not a Problem

After investigating more and trying the suggested change, I think this is not a 
problem. The server is not listening to a specific IP so any of the IPs can be 
used to access the server. The spark.fileserver.uri reported lists only one of 
the IPs.

The original problem reported on the mailing list is not caused by this. The 
original problem was due to a misconfiguration of the input file used in the 
Spark job.

> Spark fileserver not started on same IP as using spark.driver.host
> --
>
> Key: SPARK-6476
> URL: https://issues.apache.org/jira/browse/SPARK-6476
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
>Reporter: Rares Vernica
>
> I initially inquired about this here: 
> http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E
> If the Spark driver host has multiple IPs and spark.driver.host is set to one 
> of them, I would expect the fileserver to start on the same IP. I checked 
> HttpServer and the jetty Server is started on the default IP of the machine: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75
> Something like this might work instead:
> {code:title=HttpServer.scala#L75}
> val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 
> 0))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6515) OpenHashSet returns invalid position when the data size is 1

2015-03-24 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-6515:


 Summary: OpenHashSet returns invalid position when the data size 
is 1
 Key: SPARK-6515
 URL: https://issues.apache.org/jira/browse/SPARK-6515
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.2.1, 1.1.1
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6385) ISO 8601 timestamp parsing does not support arbitrary precision second fractions

2015-03-24 Thread Nick Bruun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378836#comment-14378836
 ] 

Nick Bruun commented on SPARK-6385:
---

Strictly speaking, the ISO 8601 standard does not define a fixed precision for 
decimal fractions of seconds (or of minutes or hours, for that matter). Many 
sources of JSON data output second fractions with greater than millisecond 
precision (whether that precision is meaningful is a different matter), so in 
my opinion Spark should at least support this (and also shorter notations, 
where trailing zeros have been trimmed), if not the entire ISO 8601 date/time 
standard, although that *is* probably erring on the side of pedantic. 
Alternatively, this could be implemented as a standalone library, but that 
raises the question of library dependencies in Spark.

> ISO 8601 timestamp parsing does not support arbitrary precision second 
> fractions
> 
>
> Key: SPARK-6385
> URL: https://issues.apache.org/jira/browse/SPARK-6385
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Nick Bruun
>Priority: Minor
>
> The ISO 8601 timestamp parsing implemented as a resolution to SPARK-4149 does 
> not support arbitrary precision fractions of seconds, only millisecond 
> precision. Parsing {{2015-02-02T00:00:07.900GMT-00:00}} will succeed, while 
> {{2015-02-02T00:00:07.9000GMT-00:00}} will fail.
> The issue is caused by the fixed precision of the parsed format in 
> [DataTypeConversions.scala#L66|https://github.com/apache/spark/blob/84acd08e0886aa23195f35837c15c09aa7804aff/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataTypeConversions.scala#L66].
>  I'm willing to implement a fix, but pointers on the direction would be 
> appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself

2015-03-24 Thread Chris Fregly (JIRA)
Chris Fregly created SPARK-6514:
---

 Summary: For Kinesis Streaming, use the same region for DynamoDB 
(KCL checkpoints) as the Kinesis stream itself  
 Key: SPARK-6514
 URL: https://issues.apache.org/jira/browse/SPARK-6514
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Chris Fregly


This was not supported when I originally wrote this receiver.

It is now supported. Also, upgrade to the latest Kinesis Client Library 
(KCL), which is 1.2, I believe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6513) Add zipWithUniqueId (and other RDD APIs) to RDDApi

2015-03-24 Thread Eran Medan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eran Medan updated SPARK-6513:
--
Description: 
It will be nice if we could treat a Dataframe just like an RDD (wherever it 
makes sense) 

*Worked in 1.2.1*
{code}
 val sqlContext = new HiveContext(sc)
 import sqlContext._
 val jsonRDD = sqlContext.jsonFile(jsonFilePath)
 jsonRDD.registerTempTable("jsonTable")

 val jsonResult = sql(s"select * from jsonTable")
 val foo = jsonResult.zipWithUniqueId().map {
   case (Row(...), uniqueId) => // do something useful
   ...
 }

 foo.registerTempTable("...")

{code}

*Stopped working in 1.3.0* 
{code}   
jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method
{code}

**Not working workaround:**

although this might give me an {{RDD\[Row\]}}:
{code}
jsonResult.rdd.zipWithUniqueId()  
{code}

Now this won't work obviously since {{RDD\[Row\]}} does not have a 
{{registerTempTable}} method of course
{code}
 foo.registerTempTable("...")
{code}

(see related SO question: 
http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)

EDIT: changed from issue to enhancement request 

  was:
It will be nice if we could treat a Dataframe just like an RDD (wherever it 
makes sense) 

*Worked in 1.2.1*
{code}
 val sqlContext = new HiveContext(sc)
 import sqlContext._
 val jsonRDD = sqlContext.jsonFile(jsonFilePath)
 jsonRDD.registerTempTable("jsonTable")

 val jsonResult = sql(s"select * from jsonTable")
 val foo = jsonResult.zipWithUniqueId().map {
   case (Row(...), uniqueId) => // do something useful
   ...
 }

 foo.registerTempTable("...")

{code}

*Stopped working in 1.3.0* 
{code}   
jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method
{code}

**Not working workaround:**

although this might give me an {{RDD\[Row\]}}:
{code}
jsonResult.map(identity).zipWithUniqueId()  
{code}

Now this won't work obviously since {{RDD\[Row\]}} does not have a 
{{registerTempTable}} method of course
{code}
 foo.registerTempTable("...")
{code}

(see related SO question: 
http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)

EDIT: changed from issue to enhancement request 


> Add zipWithUniqueId (and other RDD APIs) to RDDApi
> --
>
> Key: SPARK-6513
> URL: https://issues.apache.org/jira/browse/SPARK-6513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I 
> don't think it's relevant)
>Reporter: Eran Medan
>Priority: Minor
>
> It will be nice if we could treat a Dataframe just like an RDD (wherever it 
> makes sense) 
> *Worked in 1.2.1*
> {code}
>  val sqlContext = new HiveContext(sc)
>  import sqlContext._
>  val jsonRDD = sqlContext.jsonFile(jsonFilePath)
>  jsonRDD.registerTempTable("jsonTable")
>  val jsonResult = sql(s"select * from jsonTable")
>  val foo = jsonResult.zipWithUniqueId().map {
>case (Row(...), uniqueId) => // do something useful
>...
>  }
>  foo.registerTempTable("...")
> {code}
> *Stopped working in 1.3.0* 
> {code}   
> jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method
> {code}
> **Not working workaround:**
> although this might give me an {{RDD\[Row\]}}:
> {code}
> jsonResult.rdd.zipWithUniqueId()  
> {code}
> Now this won't work obviously since {{RDD\[Row\]}} does not have a 
> {{registerTempTable}} method of course
> {code}
>  foo.registerTempTable("...")
> {code}
> (see related SO question: 
> http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)
> EDIT: changed from issue to enhancement request 
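
A hedged sketch (not from this ticket) of a 1.3-style workaround, assuming the {{sqlContext}} and {{jsonResult}} from the description: zip on the underlying {{RDD[Row]}}, then rebuild a DataFrame so {{registerTempTable}} is available again.

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Zip on the plain RDD[Row] and fold the generated id back into each Row.
val zipped = jsonResult.rdd.zipWithUniqueId().map {
  case (row, uniqueId) => Row.fromSeq(row.toSeq :+ uniqueId)
}

// Rebuild a DataFrame with the extra column so it can be registered as a table.
val schemaWithId = StructType(jsonResult.schema.fields :+
  StructField("uniqueId", LongType, nullable = false))
val withId = sqlContext.createDataFrame(zipped, schemaWithId)
withId.registerTempTable("jsonTableWithId")
{code}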



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6481) Set "In Progress" when a PR is opened for an issue

2015-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6481:
---

Assignee: Nicholas Chammas  (was: Apache Spark)

> Set "In Progress" when a PR is opened for an issue
> --
>
> Key: SPARK-6481
> URL: https://issues.apache.org/jira/browse/SPARK-6481
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Michael Armbrust
>Assignee: Nicholas Chammas
>
> [~pwendell] and I are not sure if this is possible, but it would be really 
> helpful if the JIRA status was updated to "In Progress" when we do the 
> linking to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6385) ISO 8601 timestamp parsing does not support arbitrary precision second fractions

2015-03-24 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378816#comment-14378816
 ] 

Michael Armbrust commented on SPARK-6385:
-

Oh, I see.  Looks like this is actually correct, but not what you want: 
http://stackoverflow.com/questions/12000673/string-date-conversion-with-nanoseconds.

Is the format you are describing part of the standard?  I'm not opposed to us 
doing something custom (assuming it's well tested) if we have to, but I'd like 
to avoid adding too many non-standard semantics.

> ISO 8601 timestamp parsing does not support arbitrary precision second 
> fractions
> 
>
> Key: SPARK-6385
> URL: https://issues.apache.org/jira/browse/SPARK-6385
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Nick Bruun
>Priority: Minor
>
> The ISO 8601 timestamp parsing implemented as a resolution to SPARK-4149 does 
> not support arbitrary precision fractions of seconds, only millisecond 
> precision. Parsing {{2015-02-02T00:00:07.900GMT-00:00}} will succeed, while 
> {{2015-02-02T00:00:07.9000GMT-00:00}} will fail.
> The issue is caused by the fixed precision of the parsed format in 
> [DataTypeConversions.scala#L66|https://github.com/apache/spark/blob/84acd08e0886aa23195f35837c15c09aa7804aff/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataTypeConversions.scala#L66].
>  I'm willing to implement a fix, but pointers on the direction would be 
> appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6513) Add zipWithUniqueId (and other RDD APIs) to RDDApi

2015-03-24 Thread Eran Medan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eran Medan updated SPARK-6513:
--
Summary: Add zipWithUniqueId (and other RDD APIs) to RDDApi  (was: Add 
zipWithUniqueId (and other RDD APIs) to RDDApi.scala)

> Add zipWithUniqueId (and other RDD APIs) to RDDApi
> --
>
> Key: SPARK-6513
> URL: https://issues.apache.org/jira/browse/SPARK-6513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I 
> don't think it's relevant)
>Reporter: Eran Medan
>Priority: Minor
>
> It will be nice if we could treat a Dataframe just like an RDD (wherever it 
> makes sense) 
> *Worked in 1.2.1*
> {code}
>  val sqlContext = new HiveContext(sc)
>  import sqlContext._
>  val jsonRDD = sqlContext.jsonFile(jsonFilePath)
>  jsonRDD.registerTempTable("jsonTable")
>  val jsonResult = sql(s"select * from jsonTable")
>  val foo = jsonResult.zipWithUniqueId().map {
>case (Row(...), uniqueId) => // do something useful
>...
>  }
>  foo.registerTempTable("...")
> {code}
> *Stopped working in 1.3.0* 
> {code}   
> jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method
> {code}
> **Not working workaround:**
> although this might give me an {{RDD\[Row\]}}:
> {code}
> jsonResult.map(identity).zipWithUniqueId()  
> {code}
> Now this won't work obviously since {{RDD\[Row\]}} does not have a 
> {{registerTempTable}} method of course
> {code}
>  foo.registerTempTable("...")
> {code}
> (see related SO question: 
> http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)
> EDIT: changed from issue to enhancement request 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6513) Add zipWithUniqueId (and other RDD APIs) to RDDApi.scala

2015-03-24 Thread Eran Medan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eran Medan updated SPARK-6513:
--
Description: 
It will be nice if we could treat a Dataframe just like an RDD (wherever it 
makes sense) 

*Worked in 1.2.1*
{code}
 val sqlContext = new HiveContext(sc)
 import sqlContext._
 val jsonRDD = sqlContext.jsonFile(jsonFilePath)
 jsonRDD.registerTempTable("jsonTable")

 val jsonResult = sql(s"select * from jsonTable")
 val foo = jsonResult.zipWithUniqueId().map {
   case (Row(...), uniqueId) => // do something useful
   ...
 }

 foo.registerTempTable("...")

{code}

*Stopped working in 1.3.0* 
{code}   
jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method
{code}

**Not working workaround:**

although this might give me an {{RDD\[Row\]}}:
{code}
jsonResult.map(identity).zipWithUniqueId()  
{code}

Now this won't work obviously since {{RDD\[Row\]}} does not have a 
{{registerTempTable}} method of course
{code}
 foo.registerTempTable("...")
{code}

(see related SO question: 
http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)

EDIT: changed from issue to enhancement request 

  was:
I'm sure this has an Issue somewhere but I can't find it. 

I see this is not a regression bug (since Ap, since it compiled in 1.2.1 but 
stopped in 1.3 without any earlier deprecation warnings, but I am sure the 
authors are well aware, so please change it to an enhancement request if you 
disagree this is a regression. It's such an obvious and blunt regression that I 
doubt it was done without a lot of thought and I'm sure there was a good 
reason, but still it breaks my code and I don't have a workaround :)

Here are the details / steps to reproduce

*Worked in 1.2.1* (without any deprecation warnings)
{code}
 val sqlContext = new HiveContext(sc)
 import sqlContext._
 val jsonRDD = sqlContext.jsonFile(jsonFilePath)
 jsonRDD.registerTempTable("jsonTable")

 val jsonResult = sql(s"select * from jsonTable")
 val foo = jsonResult.zipWithUniqueId().map {
   case (Row(...), uniqueId) => // do something useful
   ...
 }

 foo.registerTempTable("...")

{code}

*Stopped working in 1.3.0* (simply does not compile, and all I did was change 
to 1.3)
{code}   
jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method
{code}

**Not working workaround:**

although this might give me an {{RDD\[Row\]}}:
{code}
jsonResult.map(identity).zipWithUniqueId()  
{code}

Now this won't work obviously since {{RDD\[Row\]}} does not have a 
{{registerTempTable}} method of course
{code}
 foo.registerTempTable("...")
{code}

(see related SO question: 
http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)




> Add zipWithUniqueId (and other RDD APIs) to RDDApi.scala
> 
>
> Key: SPARK-6513
> URL: https://issues.apache.org/jira/browse/SPARK-6513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I 
> don't think it's relevant)
>Reporter: Eran Medan
>Priority: Minor
>
> It will be nice if we could treat a Dataframe just like an RDD (wherever it 
> makes sense) 
> *Worked in 1.2.1*
> {code}
>  val sqlContext = new HiveContext(sc)
>  import sqlContext._
>  val jsonRDD = sqlContext.jsonFile(jsonFilePath)
>  jsonRDD.registerTempTable("jsonTable")
>  val jsonResult = sql(s"select * from jsonTable")
>  val foo = jsonResult.zipWithUniqueId().map {
>case (Row(...), uniqueId) => // do something useful
>...
>  }
>  foo.registerTempTable("...")
> {code}
> *Stopped working in 1.3.0* 
> {code}   
> jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method
> {code}
> **Not working workaround:**
> although this might give me an {{RDD\[Row\]}}:
> {code}
> jsonResult.map(identity).zipWithUniqueId()  
> {code}
> Now this won't work obviously since {{RDD\[Row\]}} does not have a 
> {{registerTempTable}} method of course
> {code}
>  foo.registerTempTable("...")
> {code}
> (see related SO question: 
> http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)
> EDIT: changed from issue to enhancement request 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6513) Regression - missing zipWithUniqueId (and other RDD APIs) in RDDApi.scala

2015-03-24 Thread Eran Medan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eran Medan updated SPARK-6513:
--
Description: 
I'm sure this has an Issue somewhere but I can't find it. 

I see this is not a regression bug (since Ap, since it compiled in 1.2.1 but 
stopped in 1.3 without any earlier deprecation warnings, but I am sure the 
authors are well aware, so please change it to an enhancement request if you 
disagree this is a regression. It's such an obvious and blunt regression that I 
doubt it was done without a lot of thought and I'm sure there was a good 
reason, but still it breaks my code and I don't have a workaround :)

Here are the details / steps to reproduce

*Worked in 1.2.1* (without any deprecation warnings)
{code}
 val sqlContext = new HiveContext(sc)
 import sqlContext._
 val jsonRDD = sqlContext.jsonFile(jsonFilePath)
 jsonRDD.registerTempTable("jsonTable")

 val jsonResult = sql(s"select * from jsonTable")
 val foo = jsonResult.zipWithUniqueId().map {
   case (Row(...), uniqueId) => // do something useful
   ...
 }

 foo.registerTempTable("...")

{code}

*Stopped working in 1.3.0* (simply does not compile, and all I did was change 
to 1.3)
{code}   
jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method
{code}

**Not working workaround:**

although this might give me an {{RDD\[Row\]}}:
{code}
jsonResult.map(identity).zipWithUniqueId()  
{code}

Now this won't work obviously since {{RDD\[Row\]}} does not have a 
{{registerTempTable}} method of course
{code}
 foo.registerTempTable("...")
{code}

(see related SO question: 
http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)



  was:
I'm sure this has an Issue somewhere but I can't find it. 

I see this as a regression bug, since it compiled in 1.2.1 but stopped in 1.3 
without any earlier deprecation warnings, but I am sure the authors are well 
aware, so please change it to an enhancement request if you disagree this is a 
regression. It's such an obvious and blunt regression that I doubt it was done 
without a lot of thought and I'm sure there was a good reason, but still it 
breaks my code and I don't have a workaround :)

Here are the details / steps to reproduce

*Worked in 1.2.1* (without any deprecation warnings)
{code}
 val sqlContext = new HiveContext(sc)
 import sqlContext._
 val jsonRDD = sqlContext.jsonFile(jsonFilePath)
 jsonRDD.registerTempTable("jsonTable")

 val jsonResult = sql(s"select * from jsonTable")
 val foo = jsonResult.zipWithUniqueId().map {
   case (Row(...), uniqueId) => // do something useful
   ...
 }

 foo.registerTempTable("...")

{code}

*Stopped working in 1.3.0* (simply does not compile, and all I did was change 
to 1.3)
{code}   
jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method
{code}

**Not working workaround:**

although this might give me an {{RDD\[Row\]}}:
{code}
jsonResult.map(identity).zipWithUniqueId()  
{code}

Now this won't work obviously since {{RDD\[Row\]}} does not have a 
{{registerTempTable}} method of course
{code}
 foo.registerTempTable("...")
{code}

(see related SO question: 
http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)




> Regression - missing zipWithUniqueId (and other RDD APIs) in RDDApi.scala
> -
>
> Key: SPARK-6513
> URL: https://issues.apache.org/jira/browse/SPARK-6513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I 
> don't think it's relevant)
>Reporter: Eran Medan
>Priority: Minor
>
> I'm sure this has an Issue somewhere but I can't find it. 
> I see this is not a regression bug (since Ap, since it compiled in 1.2.1 but 
> stopped in 1.3 without any earlier deprecation warnings, but I am sure the 
> authors are well aware, so please change it to an enhancement request if you 
> disagree this is a regression. It's such an obvious and blunt regression that 
> I doubt it was done without a lot of thought and I'm sure there was a good 
> reason, but still it breaks my code and I don't have a workaround :)
> Here are the details / steps to reproduce
> *Worked in 1.2.1* (without any deprecation warnings)
> {code}
>  val sqlContext = new HiveContext(sc)
>  import sqlContext._
>  val jsonRDD = sqlContext.jsonFile(jsonFilePath)
>  jsonRDD.registerTempTable("jsonTable")
>  val jsonResult = sql(s"select * from jsonTable")
>  val foo = jsonResult.zipWithUniqueId().map {
>case (Row(...), uniqueId) => // do something useful
>...
>  }
>  foo.registerTempTable("...

[jira] [Updated] (SPARK-6513) Regression - missing zipWithUniqueId (and other RDD APIs) in RDDApi.scala

2015-03-24 Thread Eran Medan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eran Medan updated SPARK-6513:
--
Issue Type: Improvement  (was: Bug)

> Regression - missing zipWithUniqueId (and other RDD APIs) in RDDApi.scala
> -
>
> Key: SPARK-6513
> URL: https://issues.apache.org/jira/browse/SPARK-6513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I 
> don't think it's relevant)
>Reporter: Eran Medan
>Priority: Blocker
>
> I'm sure this has an Issue somewhere but I can't find it. 
> I see this as a regression bug, since it compiled in 1.2.1 but stopped in 1.3 
> without any earlier deprecation warnings, but I am sure the authors are well 
> aware, so please change it to an enhancement request if you disagree this is 
> a regression. It's such an obvious and blunt regression that I doubt it was 
> done without a lot of thought and I'm sure there was a good reason, but still 
> it breaks my code and I don't have a workaround :)
> Here are the details / steps to reproduce
> *Worked in 1.2.1* (without any deprecation warnings)
> {code}
>  val sqlContext = new HiveContext(sc)
>  import sqlContext._
>  val jsonRDD = sqlContext.jsonFile(jsonFilePath)
>  jsonRDD.registerTempTable("jsonTable")
>  val jsonResult = sql(s"select * from jsonTable")
>  val foo = jsonResult.zipWithUniqueId().map {
>case (Row(...), uniqueId) => // do something useful
>...
>  }
>  foo.registerTempTable("...")
> {code}
> *Stopped working in 1.3.0* (simply does not compile, and all I did was change 
> to 1.3)
> {code}   
> jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method
> {code}
> **Not working workaround:**
> although this might give me an {{RDD\[Row\]}}:
> {code}
> jsonResult.map(identity).zipWithUniqueId()  
> {code}
> Now this won't work obviously since {{RDD\[Row\]}} does not have a 
> {{registerTempTable}} method of course
> {code}
>  foo.registerTempTable("...")
> {code}
> (see related SO question: 
> http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6513) Add zipWithUniqueId (and other RDD APIs) to RDDApi.scala

2015-03-24 Thread Eran Medan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eran Medan updated SPARK-6513:
--
Summary: Add zipWithUniqueId (and other RDD APIs) to RDDApi.scala  (was: 
Regression - missing zipWithUniqueId (and other RDD APIs) in RDDApi.scala)

> Add zipWithUniqueId (and other RDD APIs) to RDDApi.scala
> 
>
> Key: SPARK-6513
> URL: https://issues.apache.org/jira/browse/SPARK-6513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I 
> don't think it's relevant)
>Reporter: Eran Medan
>Priority: Minor
>
> I'm sure this has an Issue somewhere but I can't find it. 
> I see this is not a regression bug (since Ap, since it compiled in 1.2.1 but 
> stopped in 1.3 without any earlier deprecation warnings, but I am sure the 
> authors are well aware, so please change it to an enhancement request if you 
> disagree this is a regression. It's such an obvious and blunt regression that 
> I doubt it was done without a lot of thought and I'm sure there was a good 
> reason, but still it breaks my code and I don't have a workaround :)
> Here are the details / steps to reproduce
> *Worked in 1.2.1* (without any deprecation warnings)
> {code}
>  val sqlContext = new HiveContext(sc)
>  import sqlContext._
>  val jsonRDD = sqlContext.jsonFile(jsonFilePath)
>  jsonRDD.registerTempTable("jsonTable")
>  val jsonResult = sql(s"select * from jsonTable")
>  val foo = jsonResult.zipWithUniqueId().map {
>case (Row(...), uniqueId) => // do something useful
>...
>  }
>  foo.registerTempTable("...")
> {code}
> *Stopped working in 1.3.0* (simply does not compile, and all I did was change 
> to 1.3)
> {code}   
> jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method
> {code}
> **Not working workaround:**
> although this might give me an {{RDD\[Row\]}}:
> {code}
> jsonResult.map(identity).zipWithUniqueId()  
> {code}
> Now this won't work obviously since {{RDD\[Row\]}} does not have a 
> {{registerTempTable}} method of course
> {code}
>  foo.registerTempTable("...")
> {code}
> (see related SO question: 
> http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6513) Regression - missing zipWithUniqueId (and other RDD APIs) in RDDApi.scala

2015-03-24 Thread Eran Medan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eran Medan updated SPARK-6513:
--
Priority: Minor  (was: Blocker)

> Regression - missing zipWithUniqueId (and other RDD APIs) in RDDApi.scala
> -
>
> Key: SPARK-6513
> URL: https://issues.apache.org/jira/browse/SPARK-6513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I 
> don't think it's relevant)
>Reporter: Eran Medan
>Priority: Minor
>
> I'm sure this has an Issue somewhere but I can't find it. 
> I see this as a regression bug, since it compiled in 1.2.1 but stopped in 1.3 
> without any earlier deprecation warnings, but I am sure the authors are well 
> aware, so please change it to an enhancement request if you disagree this is 
> a regression. It's such an obvious and blunt regression that I doubt it was 
> done without a lot of thought and I'm sure there was a good reason, but still 
> it breaks my code and I don't have a workaround :)
> Here are the details / steps to reproduce
> *Worked in 1.2.1* (without any deprecation warnings)
> {code}
>  val sqlContext = new HiveContext(sc)
>  import sqlContext._
>  val jsonRDD = sqlContext.jsonFile(jsonFilePath)
>  jsonRDD.registerTempTable("jsonTable")
>  val jsonResult = sql(s"select * from jsonTable")
>  val foo = jsonResult.zipWithUniqueId().map {
>case (Row(...), uniqueId) => // do something useful
>...
>  }
>  foo.registerTempTable("...")
> {code}
> *Stopped working in 1.3.0* (simply does not compile, and all I did was change 
> to 1.3)
> {code}   
> jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method
> {code}
> **Not working workaround:**
> although this might give me an {{RDD\[Row\]}}:
> {code}
> jsonResult.map(identity).zipWithUniqueId()  
> {code}
> Now this won't work obviously since {{RDD\[Row\]}} does not have a 
> {{registerTempTable}} method of course
> {code}
>  foo.registerTempTable("...")
> {code}
> (see related SO question: 
> http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6385) ISO 8601 timestamp parsing does not support arbitrary precision second fractions

2015-03-24 Thread Nick Bruun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378801#comment-14378801
 ] 

Nick Bruun commented on SPARK-6385:
---

An extra {{S}} does not seem to do the trick, as the resulting date ({{res2}}) 
is incorrect ({{:16}} rather than {{:07}}). I've looked through a series of 
libraries, and all seem to be doing it in the same way ({{SSS}} and that's it), 
so I'm considering writing a proper parser instead. What is the position on 
having this level of complexity in Spark?
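
A small sketch of the behaviour described above (not from this ticket): with {{SimpleDateFormat}}, extra {{S}} letters are still parsed as a millisecond count, so {{.9000}} is read as 9000 ms and rolls the seconds forward from {{:07}} to {{:16}}.

{code}
import java.text.SimpleDateFormat
import java.util.TimeZone

val fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSS")
fmt.setTimeZone(TimeZone.getTimeZone("GMT"))

// "9000" is taken as 9000 milliseconds, not as the fraction 0.9000 seconds,
// so the parsed instant is 2015-02-02 00:00:16.000 GMT rather than 00:00:07.9.
val parsed = fmt.parse("2015-02-02T00:00:07.9000")
{code}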

> ISO 8601 timestamp parsing does not support arbitrary precision second 
> fractions
> 
>
> Key: SPARK-6385
> URL: https://issues.apache.org/jira/browse/SPARK-6385
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Nick Bruun
>Priority: Minor
>
> The ISO 8601 timestamp parsing implemented as a resolution to SPARK-4149 does 
> not support arbitrary precision fractions of seconds, only millisecond 
> precision. Parsing {{2015-02-02T00:00:07.900GMT-00:00}} will succeed, while 
> {{2015-02-02T00:00:07.9000GMT-00:00}} will fail.
> The issue is caused by the fixed precision of the parsed format in 
> [DataTypeConversions.scala#L66|https://github.com/apache/spark/blob/84acd08e0886aa23195f35837c15c09aa7804aff/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataTypeConversions.scala#L66].
>  I'm willing to implement a fix, but pointers on the direction would be 
> appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6481) Set "In Progress" when a PR is opened for an issue

2015-03-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6481:
---

Assignee: Apache Spark  (was: Nicholas Chammas)

> Set "In Progress" when a PR is opened for an issue
> --
>
> Key: SPARK-6481
> URL: https://issues.apache.org/jira/browse/SPARK-6481
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>
> [~pwendell] and I are not sure if this is possible, but it would be really 
> helpful if the JIRA status was updated to "In Progress" when we do the 
> linking to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6209) ExecutorClassLoader can leak connections after failing to load classes from the REPL class server

2015-03-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378788#comment-14378788
 ] 

Apache Spark commented on SPARK-6209:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/5174

> ExecutorClassLoader can leak connections after failing to load classes from 
> the REPL class server
> -
>
> Key: SPARK-6209
> URL: https://issues.apache.org/jira/browse/SPARK-6209
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.3, 1.1.2, 1.2.1, 1.3.0, 1.4.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.1, 1.4.0
>
>
> ExecutorClassLoader does not ensure proper cleanup of network connections 
> that it opens.  If it fails to load a class, it may leak partially-consumed 
> InputStreams that are connected to the REPL's HTTP class server, causing that 
> server to exhaust its thread pool, which can cause the entire job to hang.
> Here is a simple reproduction:
> With
> {code}
> ./bin/spark-shell --master local-cluster[8,8,512] 
> {code}
> run the following command:
> {code}
> sc.parallelize(1 to 1000, 1000).map { x =>
>   try {
>   Class.forName("some.class.that.does.not.Exist")
>   } catch {
>   case e: Exception => // do nothing
>   }
>   x
> }.count()
> {code}
> This job will run 253 tasks, then will completely freeze without any errors 
> or failed tasks.
> It looks like the driver has 253 threads blocked in socketRead0() calls:
> {code}
> [joshrosen ~]$ jstack 16765 | grep socketRead0 | wc
>  253 759   14674
> {code}
> e.g.
> {code}
> "qtp1287429402-13" daemon prio=5 tid=0x7f868a1c nid=0x5b03 runnable 
> [0x0001159bd000]
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:152)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
> at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
> at 
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
> at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1044)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280)
> at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
> at 
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> at 
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Thread.run(Thread.java:745) 
> {code}
> Jstack on the executors shows blocking in loadClass / findClass, where a 
> single thread is RUNNABLE and waiting to hear back from the driver and other 
> executor threads are BLOCKED on object monitor synchronization at 
> Class.forName0().
> Remotely triggering a GC on a hanging executor allows the job to progress and 
> complete more tasks before hanging again.  If I repeatedly trigger GC on all 
> of the executors, then the job runs to completion:
> {code}
> jps | grep CoarseGra | cut -d ' ' -f 1 | xargs -I {} -n 1 -P100 jcmd {} GC.run
> {code}
> The culprit is a {{catch}} block that ignores all exceptions and performs no 
> cleanup: 
> https://github.com/apache/spark/blob/v1.2.0/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L94
> This bug has been present since Spark 1.0.0, but I suspect that we haven't 
> seen it before because it's pretty hard to reproduce. Triggering this error 
> requires a job with tasks that trigger ClassNotFoundExceptions yet are still 
> able to run to completion.  It also requires that executors are able to leak 
> enough open connections to exhaust the class server's Jetty thread pool 
> limit, which requires that there are a large number of tasks (253+) and 
> either a large number of executors or a very low amount of GC pressure on 
> those executors (since GC will cause the leaked connections to be closed).
> The fix here is pretty simple: add proper resource cleanup to this class.
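
A hedged sketch of the cleanup pattern being described (not the actual pull request): whatever happens while reading the class bytes, the stream opened against the class server must be closed so its Jetty worker is released.

{code}
import java.io.InputStream
import org.apache.commons.io.IOUtils

// Read the class bytes and always release the underlying HTTP connection, even
// when the read fails; the leak came from skipping the close on error paths.
def fetchClassBytes(openStream: () => InputStream): Array[Byte] = {
  val in = openStream()
  try {
    IOUtils.toByteArray(in)
  } finally {
    in.close()
  }
}
{code}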



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6513) Regression - missing zipWithUniqueId (and other RDD APIs) in RDDApi.scala

2015-03-24 Thread Eran Medan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eran Medan updated SPARK-6513:
--
Summary: Regression - missing zipWithUniqueId (and other RDD APIs) in 
RDDApi.scala  (was: Regression - Adding zipWithUniqueId (and other missing RDD 
APIs) to RDDApi.scala)

> Regression - missing zipWithUniqueId (and other RDD APIs) in RDDApi.scala
> -
>
> Key: SPARK-6513
> URL: https://issues.apache.org/jira/browse/SPARK-6513
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I 
> don't think it's relevant)
>Reporter: Eran Medan
>Priority: Blocker
>
> I'm sure this has an Issue somewhere but I can't find it. 
> I see this as a regression bug, since it compiled in 1.2.1 but stopped in 1.3 
> without any earlier deprecation warnings, but I am sure the authors are well 
> aware, so please change it to an enhancement request if you disagree this is 
> a regression. It's such an obvious and blunt regression that I doubt it was 
> done without a lot of thought and I'm sure there was a good reason, but still 
> it breaks my code and I don't have a workaround :)
> Here are the details / steps to reproduce
> *Worked in 1.2.1* (without any deprecation warnings)
> {code}
>  val sqlContext = new HiveContext(sc)
>  import sqlContext._
>  val jsonRDD = sqlContext.jsonFile(jsonFilePath)
>  jsonRDD.registerTempTable("jsonTable")
>  val jsonResult = sql(s"select * from jsonTable")
>  val foo = jsonResult.zipWithUniqueId().map {
>case (Row(...), uniqueId) => // do something useful
>...
>  }
>  foo.registerTempTable("...")
> {code}
> *Stopped working in 1.3.0* (simply does not compile, and all I did was change 
> to 1.3)
> {code}   
> jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method
> {code}
> **Not working workaround:**
> although this might give me an {{RDD\[Row\]}}:
> {code}
> jsonResult.map(identity).zipWithUniqueId()  
> {code}
> Now this won't work obviously since {{RDD\[Row\]}} does not have a 
> {{registerTempTable}} method of course
> {code}
>  foo.registerTempTable("...")
> {code}
> (see related SO question: 
> http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6513) Regression - Adding zipWithUniqueId (and other missing RDD APIs) to RDDApi.scala

2015-03-24 Thread Eran Medan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eran Medan updated SPARK-6513:
--
Description: 
I'm sure this has an Issue somewhere but I can't find it. 

I see this as a regression bug, since it compiled in 1.2.1 but stopped in 1.3 
without any earlier deprecation warnings, but I am sure the authors are well 
aware, so please change it to an enhancement request if you disagree this is a 
regression. It's such an obvious and blunt regression that I doubt it was done 
without a lot of thought and I'm sure there was a good reason, but still it 
breaks my code and I don't have a workaround :)

Here are the details / steps to reproduce

*Worked in 1.2.1* (without any deprecation warnings)
{code}
 val sqlContext = new HiveContext(sc)
 import sqlContext._
 val jsonRDD = sqlContext.jsonFile(jsonFilePath)
 jsonRDD.registerTempTable("jsonTable")

 val jsonResult = sql(s"select * from jsonTable")
 val foo = jsonResult.zipWithUniqueId().map {
   case (Row(...), uniqueId) => // do something useful
   ...
 }

 foo.registerTempTable("...")

{code}

*Stopped working in 1.3.0* (simply does not compile, and all I did was change 
to 1.3)
{code}   
jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method
{code}

**Not working workaround:**

although this might give me an {{RDD\[Row\]}}:
{code}
jsonResult.map(identity).zipWithUniqueId()  
{code}

Now this won't work obviously since {{RDD\[Row\]}} does not have a 
{{registerTempTable}} method of course
{code}
 foo.registerTempTable("...")
{code}

(see related SO question: 
http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)



  was:
I'm sure this has an Issue somewhere but I can't find it. 

I see this as a regression bug, since it compiled in 1.2.1 but stopped in 1.3 
without any earlier deprecation warnings, but I am sure the authors are well 
aware, so please change it to an enhancement request if you disagree this is a 
regression. It's such an obvious and blunt regression that I doubt it was done 
without a lot of thought and I'm sure there was a good reason, but still it 
breaks my code and I don't have a workaround :)

Here are the details / steps to reproduce

**Worked in 1.2.1** (without any deprecation warnings)

 val sqlContext = new HiveContext(sc)
 import sqlContext._
 val jsonRDD = sqlContext.jsonFile(jsonFilePath)
 jsonRDD.registerTempTable("jsonTable")

 val jsonResult = sql(s"select * from jsonTable")
 val foo = jsonResult.zipWithUniqueId().map {
   case (Row(...), uniqueId) => // do something useful
   ...
 }

 foo.registerTempTable("...")

**Stopped working in 1.3.0** (simply does not compile, and all I did was change 
to 1.3)
   
jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method


**Not working workaround:**

although this might give me an RDD[Row]:

jsonResult.map(identity).zipWithUniqueId()  

now this won't work as `RDD[Row]` does not have a `registerTempTable` method of 
course

 foo.registerTempTable("...")


(see related SO question: 
http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)




> Regression - Adding zipWithUniqueId (and other missing RDD APIs) to 
> RDDApi.scala
> 
>
> Key: SPARK-6513
> URL: https://issues.apache.org/jira/browse/SPARK-6513
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I 
> don't think it's relevant)
>Reporter: Eran Medan
>Priority: Blocker
>
> I'm sure this has an Issue somewhere but I can't find it. 
> I see this as a regression bug, since it compiled in 1.2.1 but stopped in 1.3 
> without any earlier deprecation warnings, but I am sure the authors are well 
> aware, so please change it to an enhancement request if you disagree this is 
> a regression. It's such an obvious and blunt regression that I doubt it was 
> done without a lot of thought and I'm sure there was a good reason, but still 
> it breaks my code and I don't have a workaround :)
> Here are the details / steps to reproduce
> *Worked in 1.2.1* (without any deprecation warnings)
> {code}
>  val sqlContext = new HiveContext(sc)
>  import sqlContext._
>  val jsonRDD = sqlContext.jsonFile(jsonFilePath)
>  jsonRDD.registerTempTable("jsonTable")
>  val jsonResult = sql(s"select * from jsonTable")
>  val foo = jsonResult.zipWithUniqueId().map {
>case (Row(...), uniqueId) => // do something useful
>...
>  }
>  foo.registerTempTable("...")
> {code}
> *Stopped working in 1.3.0* (simply does not compile, and all I did was chan

[jira] [Updated] (SPARK-6513) Regression - Adding zipWithUniqueId (and other missing RDD APIs) to RDDApi.scala

2015-03-24 Thread Eran Medan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eran Medan updated SPARK-6513:
--
Priority: Blocker  (was: Major)

> Regression - Adding zipWithUniqueId (and other missing RDD APIs) to 
> RDDApi.scala
> 
>
> Key: SPARK-6513
> URL: https://issues.apache.org/jira/browse/SPARK-6513
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I 
> don't think it's relevant)
>Reporter: Eran Medan
>Priority: Blocker
>
> I'm sure this has an Issue somewhere but I can't find it. 
> I see this as a regression bug, since it compiled in 1.2.1 but stopped in 1.3 
> without any earlier deprecation warnings, but I am sure the authors are well 
> aware, so please change it to an enhancement request if you disagree this is 
> a regression. It's such an obvious and blunt regression that I doubt it was 
> done without a lot of thought and I'm sure there was a good reason, but still 
> it breaks my code and I don't have a workaround :)
> Here are the details / steps to reproduce
> **Worked in 1.2.1** (without any deprecation warnings)
>  val sqlContext = new HiveContext(sc)
>  import sqlContext._
>  val jsonRDD = sqlContext.jsonFile(jsonFilePath)
>  jsonRDD.registerTempTable("jsonTable")
>  val jsonResult = sql(s"select * from jsonTable")
>  val foo = jsonResult.zipWithUniqueId().map {
>case (Row(...), uniqueId) => // do something useful
>...
>  }
>  foo.registerTempTable("...")
> **Stopped working in 1.3.0** (simply does not compile, and all I did was 
> change to 1.3)
>
> jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method
> **Not working workaround:**
> although this might give me an RDD[Row]:
> jsonResult.map(identity).zipWithUniqueId()  
> now this won't work as `RDD[Row]` does not have a `registerTempTable` method 
> of course
>  foo.registerTempTable("...")
> (see related SO question: 
> http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6079) Use index to speed up StatusTracker.getJobIdsForGroup()

2015-03-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6079:
-
Affects Version/s: 1.3.0

> Use index to speed up StatusTracker.getJobIdsForGroup()
> ---
>
> Key: SPARK-6079
> URL: https://issues.apache.org/jira/browse/SPARK-6079
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
> Fix For: 1.4.0
>
>
> {{StatusTracker.getJobIdsForGroup()}} is implemented via a linear scan over a 
> HashMap rather than using an index.  This might be an expensive operation if 
> there are many (e.g., thousands of) retained jobs.  We can add a new index to 
> speed this up.
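> A minimal sketch of such an index (illustrative names only, not the actual 
> JobProgressListener fields): keep a map from job group to its job IDs so the 
> lookup no longer scans every retained job.
> {code}
> import scala.collection.mutable
>
> class JobGroupIndex {
>   private val jobGroupToJobIds = new mutable.HashMap[String, mutable.HashSet[Int]]
>
>   // called when a job is submitted with its (optional) job group
>   def onJobStart(jobGroup: Option[String], jobId: Int): Unit = synchronized {
>     val group = jobGroup.getOrElse("")  // treat "no group" as its own bucket
>     jobGroupToJobIds.getOrElseUpdate(group, new mutable.HashSet[Int]) += jobId
>   }
>
>   // getJobIdsForGroup becomes a map lookup instead of a linear scan
>   def getJobIdsForGroup(jobGroup: String): Array[Int] = synchronized {
>     jobGroupToJobIds.get(jobGroup).map(_.toArray).getOrElse(Array.empty[Int])
>   }
> }
> {code}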



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6513) Regression - Adding zipWithUniqueId (and other missing RDD APIs) to RDDApi.scala

2015-03-24 Thread Eran Medan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eran Medan updated SPARK-6513:
--
Summary: Regression - Adding zipWithUniqueId (and other missing RDD APIs) 
to RDDApi.scala  (was: Regression Adding zipWithUniqueId (and other missing RDD 
APIs) to RDDApi.scala)

> Regression - Adding zipWithUniqueId (and other missing RDD APIs) to 
> RDDApi.scala
> 
>
> Key: SPARK-6513
> URL: https://issues.apache.org/jira/browse/SPARK-6513
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I 
> don't think it's relevant)
>Reporter: Eran Medan
>
> I'm sure this has an Issue somewhere but I can't find it. 
> I see this as a regression bug, since it compiled in 1.2.1 but stopped in 1.3 
> without any earlier deprecation warnings, but I am sure the authors are well 
> aware, so please change it to an enhancement request if you disagree this is 
> a regression. It's such an obvious and blunt regression that I doubt it was 
> done without a lot of thought and I'm sure there was a good reason, but still 
> it breaks my code and I don't have a workaround :)
> Here are the details / steps to reproduce
> **Worked in 1.2.1** (without any deprecation warnings)
>  val sqlContext = new HiveContext(sc)
>  import sqlContext._
>  val jsonRDD = sqlContext.jsonFile(jsonFilePath)
>  jsonRDD.registerTempTable("jsonTable")
>  val jsonResult = sql(s"select * from jsonTable")
>  val foo = jsonResult.zipWithUniqueId().map {
>case (Row(...), uniqueId) => // do something useful
>...
>  }
>  foo.registerTempTable("...")
> **Stopped working in 1.3.0** (simply does not compile, and all I did was 
> change to 1.3)
>
> jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method
> **Not working workaround:**
> although this might give me an RDD[Row]:
> jsonResult.map(identity).zipWithUniqueId()  
> now this won't work as `RDD[Row]` does not have a `registerTempTable` method 
> of course
>  foo.registerTempTable("...")
> (see related SO question: 
> http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6513) Regression Adding zipWithUniqueId (and other missing RDD APIs) to RDDApi.scala

2015-03-24 Thread Eran Medan (JIRA)
Eran Medan created SPARK-6513:
-

 Summary: Regression Adding zipWithUniqueId (and other missing RDD 
APIs) to RDDApi.scala
 Key: SPARK-6513
 URL: https://issues.apache.org/jira/browse/SPARK-6513
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: Windows 7 64bit, Scala 2.11.6, JDK 1.7.0_21 (though I 
don't think it's relevant)
Reporter: Eran Medan


I'm sure this has an Issue somewhere but I can't find it. 

I see this as a regression bug, since it compiled in 1.2.1 but stopped in 1.3 
without any earlier deprecation warnings, but I am sure the authors are well 
aware, so please change it to an enhancement request if you disagree this is a 
regression. It's such an obvious and blunt regression that I doubt it was done 
without a lot of thought and I'm sure there was a good reason, but still it 
breaks my code and I don't have a workaround :)

Here are the details / steps to reproduce

**Worked in 1.2.1** (without any deprecation warnings)

 val sqlContext = new HiveContext(sc)
 import sqlContext._
 val jsonRDD = sqlContext.jsonFile(jsonFilePath)
 jsonRDD.registerTempTable("jsonTable")

 val jsonResult = sql(s"select * from jsonTable")
 val foo = jsonResult.zipWithUniqueId().map {
   case (Row(...), uniqueId) => // do something useful
   ...
 }

 foo.registerTempTable("...")

**Stopped working in 1.3.0** (simply does not compile, and all I did was change 
to 1.3)
   
jsonResult.zipWithUniqueId() //since RDDApi doesn't implement that method


**Not working workaround:**

although this might give me an RDD[Row]:

jsonResult.map(identity).zipWithUniqueId()  

now this won't work as `RDD[Row]` does not have a `registerTempTable` method of 
course

 foo.registerTempTable("...")


(see related SO question: 
http://stackoverflow.com/questions/29243186/is-this-a-regression-bug-in-spark-1-3)





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6209) ExecutorClassLoader can leak connections after failing to load classes from the REPL class server

2015-03-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6209:
-
Fix Version/s: 1.4.0
   1.3.1

> ExecutorClassLoader can leak connections after failing to load classes from 
> the REPL class server
> -
>
> Key: SPARK-6209
> URL: https://issues.apache.org/jira/browse/SPARK-6209
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.3, 1.1.2, 1.2.1, 1.3.0, 1.4.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.1, 1.4.0
>
>
> ExecutorClassLoader does not ensure proper cleanup of network connections 
> that it opens.  If it fails to load a class, it may leak partially-consumed 
> InputStreams that are connected to the REPL's HTTP class server, causing that 
> server to exhaust its thread pool, which can cause the entire job to hang.
> Here is a simple reproduction:
> With
> {code}
> ./bin/spark-shell --master local-cluster[8,8,512] 
> {code}
> run the following command:
> {code}
> sc.parallelize(1 to 1000, 1000).map { x =>
>   try {
>   Class.forName("some.class.that.does.not.Exist")
>   } catch {
>   case e: Exception => // do nothing
>   }
>   x
> }.count()
> {code}
> This job will run 253 tasks, then will completely freeze without any errors 
> or failed tasks.
> It looks like the driver has 253 threads blocked in socketRead0() calls:
> {code}
> [joshrosen ~]$ jstack 16765 | grep socketRead0 | wc
>  253 759   14674
> {code}
> e.g.
> {code}
> "qtp1287429402-13" daemon prio=5 tid=0x7f868a1c nid=0x5b03 runnable 
> [0x0001159bd000]
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:152)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
> at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
> at 
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
> at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1044)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280)
> at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
> at 
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> at 
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Thread.run(Thread.java:745) 
> {code}
> Jstack on the executors shows blocking in loadClass / findClass, where a 
> single thread is RUNNABLE and waiting to hear back from the driver and other 
> executor threads are BLOCKED on object monitor synchronization at 
> Class.forName0().
> Remotely triggering a GC on a hanging executor allows the job to progress and 
> complete more tasks before hanging again.  If I repeatedly trigger GC on all 
> of the executors, then the job runs to completion:
> {code}
> jps | grep CoarseGra | cut -d ' ' -f 1 | xargs -I {} -n 1 -P100 jcmd {} GC.run
> {code}
> The culprit is a {{catch}} block that ignores all exceptions and performs no 
> cleanup: 
> https://github.com/apache/spark/blob/v1.2.0/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L94
> This bug has been present since Spark 1.0.0, but I suspect that we haven't 
> seen it before because it's pretty hard to reproduce. Triggering this error 
> requires a job with tasks that trigger ClassNotFoundExceptions yet are still 
> able to run to completion.  It also requires that executors are able to leak 
> enough open connections to exhaust the class server's Jetty thread pool 
> limit, which requires that there are a large number of tasks (253+) and 
> either a large number of executors or a very low amount of GC pressure on 
> those executors (since GC will cause the leaked connections to be closed).
> The fix here is pretty simple: add proper resource cleanup to this class.
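> A minimal sketch of that kind of cleanup (simplified and illustrative; the real 
> ExecutorClassLoader fetches class bytes from the REPL class server over HTTP): 
> always close the stream, even when reading or defining the class fails.
> {code}
> import java.io.{ByteArrayOutputStream, InputStream}
>
> def readAndClose(open: () => InputStream): Array[Byte] = {
>   val in = open()
>   try {
>     val out = new ByteArrayOutputStream()
>     val buf = new Array[Byte](4096)
>     var n = in.read(buf)
>     while (n != -1) {
>       out.write(buf, 0, n)
>       n = in.read(buf)
>     }
>     out.toByteArray
>   } finally {
>     in.close()  // without this, a failed load leaks the connection to the class server
>   }
> }
> {code}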



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6385) ISO 8601 timestamp parsing does not support arbitrary precision second fractions

2015-03-24 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378689#comment-14378689
 ] 

Michael Armbrust commented on SPARK-6385:
-

[~bruun] I think all you need to do is add another S to the format spec.

{code}
scala>   val ISO8601GMT: SimpleDateFormat = new SimpleDateFormat( 
"yyyy-MM-dd'T'HH:mm:ss.SSSz" )
ISO8601GMT: java.text.SimpleDateFormat = java.text.SimpleDateFormat@8a9df61b

scala> ISO8601GMT.parse("2015-02-02T00:00:07.900GMT-00:00")
res0: java.util.Date = Sun Feb 01 16:00:07 PST 2015

scala> ISO8601GMT.parse("2015-02-02T00:00:07.9000GMT-00:00")
java.text.ParseException: Unparseable date: "2015-02-02T00:00:07.9000GMT-00:00"
at java.text.DateFormat.parse(DateFormat.java:357)
at .<init>(<console>:10)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983)
at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568)
at scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:756)
at 
scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:801)
at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:713)
at scala.tools.nsc.interpreter.ILoop.processLine$1(ILoop.scala:577)
at scala.tools.nsc.interpreter.ILoop.innerLoop$1(ILoop.scala:584)
at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:587)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:878)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:833)
at 
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:833)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:833)
at 
scala.tools.nsc.MainGenericRunner.runTarget$1(MainGenericRunner.scala:83)
at scala.tools.nsc.MainGenericRunner.process(MainGenericRunner.scala:96)
at scala.tools.nsc.MainGenericRunner$.main(MainGenericRunner.scala:105)
at scala.tools.nsc.MainGenericRunner.main(MainGenericRunner.scala)


scala>   val ISO8601GMT: SimpleDateFormat = new SimpleDateFormat( 
"yyyy-MM-dd'T'HH:mm:ss.SSSSz" )
ISO8601GMT: java.text.SimpleDateFormat = java.text.SimpleDateFormat@c920c906

scala> ISO8601GMT.parse("2015-02-02T00:00:07.9000GMT-00:00")
res2: java.util.Date = Sun Feb 01 16:00:16 PST 2015

scala> ISO8601GMT.parse("2015-02-02T00:00:07.900GMT-00:00")
res3: java.util.Date = Sun Feb 01 16:00:07 PST 2015
{code}
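
Note that even with the extra S the digits are still read as a raw millisecond count rather than a decimal fraction (res2 above lands on 16:00:16 because ".9000" is parsed as 9000 ms), so another direction worth considering is normalizing the fraction to exactly three digits before parsing. A rough sketch of that idea (illustrative only, not a committed fix):
{code}
import java.text.SimpleDateFormat

val millisFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSz")
val fraction = """\.(\d+)""".r

// ".9" -> ".900", ".9000" -> ".900", ".987654" -> ".987"
def normalizeFraction(ts: String): String =
  fraction.replaceAllIn(ts, m => "." + (m.group(1) + "00").take(3))

// millisFormat.parse(normalizeFraction("2015-02-02T00:00:07.9000GMT-00:00"))
{code}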

> ISO 8601 timestamp parsing does not support arbitrary precision second 
> fractions
> 
>
> Key: SPARK-6385
> URL: https://issues.apache.org/jira/browse/SPARK-6385
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Nick Bruun
>Priority: Minor
>
> The ISO 8601 timestamp parsing implemented as a resolution to SPARK-4149 does 
> not support arbitrary precision fractions of seconds, only millisecond 
> precision. Parsing {{2015-02-02T00:00:07.900GMT-00:00}} will succeed, while 
> {{2015-02-02T00:00:07.9000GMT-00:00}} will fail.
> The issue is caused by the fixed precision of the parsed format in 
> [DataTypeConversions.scala#L66|https://github.com/apache/spark/blob/84acd08e0886aa23195f35837c15c09aa7804aff/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataTypeConversions.scala#L66].
>  I'm willing to implement a fix, but pointers on the direction would be 
> appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6385) ISO 8601 timestamp parsing does not support arbitrary precision second fractions

2015-03-24 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378690#comment-14378690
 ] 

Michael Armbrust commented on SPARK-6385:
-

A PR would be great.  Let me know if you have questions.

> ISO 8601 timestamp parsing does not support arbitrary precision second 
> fractions
> 
>
> Key: SPARK-6385
> URL: https://issues.apache.org/jira/browse/SPARK-6385
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Nick Bruun
>Priority: Minor
>
> The ISO 8601 timestamp parsing implemented as a resolution to SPARK-4149 does 
> not support arbitrary precision fractions of seconds, only millisecond 
> precision. Parsing {{2015-02-02T00:00:07.900GMT-00:00}} will succeed, while 
> {{2015-02-02T00:00:07.9000GMT-00:00}} will fail.
> The issue is caused by the fixed precision of the parsed format in 
> [DataTypeConversions.scala#L66|https://github.com/apache/spark/blob/84acd08e0886aa23195f35837c15c09aa7804aff/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataTypeConversions.scala#L66].
>  I'm willing to implement a fix, but pointers on the direction would be 
> appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6380) Resolution of equi-join key in post-join projection

2015-03-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6380:

Target Version/s: 1.4.0

> Resolution of equi-join key in post-join projection
> ---
>
> Key: SPARK-6380
> URL: https://issues.apache.org/jira/browse/SPARK-6380
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> {code}
> df1.join(df2, df1("key") === df2("key")).select("key")
> {code}
> It would be great to just resolve key to df1("key") in the case of inner 
> joins.
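> In the meantime, a workaround sketch (selecting the column through one side 
> explicitly, which already disambiguates it):
> {code}
> df1.join(df2, df1("key") === df2("key")).select(df1("key"))
> {code}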



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


