[jira] [Resolved] (SPARK-12655) GraphX does not unpersist RDDs

2016-01-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12655.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10713
[https://github.com/apache/spark/pull/10713]

> GraphX does not unpersist RDDs
> --
>
> Key: SPARK-12655
> URL: https://issues.apache.org/jira/browse/SPARK-12655
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Alexander Pivovarov
>Priority: Minor
> Fix For: 2.0.0
>
>
> Looks like Graph does not clean all RDDs from the cache on unpersist
> {code}
> // open spark-shell 1.5.2 or 1.6.0
> // run
> import org.apache.spark.graphx._
> val vert = sc.parallelize(List((1L, 1), (2L, 2), (3L, 3)), 1)
> val edges = sc.parallelize(List(Edge[Long](1L, 2L), Edge[Long](1L, 3L)), 1)
> val g0 = Graph(vert, edges)
> val g = g0.partitionBy(PartitionStrategy.EdgePartition2D, 2)
> val cc = g.connectedComponents()
> cc.unpersist()
> g.unpersist()
> g0.unpersist()
> vert.unpersist()
> edges.unpersist()
> {code}
> open http://localhost:4040/storage/
> Spark UI 4040 Storage page still shows 2 items
> {code}
> VertexRDD   Memory Deserialized 1x Replicated   1   100%   1688.0 B   0.0 B   0.0 B
> EdgeRDD     Memory Deserialized 1x Replicated   2   100%   4.7 KB     0.0 B   0.0 B
> {code}
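
As an aside, a minimal workaround sketch (mine, not from the ticket): explicitly unpersisting the RDDs exposed through the public {{vertices}} and {{edges}} members of each {{Graph}} may clear the leftover entries shown above, though it does not address the underlying cleanup bug:
{code}
// hedged workaround sketch, using the vals defined in the reproduction above
cc.vertices.unpersist(blocking = false)
cc.edges.unpersist(blocking = false)
g.vertices.unpersist(blocking = false)
g.edges.unpersist(blocking = false)
g0.vertices.unpersist(blocking = false)
g0.edges.unpersist(blocking = false)
{code}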






[jira] [Updated] (SPARK-12655) GraphX does not unpersist RDDs

2016-01-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12655:
--
Assignee: Jason C Lee

> GraphX does not unpersist RDDs
> --
>
> Key: SPARK-12655
> URL: https://issues.apache.org/jira/browse/SPARK-12655
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Alexander Pivovarov
>Assignee: Jason C Lee
>Priority: Minor
> Fix For: 2.0.0
>
>
> Looks like Graph does not clean all RDDs from the cache on unpersist
> {code}
> // open spark-shell 1.5.2 or 1.6.0
> // run
> import org.apache.spark.graphx._
> val vert = sc.parallelize(List((1L, 1), (2L, 2), (3L, 3)), 1)
> val edges = sc.parallelize(List(Edge[Long](1L, 2L), Edge[Long](1L, 3L)), 1)
> val g0 = Graph(vert, edges)
> val g = g0.partitionBy(PartitionStrategy.EdgePartition2D, 2)
> val cc = g.connectedComponents()
> cc.unpersist()
> g.unpersist()
> g0.unpersist()
> vert.unpersist()
> edges.unpersist()
> {code}
> open http://localhost:4040/storage/
> Spark UI 4040 Storage page still shows 2 items
> {code}
> VertexRDD   Memory Deserialized 1x Replicated   1   100%   1688.0 B   0.0 B   0.0 B
> EdgeRDD     Memory Deserialized 1x Replicated   2   100%   4.7 KB     0.0 B   0.0 B
> {code}






[jira] [Commented] (SPARK-12675) Executor dies because of ClassCastException and causes timeout

2016-01-15 Thread Himanshu Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101567#comment-15101567
 ] 

Himanshu Gupta commented on SPARK-12675:


This issue arises in Spark 1.5.2 as well. Specifically, we got this error when 
we tried to cache a DataFrame in Spark 1.5.2.

> Executor dies because of ClassCastException and causes timeout
> --
>
> Key: SPARK-12675
> URL: https://issues.apache.org/jira/browse/SPARK-12675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0
> Environment: 64-bit Linux Ubuntu 15.10, 16GB RAM, 8 cores 3ghz
>Reporter: Alexandru Rosianu
>Priority: Minor
>
> I'm trying to fit a Spark ML pipeline but my executor dies. Here's the script 
> which doesn't work (a bit simplified):
> {code:title=Script.scala}
> // Prepare data sets
> logInfo("Getting datasets")
> val emoTrainingData = 
> sqlc.read.parquet("/tw/sentiment/emo/parsed/data.parquet")
> val trainingData = emoTrainingData
> // Configure the pipeline
> val pipeline = new Pipeline().setStages(Array(
>   new 
> FeatureReducer().setInputCol("raw_text").setOutputCol("reduced_text"),
>   new StringSanitizer().setInputCol("reduced_text").setOutputCol("text"),
>   new Tokenizer().setInputCol("text").setOutputCol("raw_words"),
>   new StopWordsRemover().setInputCol("raw_words").setOutputCol("words"),
>   new HashingTF().setInputCol("words").setOutputCol("features"),
>   new NaiveBayes().setSmoothing(0.5).setFeaturesCol("features"),
>   new ColumnDropper().setDropColumns("raw_text", "reduced_text", "text", 
> "raw_words", "words", "features")
> ))
> // Fit the pipeline
> logInfo(s"Training model on ${trainingData.count()} rows")
> val model = pipeline.fit(trainingData)
> {code}
> It executes up to the last line. It prints "Training model on xx rows", then 
> it starts fitting, the executor dies, the driver doesn't receive heartbeats 
> from the executor and times out, and then the script exits. It doesn't get 
> past that line.
> This is the exception that kills the executor:
> {code}
> java.io.IOException: java.lang.ClassCastException: cannot assign instance 
> of scala.collection.immutable.HashMap$SerializationProxy to field 
> org.apache.spark.executor.TaskMetrics._accumulatorUpdates of type 
> scala.collection.immutable.Map in instance of 
> org.apache.spark.executor.TaskMetrics
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1207)
>   at 
> org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
>   at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at org.apache.spark.util.Utils$.deserialize(Utils.scala:92)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:436)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:426)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:426)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:424)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:424)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
>   at 

[jira] [Commented] (SPARK-12739) Details of batch in Streaming tab uses two Duration columns

2016-01-15 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101644#comment-15101644
 ] 

Jacek Laskowski commented on SPARK-12739:
-

Ok, I'll work on it. Thanks.

> Details of batch in Streaming tab uses two Duration columns
> ---
>
> Key: SPARK-12739
> URL: https://issues.apache.org/jira/browse/SPARK-12739
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Minor
> Attachments: SPARK-12739.png
>
>
> "Details of batch" screen in Streaming tab in web UI uses two Duration 
> columns. I think one should be "Processing Time" while the other "Job 
> Duration".
> See the attachment.






[jira] [Resolved] (SPARK-2930) clarify docs on using webhdfs with spark.yarn.access.namenodes

2016-01-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2930.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10699
[https://github.com/apache/spark/pull/10699]

> clarify docs on using webhdfs with spark.yarn.access.namenodes
> --
>
> Key: SPARK-2930
> URL: https://issues.apache.org/jira/browse/SPARK-2930
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, YARN
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Minor
> Fix For: 2.0.0
>
>
> The documentation of spark.yarn.access.namenodes talks about putting 
> namenodes in it and gives an example with hdfs://.  
> It can also be used with webhdfs, so we should clarify how to use it.
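
For illustration, a hedged sketch of the kind of example the docs could add (the host names and the webhdfs port below are placeholders/assumptions, not from this ticket):
{code}
--conf spark.yarn.access.namenodes=hdfs://nn1.example.com:8020,webhdfs://nn2.example.com:50070
{code}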






[jira] [Assigned] (SPARK-12836) spark enable both driver run executor & write to HDFS

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12836:


Assignee: Apache Spark

> spark enable both driver run executor & write to HDFS
> -
>
> Key: SPARK-12836
> URL: https://issues.apache.org/jira/browse/SPARK-12836
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Scheduler, Spark Core
>Affects Versions: 1.6.0
> Environment: HADOOP_USER_NAME=qhstats
> SPARK_USER=root
>Reporter: astralidea
>Assignee: Apache Spark
>  Labels: features
>
> When the env var HADOOP_USER_NAME is set, CoarseMesosSchedulerBackend sets 
> the Spark user from that variable. But in my cluster Spark must run as root, 
> while writing to HDFS requires HADOOP_USER_NAME, so we need a configuration 
> that runs the executor as root and writes to HDFS as another user.






[jira] [Assigned] (SPARK-12836) spark enable both driver run executor & write to HDFS

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12836:


Assignee: (was: Apache Spark)

> spark enable both driver run executor & write to HDFS
> -
>
> Key: SPARK-12836
> URL: https://issues.apache.org/jira/browse/SPARK-12836
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Scheduler, Spark Core
>Affects Versions: 1.6.0
> Environment: HADOOP_USER_NAME=qhstats
> SPARK_USER=root
>Reporter: astralidea
>  Labels: features
>
> When the env var HADOOP_USER_NAME is set, CoarseMesosSchedulerBackend sets 
> the Spark user from that variable. But in my cluster Spark must run as root, 
> while writing to HDFS requires HADOOP_USER_NAME, so we need a configuration 
> that runs the executor as root and writes to HDFS as another user.






[jira] [Commented] (SPARK-12836) spark enable both driver run executor & write to HDFS

2016-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101630#comment-15101630
 ] 

Apache Spark commented on SPARK-12836:
--

User 'Astralidea' has created a pull request for this issue:
https://github.com/apache/spark/pull/10770

> spark enable both driver run executor & write to HDFS
> -
>
> Key: SPARK-12836
> URL: https://issues.apache.org/jira/browse/SPARK-12836
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Scheduler, Spark Core
>Affects Versions: 1.6.0
> Environment: HADOOP_USER_NAME=qhstats
> SPARK_USER=root
>Reporter: astralidea
>  Labels: features
>
> When the env var HADOOP_USER_NAME is set, CoarseMesosSchedulerBackend sets 
> the Spark user from that variable. But in my cluster Spark must run as root, 
> while writing to HDFS requires HADOOP_USER_NAME, so we need a configuration 
> that runs the executor as root and writes to HDFS as another user.






[jira] [Updated] (SPARK-2930) clarify docs on using webhdfs with spark.yarn.access.namenodes

2016-01-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2930:
-
Priority: Trivial  (was: Minor)

> clarify docs on using webhdfs with spark.yarn.access.namenodes
> --
>
> Key: SPARK-2930
> URL: https://issues.apache.org/jira/browse/SPARK-2930
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, YARN
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Trivial
> Fix For: 2.0.0
>
>
> The documentation of spark.yarn.access.namenodes talks about putting 
> namenodes in it and gives an example with hdfs://.  
> It can also be used with webhdfs, so we should clarify how to use it.






[jira] [Created] (SPARK-12836) spark enable both driver run executor & write to HDFS

2016-01-15 Thread astralidea (JIRA)
astralidea created SPARK-12836:
--

 Summary: spark enable both driver run executor & write to HDFS
 Key: SPARK-12836
 URL: https://issues.apache.org/jira/browse/SPARK-12836
 Project: Spark
  Issue Type: Bug
  Components: Mesos, Scheduler, Spark Core
Affects Versions: 1.6.0
 Environment: HADOOP_USER_NAME=qhstats
SPARK_USER=root
Reporter: astralidea


When the env var HADOOP_USER_NAME is set, CoarseMesosSchedulerBackend sets the 
Spark user from that variable. But in my cluster Spark must run as root, while 
writing to HDFS requires HADOOP_USER_NAME, so we need a configuration that runs 
the executor as root and writes to HDFS as another user.
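
To make the conflict concrete, a minimal sketch (an illustration of the problem, not a fix; values taken from the Environment field above):
{code}
# HDFS writes need this user ...
export HADOOP_USER_NAME=qhstats
# ... but the executors themselves must run as root
export SPARK_USER=root
# CoarseMesosSchedulerBackend currently derives the Spark user from
# HADOOP_USER_NAME, so both requirements cannot be satisfied at once
# without a separate configuration knob.
{code}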






[jira] [Assigned] (SPARK-7683) Confusing behavior of fold function of RDD in pyspark

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7683:
---

Assignee: (was: Apache Spark)

> Confusing behavior of fold function of RDD in pyspark
> -
>
> Key: SPARK-7683
> URL: https://issues.apache.org/jira/browse/SPARK-7683
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 1.3.1
>Reporter: Ai He
>Priority: Minor
>  Labels: releasenotes
>
> This will make the “fold” function consistent with the "fold" in rdd.scala 
> and other "aggregate" functions where “acc” goes first. Otherwise, users have 
> to write a lambda function like “lambda x, y: op(y, x)” if they want to use 
> “zeroValue” to change the result type.
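
For illustration only, a hedged sketch (mine, not from the ticket; it assumes the current call order is op(element, acc), which is what the workaround quoted above implies):
{code}
rdd = sc.parallelize([1, 2, 3, 4], 2)

def op(acc, elem):
    # combine function written Scala-style, with the accumulator first
    return acc + elem

# with today's PySpark argument order the operands must be flipped:
total = rdd.fold(0, lambda x, y: op(y, x))
print(total)  # 10
{code}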






[jira] [Commented] (SPARK-7683) Confusing behavior of fold function of RDD in pyspark

2016-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101728#comment-15101728
 ] 

Apache Spark commented on SPARK-7683:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10771

> Confusing behavior of fold function of RDD in pyspark
> -
>
> Key: SPARK-7683
> URL: https://issues.apache.org/jira/browse/SPARK-7683
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 1.3.1
>Reporter: Ai He
>Priority: Minor
>  Labels: releasenotes
>
> This will make the “fold” function consistent with the "fold" in rdd.scala 
> and other "aggregate" functions where “acc” goes first. Otherwise, users have 
> to write a lambda function like “lambda x, y: op(y, x)” if they want to use 
> “zeroValue” to change the result type.






[jira] [Assigned] (SPARK-7683) Confusing behavior of fold function of RDD in pyspark

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7683:
---

Assignee: Apache Spark

> Confusing behavior of fold function of RDD in pyspark
> -
>
> Key: SPARK-7683
> URL: https://issues.apache.org/jira/browse/SPARK-7683
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 1.3.1
>Reporter: Ai He
>Assignee: Apache Spark
>Priority: Minor
>  Labels: releasenotes
>
> This will make the “fold” function consistent with the "fold" in rdd.scala 
> and other "aggregate" functions where “acc” goes first. Otherwise, users have 
> to write a lambda function like “lambda x, y: op(y, x)” if they want to use 
> “zeroValue” to change the result type.






[jira] [Commented] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2016-01-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101742#comment-15101742
 ] 

Krzysztof Gawryś commented on SPARK-10528:
--

I have the same problem using Spark 1.5.2 on Windows 7 x64, and none of the 
above fixes helped. The permissions on the /tmp/hive folder are OK, but they get 
changed during code execution.

In my case the problem is in hadoop-common-2.6.0.jar, in the loadPermissionInfo() 
method of the org.apache.hadoop.fs.RawLocalFileSystem class.
It tries to execute the command F:\spark\bin\winutils.exe ls -F D:\tmp\hive in a 
shell, and this command returns "Incorrect command line arguments." This results 
in an exception that is caught in loadPermissionInfo(), and the permissions are 
reset to the defaults because of that (line 609 of RawLocalFileSystem.java):
{code}
if (ioe.getExitCode() != 1) {
  e = ioe;
} else {
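  // exit code 1 is treated as "path not found", so the reported permissions
  // silently fall back to the defaults (which matches the behaviour described above)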
  setPermission(null);
  setOwner(null);
  setGroup(null);
}
{code}

> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-






[jira] [Updated] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2016-01-15 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-12837:
-
Description: 
Executing a SQL statement with a large number of partitions requires a large 
amount of driver memory, even when there is no request to collect data back to 
the driver.

Here are the steps to reproduce the issue.
1. Start spark shell with a spark.driver.maxResultSize setting
{code:java}
bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
{code}
2. Execute the code 
{code:java}
case class Toto( a: Int, b: Int)
val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF

sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK

sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
 "toto2" ) // ERROR
{code}

The error message is 
{code:java}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
spark.driver.maxResultSize (1024.0 KB)
{code}


  was:
Executing a sql statement with a large number of partitions requires a high 
memory space for the driver even there are no requests to collect data back to 
the driver.

Here are steps to re-produce the issue.
1. Start spark shell with a spark.driver.maxResultSize setting
{code:shell}
bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
{code}
2. Execute the code 
{code:scala}
case class Toto( a: Int, b: Int)
val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF

sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK

sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
 "toto2" ) // ERROR
{code}

The error message is 
{code:scala}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
spark.driver.maxResultSize (1024.0 KB)
{code}



> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>
> Executing a SQL statement with a large number of partitions requires a large 
> amount of driver memory, even when there is no request to collect data back 
> to the driver.
> Here are the steps to reproduce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}






[jira] [Created] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2016-01-15 Thread Tien-Dung LE (JIRA)
Tien-Dung LE created SPARK-12837:


 Summary: Spark driver requires large memory space for serialized 
results even there are no data collected to the driver
 Key: SPARK-12837
 URL: https://issues.apache.org/jira/browse/SPARK-12837
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 1.6.0, 1.5.2
Reporter: Tien-Dung LE


Executing a SQL statement with a large number of partitions requires a large 
amount of driver memory, even when there is no request to collect data back to 
the driver.

Here are the steps to reproduce the issue.
1. Start spark shell with a spark.driver.maxResultSize setting
{code:shell}
bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
{code}
2. Execute the code 
{code:scala}
case class Toto( a: Int, b: Int)
val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF

sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK

sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
 "toto2" ) // ERROR
{code}

The error message is 
{code:scala}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
spark.driver.maxResultSize (1024.0 KB)
{code}
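
Not a fix, but for completeness a hedged note: the abort itself can be avoided by raising the limit that this reproduction deliberately lowers (0 means unlimited), although that does not explain why the serialized results grow with the number of partitions:
{code}
bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=0
{code}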







[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-15 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101805#comment-15101805
 ] 

Sun Rui commented on SPARK-6817:


Spark now supports vectorized execution via columnar batches. See 
SPARK-12785 and SPARK-12635. I hope this can benefit SparkR UDFs.

> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
> Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interfaces of Spark SQL and should be done 
> after merging into Spark.






[jira] [Comment Edited] (SPARK-12786) Actor demo does not demonstrate usable code

2016-01-15 Thread Brian London (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101918#comment-15101918
 ] 

Brian London edited comment on SPARK-12786 at 1/15/16 3:29 PM:
---

Yeah, exactly.  Because of the use of {{AkkaUtils}}, the settings needed to 
create a minimal actor system that can communicate with the actor stream are 
buried in the Spark code.

I believe the change at 
https://github.com/apache/spark/pull/10744/files#diff-690ab3eacd0a42fe7bee1d29c5910ffdR111
 will resolve this issue.

On a side note, are there plans to include classes to use actors as a DStream 
output as well?


was (Author: brianlondon):
Yeah, exactly.  Because of the use of `AkkaUtil` the settings needed to create 
a minimal actor system that can communicate with the actor stream is buried in 
the Spark code.

I believe the change at 
https://github.com/apache/spark/pull/10744/files#diff-690ab3eacd0a42fe7bee1d29c5910ffdR111
 will resolve this issue.

On a side note, are there plans to include classes to use actors as a DStream 
output as well?

> Actor demo does not demonstrate usable code
> ---
>
> Key: SPARK-12786
> URL: https://issues.apache.org/jira/browse/SPARK-12786
> Project: Spark
>  Issue Type: Documentation
>  Components: Streaming
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Brian London
>Priority: Minor
>
> The ActorWordCount demo doesn't show how to set up an actor-based DStream in 
> a way that can be used.  
> The demo relies on the {{AkkaUtils}} object, which is marked private[spark].  
> Thus the code presented will not compile unless users declare their code to 
> be in the org.apache.spark package. 
> Demo is located at 
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/ActorWordCount.scala
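
For what it's worth, a minimal sketch of the compile problem (my illustration; the package name below is hypothetical, and it assumes {{AkkaUtils}} lives under org.apache.spark.util, as in 1.5/1.6):
{code}
// AkkaUtils is private[spark], so this import only compiles if the file
// declares itself inside the org.apache.spark package hierarchy:
package org.apache.spark.examples.workaround  // hypothetical package name

import org.apache.spark.util.AkkaUtils
{code}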






[jira] [Commented] (SPARK-12786) Actor demo does not demonstrate usable code

2016-01-15 Thread Brian London (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101918#comment-15101918
 ] 

Brian London commented on SPARK-12786:
--

Yeah, exactly.  Because of the use of `AkkaUtils`, the settings needed to 
create a minimal actor system that can communicate with the actor stream are 
buried in the Spark code.

I believe the change at 
https://github.com/apache/spark/pull/10744/files#diff-690ab3eacd0a42fe7bee1d29c5910ffdR111
 will resolve this issue.

On a side note, are there plans to include classes to use actors as a DStream 
output as well?

> Actor demo does not demonstrate usable code
> ---
>
> Key: SPARK-12786
> URL: https://issues.apache.org/jira/browse/SPARK-12786
> Project: Spark
>  Issue Type: Documentation
>  Components: Streaming
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Brian London
>Priority: Minor
>
> The ActorWordCount demo doesn't show how to set up an actor-based DStream in 
> a way that can be used.  
> The demo relies on the {{AkkaUtils}} object, which is marked private[spark].  
> Thus the code presented will not compile unless users declare their code to 
> be in the org.apache.spark package. 
> Demo is located at 
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/ActorWordCount.scala






[jira] [Commented] (SPARK-12834) Use type conversion instead of Ser/De of Pickle to transform JavaArray and JavaList

2016-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101809#comment-15101809
 ] 

Apache Spark commented on SPARK-12834:
--

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10772

> Use type conversion instead of Ser/De of Pickle to transform JavaArray and 
> JavaList
> ---
>
> Key: SPARK-12834
> URL: https://issues.apache.org/jira/browse/SPARK-12834
> Project: Spark
>  Issue Type: Improvement
>Reporter: Xusen Yin
>
> According to the Ser/De code on the Python side:
> {code:title=StringIndexerModel|theme=FadeToGrey|linenumbers=true|language=python|firstline=0001|collapse=false}
>   def _java2py(sc, r, encoding="bytes"):
> if isinstance(r, JavaObject):
> clsName = r.getClass().getSimpleName()
> # convert RDD into JavaRDD
> if clsName != 'JavaRDD' and clsName.endswith("RDD"):
> r = r.toJavaRDD()
> clsName = 'JavaRDD'
> if clsName == 'JavaRDD':
> jrdd = sc._jvm.SerDe.javaToPython(r)
> return RDD(jrdd, sc)
> if clsName == 'DataFrame':
> return DataFrame(r, SQLContext.getOrCreate(sc))
> if clsName in _picklable_classes:
> r = sc._jvm.SerDe.dumps(r)
> elif isinstance(r, (JavaArray, JavaList)):
> try:
> r = sc._jvm.SerDe.dumps(r)
> except Py4JJavaError:
> pass  # not pickable
> if isinstance(r, (bytearray, bytes)):
> r = PickleSerializer().loads(bytes(r), encoding=encoding)
> return r
> {code}
> We use SerDe.dumps to serialize JavaArray and JavaList in PythonMLLibAPI, 
> then deserialize them with PickleSerializer on the Python side. However, there 
> is no need to transform them in such an inefficient way. Instead, we can use a 
> type conversion to convert them, e.g. list(JavaArray) or list(JavaList). 
> What's more, there is an issue with Ser/De of Scala Array, as I noted in 
> https://issues.apache.org/jira/browse/SPARK-12780
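
For illustration, a hedged sketch of the proposed direction (mine, not the actual patch): the Py4J collections could be materialized with a plain type conversion instead of the Pickle round trip:
{code}
from py4j.java_collections import JavaArray, JavaList

def convert_java_collection(r):
    """Hedged sketch: materialize a Py4J JavaArray/JavaList by iteration
    instead of SerDe.dumps on the JVM side + PickleSerializer.loads in Python."""
    if isinstance(r, (JavaArray, JavaList)):
        return list(r)  # type conversion, as suggested in the description
    return r
{code}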






[jira] [Assigned] (SPARK-12834) Use type conversion instead of Ser/De of Pickle to transform JavaArray and JavaList

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12834:


Assignee: (was: Apache Spark)

> Use type conversion instead of Ser/De of Pickle to transform JavaArray and 
> JavaList
> ---
>
> Key: SPARK-12834
> URL: https://issues.apache.org/jira/browse/SPARK-12834
> Project: Spark
>  Issue Type: Improvement
>Reporter: Xusen Yin
>
> According to the Ser/De code on the Python side:
> {code:title=StringIndexerModel|theme=FadeToGrey|linenumbers=true|language=python|firstline=0001|collapse=false}
>   def _java2py(sc, r, encoding="bytes"):
> if isinstance(r, JavaObject):
> clsName = r.getClass().getSimpleName()
> # convert RDD into JavaRDD
> if clsName != 'JavaRDD' and clsName.endswith("RDD"):
> r = r.toJavaRDD()
> clsName = 'JavaRDD'
> if clsName == 'JavaRDD':
> jrdd = sc._jvm.SerDe.javaToPython(r)
> return RDD(jrdd, sc)
> if clsName == 'DataFrame':
> return DataFrame(r, SQLContext.getOrCreate(sc))
> if clsName in _picklable_classes:
> r = sc._jvm.SerDe.dumps(r)
> elif isinstance(r, (JavaArray, JavaList)):
> try:
> r = sc._jvm.SerDe.dumps(r)
> except Py4JJavaError:
> pass  # not pickable
> if isinstance(r, (bytearray, bytes)):
> r = PickleSerializer().loads(bytes(r), encoding=encoding)
> return r
> {code}
> We use SerDe.dumps to serialize JavaArray and JavaList in PythonMLLibAPI, 
> then deserialize them with PickleSerializer on the Python side. However, there 
> is no need to transform them in such an inefficient way. Instead, we can use a 
> type conversion to convert them, e.g. list(JavaArray) or list(JavaList). 
> What's more, there is an issue with Ser/De of Scala Array, as I noted in 
> https://issues.apache.org/jira/browse/SPARK-12780






[jira] [Assigned] (SPARK-12834) Use type conversion instead of Ser/De of Pickle to transform JavaArray and JavaList

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12834:


Assignee: Apache Spark

> Use type conversion instead of Ser/De of Pickle to transform JavaArray and 
> JavaList
> ---
>
> Key: SPARK-12834
> URL: https://issues.apache.org/jira/browse/SPARK-12834
> Project: Spark
>  Issue Type: Improvement
>Reporter: Xusen Yin
>Assignee: Apache Spark
>
> According to the Ser/De code on the Python side:
> {code:title=StringIndexerModel|theme=FadeToGrey|linenumbers=true|language=python|firstline=0001|collapse=false}
>   def _java2py(sc, r, encoding="bytes"):
> if isinstance(r, JavaObject):
> clsName = r.getClass().getSimpleName()
> # convert RDD into JavaRDD
> if clsName != 'JavaRDD' and clsName.endswith("RDD"):
> r = r.toJavaRDD()
> clsName = 'JavaRDD'
> if clsName == 'JavaRDD':
> jrdd = sc._jvm.SerDe.javaToPython(r)
> return RDD(jrdd, sc)
> if clsName == 'DataFrame':
> return DataFrame(r, SQLContext.getOrCreate(sc))
> if clsName in _picklable_classes:
> r = sc._jvm.SerDe.dumps(r)
> elif isinstance(r, (JavaArray, JavaList)):
> try:
> r = sc._jvm.SerDe.dumps(r)
> except Py4JJavaError:
> pass  # not pickable
> if isinstance(r, (bytearray, bytes)):
> r = PickleSerializer().loads(bytes(r), encoding=encoding)
> return r
> {code}
> We use SerDe.dumps to serialize JavaArray and JavaList in PythonMLLibAPI, 
> then deserialize them with PickleSerializer on the Python side. However, there 
> is no need to transform them in such an inefficient way. Instead, we can use a 
> type conversion to convert them, e.g. list(JavaArray) or list(JavaList). 
> What's more, there is an issue with Ser/De of Scala Array, as I noted in 
> https://issues.apache.org/jira/browse/SPARK-12780






[jira] [Created] (SPARK-12838) fix a problem in PythonRDD.scala

2016-01-15 Thread zhanglu (JIRA)
zhanglu created SPARK-12838:
---

 Summary: fix a problem in PythonRDD.scala 
 Key: SPARK-12838
 URL: https://issues.apache.org/jira/browse/SPARK-12838
 Project: Spark
  Issue Type: Bug
Reporter: zhanglu









[jira] [Resolved] (SPARK-12838) fix a problem in PythonRDD.scala

2016-01-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12838.
---
Resolution: Invalid

> fix a problem in PythonRDD.scala 
> -
>
> Key: SPARK-12838
> URL: https://issues.apache.org/jira/browse/SPARK-12838
> Project: Spark
>  Issue Type: Bug
>Reporter: zhanglu
>







[jira] [Updated] (SPARK-12834) Use type conversion instead of Ser/De of Pickle to transform JavaArray and JavaList

2016-01-15 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-12834:
--
Description: 
According to the Ser/De code on the Python side:

{code:title=StringIndexerModel|theme=FadeToGrey|linenumbers=true|language=python|firstline=0001|collapse=false}
  def _java2py(sc, r, encoding="bytes"):
if isinstance(r, JavaObject):
clsName = r.getClass().getSimpleName()
# convert RDD into JavaRDD
if clsName != 'JavaRDD' and clsName.endswith("RDD"):
r = r.toJavaRDD()
clsName = 'JavaRDD'

if clsName == 'JavaRDD':
jrdd = sc._jvm.SerDe.javaToPython(r)
return RDD(jrdd, sc)

if clsName == 'DataFrame':
return DataFrame(r, SQLContext.getOrCreate(sc))

if clsName in _picklable_classes:
r = sc._jvm.SerDe.dumps(r)
elif isinstance(r, (JavaArray, JavaList)):
try:
r = sc._jvm.SerDe.dumps(r)
except Py4JJavaError:
pass  # not pickable

if isinstance(r, (bytearray, bytes)):
r = PickleSerializer().loads(bytes(r), encoding=encoding)
return r
{code}

We use SerDe.dumps to serialize JavaArray and JavaList in PythonMLLibAPI, then 
deserialize them with PickleSerializer on the Python side. However, there is no 
need to transform them in such an inefficient way. Instead, we can use a type 
conversion to convert them, e.g. list(JavaArray) or list(JavaList). What's more, 
there is an issue with Ser/De of Scala Array, as I noted in 
https://issues.apache.org/jira/browse/SPARK-12780

  was:
According to the Ser/De code in Python side:

{code:title=StringIndexerModel|theme=FadeToGrey|linenumbers=true|language=python|firstline=0001|collapse=false}
  def _java2py(sc, r, encoding="bytes"):
if isinstance(r, JavaObject):
clsName = r.getClass().getSimpleName()
# convert RDD into JavaRDD
if clsName != 'JavaRDD' and clsName.endswith("RDD"):
r = r.toJavaRDD()
clsName = 'JavaRDD'

if clsName == 'JavaRDD':
jrdd = sc._jvm.SerDe.javaToPython(r)
return RDD(jrdd, sc)

if clsName == 'DataFrame':
return DataFrame(r, SQLContext.getOrCreate(sc))

if clsName in _picklable_classes:
r = sc._jvm.SerDe.dumps(r)
elif isinstance(r, (JavaArray, JavaList)):
try:
r = sc._jvm.SerDe.dumps(r)
except Py4JJavaError:
pass  # not pickable

if isinstance(r, (bytearray, bytes)):
r = PickleSerializer().loads(bytes(r), encoding=encoding)
return r
{code}

We use SerDe.sumps to serialize JavaArray and JavaList in PythonMLLibAPI, then 
deserialize them with PickleSerializer in Python side. However, there is no 
need to transform them in such an inefficient way. Instead of it, we can use 
type conversion to convert them, e.g. list(JavaArray) or list(JavaList). What's 
more, there is an issue to Ser/De Scala Array as I said in 
https://issues.apache.org/jira/browse/SPARK-12780


> Use type conversion instead of Ser/De of Pickle to transform JavaArray and 
> JavaList
> ---
>
> Key: SPARK-12834
> URL: https://issues.apache.org/jira/browse/SPARK-12834
> Project: Spark
>  Issue Type: Improvement
>Reporter: Xusen Yin
>
> According to the Ser/De code on the Python side:
> {code:title=StringIndexerModel|theme=FadeToGrey|linenumbers=true|language=python|firstline=0001|collapse=false}
>   def _java2py(sc, r, encoding="bytes"):
> if isinstance(r, JavaObject):
> clsName = r.getClass().getSimpleName()
> # convert RDD into JavaRDD
> if clsName != 'JavaRDD' and clsName.endswith("RDD"):
> r = r.toJavaRDD()
> clsName = 'JavaRDD'
> if clsName == 'JavaRDD':
> jrdd = sc._jvm.SerDe.javaToPython(r)
> return RDD(jrdd, sc)
> if clsName == 'DataFrame':
> return DataFrame(r, SQLContext.getOrCreate(sc))
> if clsName in _picklable_classes:
> r = sc._jvm.SerDe.dumps(r)
> elif isinstance(r, (JavaArray, JavaList)):
> try:
> r = sc._jvm.SerDe.dumps(r)
> except Py4JJavaError:
> pass  # not pickable
> if isinstance(r, (bytearray, bytes)):
> r = PickleSerializer().loads(bytes(r), encoding=encoding)
> return r
> {code}
> We use SerDe.dumps to serialize JavaArray and JavaList in PythonMLLibAPI, 
> then deserialize them with PickleSerializer on the Python side. However, there 
> is no need to transform them in such an inefficient way. Instead, we can use a 
> type conversion to convert them, e.g. list(JavaArray) or list(JavaList). 
> What's more, there is an issue with Ser/De of Scala 

[jira] [Commented] (SPARK-11031) SparkR str() method on DataFrame objects

2016-01-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101936#comment-15101936
 ] 

Shivaram Venkataraman commented on SPARK-11031:
---

Resolved by https://github.com/apache/spark/pull/9613

> SparkR str() method on DataFrame objects
> 
>
> Key: SPARK-11031
> URL: https://issues.apache.org/jira/browse/SPARK-11031
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Oscar D. Lara Yejas
>Assignee: Oscar D. Lara Yejas
> Fix For: 1.6.1, 2.0.0
>
>







[jira] [Resolved] (SPARK-11031) SparkR str() method on DataFrame objects

2016-01-15 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-11031.
---
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.6.1

> SparkR str() method on DataFrame objects
> 
>
> Key: SPARK-11031
> URL: https://issues.apache.org/jira/browse/SPARK-11031
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Oscar D. Lara Yejas
>Assignee: Oscar D. Lara Yejas
> Fix For: 1.6.1, 2.0.0
>
>







[jira] [Updated] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-12807:
---
Priority: Critical  (was: Major)

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you see 
> a stack trace in the NM logs, indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)






[jira] [Comment Edited] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2016-01-15 Thread Amir Gur (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093151#comment-15093151
 ] 

Amir Gur edited comment on SPARK-10528 at 1/15/16 5:41 PM:
---

Should this not be reopened, given that it still happens to many folks, as the 
recent comments suggest?

[~srowen] said it is an environment issue (at 
https://issues.apache.org/jira/browse/SPARK-10528?focusedCommentId=14958759=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14958759)
 and closed it.  Some posted workarounds which worked for them.  For me none of 
those worked.

To reproduce: I get the same with either spark-shell --master local[2] or a 
plain Maven project to which I added the HiveFromSpark example from the Spark 
codebase, running on Windows 8 x64, picking Spark 1.5.2 (or 1.6.0) and 
spark-hive_2.11 (or 2.10), and running with sparkConf.setMaster("local").

I opened a debug session which shows it comes from the {{throw new 
RuntimeException}} at 
org.apache.hadoop.hive.ql.session.SessionState#createRootHDFSDir, at line 612 of 
org/spark-project/hive/hive-exec/1.2.1.spark/hive-exec-1.2.1.spark-sources.jar, 
which is:

{code}
// If the root HDFS scratch dir already exists, make sure it is writeable.
if (!((currentHDFSDirPermission.toShort() & writableHDFSDirPermission
.toShort()) == writableHDFSDirPermission.toShort())) {
  throw new RuntimeException("The root scratch dir: " + rootHDFSDirPath
  + " on HDFS should be writable. Current permissions are: " + 
currentHDFSDirPermission);
}
{code}


was (Author: agur):
Should this not be reopened given is still happens to many folks as the last 
recent comments suggest?

[~srowen] said it is env issue (at 
https://issues.apache.org/jira/browse/SPARK-10528?focusedCommentId=14958759=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14958759)
 and closed it.  Some posted workarounds which solved for them.  For me non of 
those worked.

To reproduce, getting the same with either spark-Shell --master local[2] or 
using a plain maven project to which I added the HiveFromSpark example from the 
spark codebase, running on win8x64, just picking spark 1.5.2 (or 1.6.0), 
spark-hive_2.11 (or 2.10), and running with sparkConf.setMaster("local").

> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102158#comment-15102158
 ] 

Steve Loughran commented on SPARK-12807:


We can replicate this intermittently. It all depends on classpath ordering in 
the NM. If either version's complete set of JARs is loaded first, all is well. 
If there's a mix, you get the stack trace.

The ordering can not only break the shuffle and thus dynamic allocation, it can 
stop the NM from coming up. This is generally considered a serious issue by ops 
teams.

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you see 
> a stack trace in the NM logs, indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)






[jira] [Commented] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2016-01-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102162#comment-15102162
 ] 

Sean Owen commented on SPARK-10528:
---

I'm not suggesting it's not a problem; I'm left wondering what Spark can do 
about it, though. It seems specific to Windows and Hive.

> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-






[jira] [Commented] (SPARK-12825) Spark-submit Jar URL loading fails on redirect

2016-01-15 Thread Alex Nederlof (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102163#comment-15102163
 ] 

Alex Nederlof commented on SPARK-12825:
---

It's a `307 temporary redirect` response code, returned by Nexus. 

The full command was 

{code}
spark-submit --master spark://spark.ourserver.com:6066 --deploy-mode cluster 
--class me.our.SparkJob --executor-memory 4g --driver-memory 4g 
"http://our.nexus.repo/service/local/artifact/maven/redirect?r=snapshots=me.magnet=sparkjob=LATEST;
{code}

In curl you'd see

{code}
[alex in ~]$ curl -v 
"http://our.nexus.repo/service/local/artifact/maven/redirect?r=snapshots=me.magnet=sparkjob=LATEST;
*   Trying 10.8.0.2...
* Connected to our.nexus.repo (10.8.0.2) port 80 (#0)
> GET redirect?r=snapshots=me.magnet=sparkjob=LATEST HTTP/1.1
> Host: our.nexus.repo
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 307 Temporary Redirect
< Server: nginx/1.8.0
< Date: Fri, 15 Jan 2016 17:36:58 GMT
< Content-Type: application/xml; charset=ISO-8859-1
< Content-Length: 198
< Connection: keep-alive
< X-Frame-Options: SAMEORIGIN
< X-Content-Type-Options: nosniff
< Location: 
http://our.nexus.repo/service/local/repositories/snapshots/content/me/magnet/spark-job/1.1-SNAPSHOT/spark-job-1.1-20151231.112230-92.jar
< Vary: Accept-Charset, Accept-Encoding, Accept-Language, Accept
{code}

> Spark-submit Jar URL loading fails on redirect
> --
>
> Key: SPARK-12825
> URL: https://issues.apache.org/jira/browse/SPARK-12825
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Alex Nederlof
>Priority: Minor
>
> When you use spark-submit and pass the jar as a URL, it fails when the URL 
> redirects. 
> The log prints: 
> {code}
> 16/01/14 14:26:43 INFO Utils: Fetching http://myUrl/my.jar to 
> /opt/spark/spark-1.6.0-bin-hadoop2.6/work/driver-20160114142642-0010/fetchFileTemp8495494631100918254.tmp
> {code}
> However, that file doesn't exist, but a file called "redirect" is created, 
> with the appropriate content. 
> After that, the driver fails with
> {code}
> 16/01/14 14:26:43 WARN Worker: Driver driver-20160114142642-0010 failed with 
> unrecoverable exception: java.lang.Exception: Did not see expected jar my.jar 
> in /opt/spark/spark-1.6.0-bin-hadoop2.6/work/driver-20160114142642-0010
> {code}
> Here's the related code:
> https://github.com/apache/spark/blob/56cdbd654d54bf07a063a03a5c34c4165818eeb2/core/src/main/scala/org/apache/spark/util/Utils.scala#L583-L603
> My Scala chops aren't up to this challenge, otherwise I would have made a 
> patch.






[jira] [Comment Edited] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2016-01-15 Thread Amir Gur (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093151#comment-15093151
 ] 

Amir Gur edited comment on SPARK-10528 at 1/15/16 5:45 PM:
---

Should this not be reopened, given that it still happens to many folks, as the 
recent comments suggest?

[~srowen] said it is an environment issue at [#comment-14958759] and then closed 
it.  Some posted workarounds which worked for them.  For me none of those worked.

To reproduce: I get the same with either spark-shell --master local[2] or a 
plain Maven project to which I added the HiveFromSpark example from the Spark 
codebase, running on Windows 8 x64, picking Spark 1.5.2 (or 1.6.0) and 
spark-hive_2.11 (or 2.10), and running with sparkConf.setMaster("local").

I opened a debug session which shows it comes from the {{throw new 
RuntimeException}} at 
org.apache.hadoop.hive.ql.session.SessionState#createRootHDFSDir, at line 612 of 
org/spark-project/hive/hive-exec/1.2.1.spark/hive-exec-1.2.1.spark-sources.jar, 
which is:

{code}
// If the root HDFS scratch dir already exists, make sure it is writeable.
if (!((currentHDFSDirPermission.toShort() & writableHDFSDirPermission
.toShort()) == writableHDFSDirPermission.toShort())) {
  throw new RuntimeException("The root scratch dir: " + rootHDFSDirPath
  + " on HDFS should be writable. Current permissions are: " + 
currentHDFSDirPermission);
}
{code}


was (Author: agur):
Should this not be reopened given is still happens to many folks as the last 
recent comments suggest?

[~srowen] said it is env issue at: [#comment-14958759] and then closed it.  
Some posted workarounds which solved for them.  For me non of those worked.

To reproduce, getting the same with either spark-Shell --master local[2] or 
using a plain maven project to which I added the HiveFromSpark example from the 
spark codebase, running on win8x64, just picking spark 1.5.2 (or 1.6.0), 
spark-hive_2.11 (or 2.10), and running with sparkConf.setMaster("local").

I got a debug session opened and showing it is coming from the {{throw new 
RuntimeException}} at 
org.apache.hadoop.hive.ql.session.SessionState#createRootHDFSDir at line 612 of 
org/spark-project/hive/hive-exec/1.2.1.spark/hive-exec-1.2.1.spark-sources.jar, 
which is:

{code}
// If the root HDFS scratch dir already exists, make sure it is writeable.
if (!((currentHDFSDirPermission.toShort() & writableHDFSDirPermission
.toShort()) == writableHDFSDirPermission.toShort())) {
  throw new RuntimeException("The root scratch dir: " + rootHDFSDirPath
  + " on HDFS should be writable. Current permissions are: " + 
currentHDFSDirPermission);
}
{code}

> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2016-01-15 Thread Amir Gur (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093151#comment-15093151
 ] 

Amir Gur edited comment on SPARK-10528 at 1/15/16 5:45 PM:
---

Should this not be reopened, given it still happens to many folks, as the 
recent comments suggest?

[~srowen] said it is an environment issue at: [#comment-14958759] and then 
closed it.  Some posted workarounds that solved it for them; for me, none of 
those worked.

To reproduce, I get the same failure with either spark-shell --master local[2] 
or a plain Maven project to which I added the HiveFromSpark example from the 
Spark codebase, running on Windows 8 x64, with Spark 1.5.2 (or 1.6.0), 
spark-hive_2.11 (or 2.10), and sparkConf.setMaster("local").

A debug session shows it coming from the {{throw new RuntimeException}} at 
org.apache.hadoop.hive.ql.session.SessionState#createRootHDFSDir, line 612 of 
org/spark-project/hive/hive-exec/1.2.1.spark/hive-exec-1.2.1.spark-sources.jar, 
which is:

{code}
// If the root HDFS scratch dir already exists, make sure it is writeable.
if (!((currentHDFSDirPermission.toShort() & writableHDFSDirPermission
.toShort()) == writableHDFSDirPermission.toShort())) {
  throw new RuntimeException("The root scratch dir: " + rootHDFSDirPath
  + " on HDFS should be writable. Current permissions are: " + 
currentHDFSDirPermission);
}
{code}


was (Author: agur):
Should this not be reopened given is still happens to many folks as the last 
recent comments suggest?

[~srowen] said it is env issue (at 
https://issues.apache.org/jira/browse/SPARK-10528?focusedCommentId=14958759=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14958759)
 and closed it.  Some posted workarounds which solved for them.  For me non of 
those worked.

To reproduce, getting the same with either spark-Shell --master local[2] or 
using a plain maven project to which I added the HiveFromSpark example from the 
spark codebase, running on win8x64, just picking spark 1.5.2 (or 1.6.0), 
spark-hive_2.11 (or 2.10), and running with sparkConf.setMaster("local").

I got a debug session opened and showing it is coming from the {{throw new 
RuntimeException}} at 
org.apache.hadoop.hive.ql.session.SessionState#createRootHDFSDir at line 612 of 
org/spark-project/hive/hive-exec/1.2.1.spark/hive-exec-1.2.1.spark-sources.jar, 
which is:

{code}
// If the root HDFS scratch dir already exists, make sure it is writeable.
if (!((currentHDFSDirPermission.toShort() & writableHDFSDirPermission
.toShort()) == writableHDFSDirPermission.toShort())) {
  throw new RuntimeException("The root scratch dir: " + rootHDFSDirPath
  + " on HDFS should be writable. Current permissions are: " + 
currentHDFSDirPermission);
}
{code}

> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102169#comment-15102169
 ] 

Sean Owen commented on SPARK-12807:
---

Yes, in general I'd assume Spark's classes/dependencies are supposed to come 
first in order for this to work. That certainly doesn't resolve all possible 
problems, but yes, I would expect more problems if other, older versions of the 
libraries are given precedence.

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to try to use dynamic allocation on a Hadoop 2.6-based cluster, 
> you get to see a stack trace in the NM logs, indicating a jackson 2.x version 
> mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2016-01-15 Thread Amir Gur (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102178#comment-15102178
 ] 

Amir Gur commented on SPARK-10528:
--

Thanks [~kgawrys] for the confirmation.  

[~srowen], thanks for the note; yes, agreed, it is specific to Windows and Hive.

Since this seems to be a bug in Hive per [#comment-14739302] and 
[#comment-15093151], should we re-open this one and file a Hive bug that blocks 
it, to track resolution?




> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true

2016-01-15 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102183#comment-15102183
 ] 

Zhan Zhang commented on SPARK-5159:
---

What happens if a user has valid access to a table, which is then saved in the 
catalog? Another user can then also access the table, as it is cached in the 
local Hive catalog, even if the latter does not have access to the table, 
right? To make impersonation really work, all the information has to be tagged 
by user, right?

> Thrift server does not respect hive.server2.enable.doAs=true
> 
>
> Key: SPARK-5159
> URL: https://issues.apache.org/jira/browse/SPARK-5159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Andrew Ray
> Attachments: spark_thrift_server_log.txt
>
>
> I'm currently testing the spark sql thrift server on a kerberos secured 
> cluster in YARN mode. Currently any user can access any table regardless of 
> HDFS permissions as all data is read as the hive user. In HiveServer2 the 
> property hive.server2.enable.doAs=true causes all access to be done as the 
> submitting user. We should do the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true

2016-01-15 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102183#comment-15102183
 ] 

Zhan Zhang edited comment on SPARK-5159 at 1/15/16 5:50 PM:


What happens if a user has valid access to a table, which is then saved in the 
catalog? Another user can then also access the table, as it is cached in the 
local Hive catalog, even if the latter does not have access to the table 
metadata, right? To make impersonation work, all the information has to be 
tagged by user, right?


was (Author: zzhan):
What happen if an user have a valid visit to a table, which will be saved in 
catalog. Another user then also can visit the table as it is cached in local 
hivecatalog, even if the latter does not have the access to the table, right? 
To make the impersonate to really work, all the information has to be tagged by 
user, right?

> Thrift server does not respect hive.server2.enable.doAs=true
> 
>
> Key: SPARK-5159
> URL: https://issues.apache.org/jira/browse/SPARK-5159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Andrew Ray
> Attachments: spark_thrift_server_log.txt
>
>
> I'm currently testing the spark sql thrift server on a kerberos secured 
> cluster in YARN mode. Currently any user can access any table regardless of 
> HDFS permissions as all data is read as the hive user. In HiveServer2 the 
> property hive.server2.enable.doAs=true causes all access to be done as the 
> submitting user. We should do the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2016-01-15 Thread Amir Gur (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102178#comment-15102178
 ] 

Amir Gur edited comment on SPARK-10528 at 1/15/16 5:54 PM:
---

Thanks [~kgawrys] for the confirmation.  

[~srowen], thanks for the note; yes, agreed, it is specific to Windows and Hive.

Since this seems to be a bug in Hive per [#comment-14739302] and 
[#comment-15093151], should we re-open this one and file a Hive bug that blocks 
it, to track resolution?

We can also keep looking for a Spark-level workaround or fix, as [~srowen] 
suggested.

I also recall that the Spark 1.5 presentations of Spark SQL and DataFrames 
mentioned getting rid of the significant dependency on the Hive codebase and a 
re-write of that stack, which was apparently a great success overall for 
functionality and performance. This dependency, however, is still around; I am 
not familiar enough with the codebase to tell, and am just estimating that this 
one is unrelated to the successful Spark SQL re-design/re-write.



was (Author: agur):
Thanks [~kgawrys] for the confirmation.  

[~srowen], thanks for the note, yes agree it is specific for windows and hive.

Since this seems to be a bug in Hive per [#comment-14739302] and 
[#comment-15093151] - should we re-open this one and file a hive bug blocking 
it to track resolution?




> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2016-01-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102190#comment-15102190
 ] 

Sean Owen commented on SPARK-10528:
---

I don't see value in reopening this, as there is no action in Spark to track. 
However, it can certainly be linked.

> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2016-01-15 Thread Amir Gur (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093151#comment-15093151
 ] 

Amir Gur edited comment on SPARK-10528 at 1/15/16 5:28 PM:
---

Should this not be reopened, given it still happens to many folks, as the 
recent comments suggest?

[~srowen] said it is an environment issue (at 
https://issues.apache.org/jira/browse/SPARK-10528?focusedCommentId=14958759=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14958759)
 and closed it.  Some posted workarounds that solved it for them; for me, none 
of those worked.

To reproduce, I get the same failure with either spark-shell --master local[2] 
or a plain Maven project to which I added the HiveFromSpark example from the 
Spark codebase, running on Windows 8 x64, with Spark 1.5.2 (or 1.6.0), 
spark-hive_2.11 (or 2.10), and sparkConf.setMaster("local").


was (Author: agur):
Should this not be reopened given is still happens to many folks as the last 
recent comments suggest?

[~srowen] said it is env issue (at 
https://issues.apache.org/jira/browse/SPARK-10528?focusedCommentId=14958759=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14958759)
 and closed it.  Some posted workarounds which solved for them.  For me non of 
those worked.

To reproduce I am using a plain maven project to which I added the 
HiveFromSpark example from the spark codebase, running on win8x64, just picking 
spark 1.5.2, spark-hive_2.11, and running with sparkConf.setMaster("local").

> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9625) SparkILoop creates sql context continuously, thousands of times

2016-01-15 Thread Alex Spencer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101992#comment-15101992
 ] 

Alex Spencer commented on SPARK-9625:
-

I'm getting this same problem today on Spark 1.3.0. It doesn't look like this 
was closed with a fix; rather, Simeon, the OP, closed it himself?

It's reproducible for me; I can post code if needed.

> SparkILoop creates sql context continuously, thousands of times
> ---
>
> Key: SPARK-9625
> URL: https://issues.apache.org/jira/browse/SPARK-9625
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>  Labels: sql
>
> Occasionally but repeatably, based on the Spark SQL operations being run, 
> {{spark-shell}} gets into a funk where it attempts to create a sql context 
> over and over again as it is doing its work. Example output below:
> {code}
> 15/08/05 03:04:12 INFO DAGScheduler: looking for newly runnable stages
> 15/08/05 03:04:12 INFO DAGScheduler: running: Set()
> 15/08/05 03:04:12 INFO DAGScheduler: waiting: Set(ShuffleMapStage 7, 
> ResultStage 8)
> 15/08/05 03:04:12 INFO DAGScheduler: failed: Set()
> 15/08/05 03:04:12 INFO DAGScheduler: Missing parents for ShuffleMapStage 7: 
> List()
> 15/08/05 03:04:12 INFO DAGScheduler: Missing parents for ResultStage 8: 
> List(ShuffleMapStage 7)
> 15/08/05 03:04:12 INFO DAGScheduler: Submitting ShuffleMapStage 7 
> (MapPartitionsRDD[49] at map at :474), which is now runnable
> 15/08/05 03:04:12 INFO MemoryStore: ensureFreeSpace(47840) called with 
> curMem=685306, maxMem=26671746908
> 15/08/05 03:04:12 INFO MemoryStore: Block broadcast_12 stored as values in 
> memory (estimated size 46.7 KB, free 24.8 GB)
> 15/08/05 03:04:12 INFO MemoryStore: ensureFreeSpace(15053) called with 
> curMem=733146, maxMem=26671746908
> 15/08/05 03:04:12 INFO MemoryStore: Block broadcast_12_piece0 stored as bytes 
> in memory (estimated size 14.7 KB, free 24.8 GB)
> 15/08/05 03:04:12 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory 
> on localhost:39451 (size: 14.7 KB, free: 24.8 GB)
> 15/08/05 03:04:12 INFO SparkContext: Created broadcast 12 from broadcast at 
> DAGScheduler.scala:874
> 15/08/05 03:04:12 INFO DAGScheduler: Submitting 1 missing tasks from 
> ShuffleMapStage 7 (MapPartitionsRDD[49] at map at :474)
> 15/08/05 03:04:12 INFO TaskSchedulerImpl: Adding task set 7.0 with 1 tasks
> 15/08/05 03:04:12 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 
> 684, localhost, PROCESS_LOCAL, 1461 bytes)
> 15/08/05 03:04:12 INFO Executor: Running task 0.0 in stage 7.0 (TID 684)
> 15/08/05 03:04:12 INFO ShuffleBlockFetcherIterator: Getting 214 non-empty 
> blocks out of 214 blocks
> 15/08/05 03:04:12 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches 
> in 1 ms
> 15/08/05 03:04:12 INFO HiveContext: Initializing execution hive, version 
> 0.13.1
> 15/08/05 03:04:13 INFO HiveMetaStore: No user is added in admin role, since 
> config is empty
> 15/08/05 03:04:13 INFO SessionState: No Tez session required at this point. 
> hive.execution.engine=mr.
> 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
> 0.13.1
> 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
> 0.13.1
> 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
> 0.13.1
> 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
> 0.13.1
> 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
> 0.13.1
> 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
> 0.13.1
> 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
> 0.13.1
> 15/08/05 03:04:13 INFO SparkILoop: Created sql context (with Hive support)..
> SQL context available as sqlContext.
> 15/08/05 03:04:13 INFO HiveContext: Initializing execution hive, version 
> 0.13.1
> 15/08/05 03:04:13 INFO SparkILoop: Created 

[jira] [Updated] (SPARK-11031) SparkR str() method on DataFrame objects

2016-01-15 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-11031:
--
Assignee: Oscar D. Lara Yejas

> SparkR str() method on DataFrame objects
> 
>
> Key: SPARK-11031
> URL: https://issues.apache.org/jira/browse/SPARK-11031
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Oscar D. Lara Yejas
>Assignee: Oscar D. Lara Yejas
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch

2016-01-15 Thread Sanket Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101995#comment-15101995
 ] 

Sanket Reddy commented on SPARK-6166:
-

Hi, I modified the code to fit the latest Spark build; I will have the patch up 
soon.

> Add config to limit number of concurrent outbound connections for shuffle 
> fetch
> ---
>
> Key: SPARK-6166
> URL: https://issues.apache.org/jira/browse/SPARK-6166
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Mridul Muralidharan
>Assignee: Shixiong Zhu
>Priority: Minor
>
> spark.reducer.maxMbInFlight puts a bound on the in-flight data in terms of 
> size.
> But this is not always sufficient: when the number of hosts in the cluster 
> increases, this can lead to a very large number of inbound connections to one 
> or more nodes, causing workers to fail under the load.
> I propose we also add a spark.reducer.maxReqsInFlight, which puts a bound on 
> the number of outstanding outbound connections.
> This might still cause hotspots in the cluster, but in our tests this has 
> significantly reduced the occurrence of worker failures.
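
For illustration only (hedged: spark.reducer.maxReqsInFlight is the property 
name proposed above and did not exist as a released setting at the time of this 
discussion), the two bounds would sit side by side in the application 
configuration:

{code}
import org.apache.spark.SparkConf

// Existing size-based bound plus the count-based bound proposed in this issue.
// The second key is the proposal itself, not a released Spark setting.
val conf = new SparkConf()
  .set("spark.reducer.maxMbInFlight", "48")    // MB of map output fetched concurrently
  .set("spark.reducer.maxReqsInFlight", "64")  // proposed cap on concurrent fetch requests
{code}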



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2016-01-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102243#comment-15102243
 ] 

Sean Owen commented on SPARK-10528:
---

This JIRA tracks it already. More JIRAs don't help; they tend to diffuse the 
conversation and don't make anything happen.

> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2016-01-15 Thread Amir Gur (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102247#comment-15102247
 ] 

Amir Gur commented on SPARK-10528:
--

Sure, that's fine; let's find the Spark-level solution on this one, then.

> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2016-01-15 Thread Amir Gur (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102212#comment-15102212
 ] 

Amir Gur edited comment on SPARK-10528 at 1/15/16 6:24 PM:
---

As long as there is neither a Spark workaround nor a root-cause fix at the Hive 
level, what is going to track the work still needed on this, which is still 
blocking users?
 
How about we keep this one open as a rollup and open two new issues under it: 
one focused on finding a Spark-level solution/workaround, and a second that 
links to a blocking Hive issue?  Or we use this one for finding the Spark-level 
solution, in which case there is no need for a new ticket.

If you prefer this one specifically closed, that's fine too, as long as other 
open tickets track all of that and are linked here.  (Though I find it a bit 
confusing that the detailed issue, the one a Google search would lead to, will 
appear closed.  People will still figure it out on a second look at the *Issue 
Links* of the current issue, so that's OK too.  Anything works as long as we 
have active tickets to solve the problems!)


was (Author: agur):
As long as there is no spark workaround + nor root-cause-hive-level solutiuon, 
then what's going to track the work still needed on this which is still 
blocking users?
 
How about we keep this one as an opened rollup, open two new ones under it, one 
focused at finding a spark-level solution/workaround, second a blocking link 
into a hive issue?

If you prefer this one specifically closed that's fine too, as long as other 
opened tickets are tracking all those and are linked here.  (Though I find that 
a bit more confusing that the detailed one which is the one a google search 
would lead to will appear closed.  Then ppl will still figure it out on a 
second look at the *Issue Links* of the current issue - so that's ok too.  
Anything works as long as we got active tickets to solve the problems!)

> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12783) Dataset map serialization error

2016-01-15 Thread Muthu Jayakumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102351#comment-15102351
 ] 

Muthu Jayakumar commented on SPARK-12783:
-

I tried the following, but got a similar error...

{code}
case class MyMap(map: scala.collection.immutable.Map[String, String])

case class TestCaseClass(a: String, b: String){
  def toMyMap: MyMap = {
MyMap(Map(a->b))
  }

  def toStr: String = {
a
  }
}

//main thread...
val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), 
TestCaseClass("2015-05-01", "data2"))).toDF() 
//.withColumn("swh_date_to_common_request_id_map", f1(col("_1"), col("_2"))).drop("_1").drop("_2")
  df1.as[TestCaseClass].map(_.toStr).show() //works fine
  df1.as[TestCaseClass].map(_.toMyMap).show() //error
  df1.as[TestCaseClass].map(each=> each.a -> each.b).show() //works fine
{code}

> Dataset map serialization error
> ---
>
> Key: SPARK-12783
> URL: https://issues.apache.org/jira/browse/SPARK-12783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Muthu Jayakumar
>Assignee: Wenchen Fan
>Priority: Critical
>
> When Dataset API is used to map to another case class, an error is thrown.
> {code}
> case class MyMap(map: Map[String, String])
> case class TestCaseClass(a: String, b: String){
>   def toMyMap: MyMap = {
> MyMap(Map(a->b))
>   }
>   def toStr: String = {
> a
>   }
> }
> //Main method section below
> import sqlContext.implicits._
> val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), 
> TestCaseClass("2015-05-01", "data2"))).toDF()
> df1.as[TestCaseClass].map(_.toStr).show() //works fine
> df1.as[TestCaseClass].map(_.toMyMap).show() //fails
> {code}
> Error message:
> {quote}
> Caused by: java.io.NotSerializableException: 
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1
> Serialization stack:
>   - object not serializable (class: 
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1, value: 
> package lang)
>   - field (class: scala.reflect.internal.Types$ThisType, name: sym, type: 
> class scala.reflect.internal.Symbols$Symbol)
>   - object (class scala.reflect.internal.Types$UniqueThisType, 
> java.lang.type)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: 
> class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$ClassNoArgsTypeRef, String)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: normalized, 
> type: class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$AliasNoArgsTypeRef, String)
>   - field (class: 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, name: keyType$1, 
> type: class scala.reflect.api.Types$TypeApi)
>   - object (class 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, )
>   - field (class: org.apache.spark.sql.catalyst.expressions.MapObjects, 
> name: function, type: interface scala.Function1)
>   - object (class org.apache.spark.sql.catalyst.expressions.MapObjects, 
> mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType))
>   - field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name: 
> targetObject, type: class 
> org.apache.spark.sql.catalyst.expressions.Expression)
>   - object (class org.apache.spark.sql.catalyst.expressions.Invoke, 
> invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;)))
>   - writeObject data (class: 
> scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.List$SerializationProxy, 
> scala.collection.immutable.List$SerializationProxy@4c7e3aab)
>   - writeReplace data (class: 
> scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.$colon$colon, 
> List(invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;)), 
> invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),valueArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  

[jira] [Commented] (SPARK-12783) Dataset map serialization error

2016-01-15 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102361#comment-15102361
 ] 

kevin yu commented on SPARK-12783:
--

Hello Muthu: do the import first; it seems to work.
scala> import scala.collection.Map
import scala.collection.Map



scala> case class MyMap(map: Map[String, String]) 
defined class MyMap

scala> 

scala> case class TestCaseClass(a: String, b: String)  {
 |   def toMyMap: MyMap = {
 | MyMap(Map(a->b))
 |   }
 | 
 |   def toStr: String = {
 | a
 |   }
 | }
defined class TestCaseClass

scala> val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", 
"data1"), TestCaseClass("2015-05-01", "data2"))).toDF()
df1: org.apache.spark.sql.DataFrame = [a: string, b: string]

scala> df1.as[TestCaseClass].map(_.toMyMap).show() 
+--------------------+
|                 map|
+--------------------+
|Map(2015-05-01 ->...|
|Map(2015-05-01 ->...|
+--------------------+
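
For completeness, the tuple-returning variant from the earlier reproduction also 
avoids the failing Map encoder path. A hedged sketch, assuming the same 
sqlContext and TestCaseClass as above:

{code}
import sqlContext.implicits._

// Map to a plain (String, String) tuple instead of a case class wrapping an
// immutable.Map; per the reproduction above, this path works in 1.6.0.
val ds = sqlContext.createDataset(Seq(
  TestCaseClass("2015-05-01", "data1"),
  TestCaseClass("2015-05-01", "data2")))

ds.map(each => each.a -> each.b).show()
{code}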


> Dataset map serialization error
> ---
>
> Key: SPARK-12783
> URL: https://issues.apache.org/jira/browse/SPARK-12783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Muthu Jayakumar
>Assignee: Wenchen Fan
>Priority: Critical
>
> When Dataset API is used to map to another case class, an error is thrown.
> {code}
> case class MyMap(map: Map[String, String])
> case class TestCaseClass(a: String, b: String){
>   def toMyMap: MyMap = {
> MyMap(Map(a->b))
>   }
>   def toStr: String = {
> a
>   }
> }
> //Main method section below
> import sqlContext.implicits._
> val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), 
> TestCaseClass("2015-05-01", "data2"))).toDF()
> df1.as[TestCaseClass].map(_.toStr).show() //works fine
> df1.as[TestCaseClass].map(_.toMyMap).show() //fails
> {code}
> Error message:
> {quote}
> Caused by: java.io.NotSerializableException: 
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1
> Serialization stack:
>   - object not serializable (class: 
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1, value: 
> package lang)
>   - field (class: scala.reflect.internal.Types$ThisType, name: sym, type: 
> class scala.reflect.internal.Symbols$Symbol)
>   - object (class scala.reflect.internal.Types$UniqueThisType, 
> java.lang.type)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: 
> class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$ClassNoArgsTypeRef, String)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: normalized, 
> type: class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$AliasNoArgsTypeRef, String)
>   - field (class: 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, name: keyType$1, 
> type: class scala.reflect.api.Types$TypeApi)
>   - object (class 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, )
>   - field (class: org.apache.spark.sql.catalyst.expressions.MapObjects, 
> name: function, type: interface scala.Function1)
>   - object (class org.apache.spark.sql.catalyst.expressions.MapObjects, 
> mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType))
>   - field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name: 
> targetObject, type: class 
> org.apache.spark.sql.catalyst.expressions.Expression)
>   - object (class org.apache.spark.sql.catalyst.expressions.Invoke, 
> invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;)))
>   - writeObject data (class: 
> scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.List$SerializationProxy, 
> scala.collection.immutable.List$SerializationProxy@4c7e3aab)
>   - writeReplace data (class: 
> scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.$colon$colon, 
> List(invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;)), 
> invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: 

[jira] [Created] (SPARK-12840) Support pass any object into codegen as reference

2016-01-15 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12840:
--

 Summary: Support pass any object into codegen as reference
 Key: SPARK-12840
 URL: https://issues.apache.org/jira/browse/SPARK-12840
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu
Assignee: Davies Liu


Right now, we only support expression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12835) StackOverflowError when aggregating over column from window function

2016-01-15 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102358#comment-15102358
 ] 

Herman van Hovell commented on SPARK-12835:
---

Kalle, you are not wrong; this should work. Note, though, that using a 
non-partitioned window can lead to serious performance problems (all data will 
be shipped to a single node).

Could you attach the stack trace to the JIRA? That would help in diagnosing 
this problem.

> StackOverflowError when aggregating over column from window function
> 
>
> Key: SPARK-12835
> URL: https://issues.apache.org/jira/browse/SPARK-12835
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Kalle Jepsen
>
> I am encountering a StackoverflowError with a very long traceback, when I try 
> to directly aggregate on a column created by a window function.
> E.g. I am trying to determine the average timespan between dates in a 
> Dataframe column by using a window-function:
> {code}
> from pyspark import SparkContext
> from pyspark.sql import HiveContext, Window, functions
> from datetime import datetime
> sc = SparkContext()
> sq = HiveContext(sc)
> data = [
> [datetime(2014,1,1)],
> [datetime(2014,2,1)],
> [datetime(2014,3,1)],
> [datetime(2014,3,6)],
> [datetime(2014,8,23)],
> [datetime(2014,10,1)],
> ]
> df = sq.createDataFrame(data, schema=['ts'])
> ts = functions.col('ts')
>
> w = Window.orderBy(ts)
> diff = functions.datediff(
> ts,
> functions.lag(ts, count=1).over(w)
> )
> avg_diff = functions.avg(diff)
> {code}
> While {{df.select(diff.alias('diff')).show()}} correctly renders as
> {noformat}
> +----+
> |diff|
> +----+
> |null|
> |  31|
> |  28|
> |   5|
> | 170|
> |  39|
> +----+
> {noformat}
> doing {code}
> df.select(avg_diff).show()
> {code} throws a {{java.lang.StackOverflowError}}.
> When I say
> {code}
> df2 = df.select(diff.alias('diff'))
> df2.select(functions.avg('diff'))
> {code}
> however, there's no error.
> Am I wrong to assume that the above should work?
> I've already described the same in [this question on 
> stackoverflow.com|http://stackoverflow.com/questions/34793999/averaging-over-window-function-leads-to-stackoverflowerror].
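
For reference, a hedged Scala sketch of the partitioned form mentioned above. 
The "key" partitioning column is a hypothetical addition (the report uses an 
unpartitioned window), and the window result is materialized before 
aggregating, which is the same two-step workaround that avoids the 
StackOverflowError:

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col, datediff, lag}

// Compute the average gap between consecutive "ts" values per "key",
// materializing the lag/diff column before aggregating over it.
def avgGap(df: DataFrame): DataFrame = {
  val w = Window.partitionBy("key").orderBy("ts")
  val diff = datediff(col("ts"), lag(col("ts"), 1).over(w))
  df.select(col("key"), diff.as("diff"))
    .groupBy("key")
    .agg(avg("diff").as("avg_diff"))
}
{code}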



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.

2016-01-15 Thread Amir Gur (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102212#comment-15102212
 ] 

Amir Gur commented on SPARK-10528:
--

As long as there is neither a Spark workaround nor a root-cause fix at the Hive 
level, what is going to track the work still needed on this, which is still 
blocking users?
 
How about we keep this one open as a rollup and open two new issues under it: 
one focused on finding a Spark-level solution/workaround, and a second that 
links to a blocking Hive issue?

If you prefer this one specifically closed, that's fine too, as long as other 
open tickets track all of that and are linked here.  (Though I find it a bit 
confusing that the detailed issue, the one a Google search would lead to, will 
appear closed.  People will still figure it out on a second look at the *Issue 
Links* of the current issue, so that's OK too.  Anything works as long as we 
have active tickets to solve the problems!)

> spark-shell throws java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable.
> --
>
> Key: SPARK-10528
> URL: https://issues.apache.org/jira/browse/SPARK-10528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Windows 7 x64
>Reporter: Aliaksei Belablotski
>Priority: Minor
>
> Starting spark-shell throws
> java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: 
> /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true

2016-01-15 Thread Greg Senia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102267#comment-15102267
 ] 

Greg Senia commented on SPARK-5159:
---

[~zhanzhang], [~luciano resende] and [~ilovesoup], I think this is part of the 
larger issue of Kerberos-secured datasets in a cluster, whether as RDDs with 
Spark or as longer-running transactions with LLAP and Hive. Being able to share 
datasets between users based on, say, group membership would be a great answer, 
but I'm guessing some things would need redesign to make that work.

> Thrift server does not respect hive.server2.enable.doAs=true
> 
>
> Key: SPARK-5159
> URL: https://issues.apache.org/jira/browse/SPARK-5159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Andrew Ray
> Attachments: spark_thrift_server_log.txt
>
>
> I'm currently testing the spark sql thrift server on a kerberos secured 
> cluster in YARN mode. Currently any user can access any table regardless of 
> HDFS permissions as all data is read as the hive user. In HiveServer2 the 
> property hive.server2.enable.doAs=true causes all access to be done as the 
> submitting user. We should do the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12839) Implement CoSelect for Feature Selection and Instance Selection

2016-01-15 Thread Morgan Funtowicz (JIRA)
Morgan Funtowicz created SPARK-12839:


 Summary: Implement CoSelect for Feature Selection and Instance 
Selection
 Key: SPARK-12839
 URL: https://issues.apache.org/jira/browse/SPARK-12839
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Morgan Funtowicz
Priority: Minor


I recently implemented the CoSelect framework 
(http://www.public.asu.edu/~jtang20/publication/coselect.pdf) in Matlab for 
internal use at my school. 

As this framework can perform both feature selection and instance selection, 
on social data or otherwise, I think it could be an interesting addition to 
Spark ML/MLlib in these areas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12839) Implement CoSelect for Feature Selection and Instance Selection

2016-01-15 Thread Morgan Funtowicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102295#comment-15102295
 ] 

Morgan Funtowicz commented on SPARK-12839:
--

If this feature sounds interesting to you, I would like to take a shot at 
implementing it.

> Implement CoSelect for Feature Selection and Instance Selection
> ---
>
> Key: SPARK-12839
> URL: https://issues.apache.org/jira/browse/SPARK-12839
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Morgan Funtowicz
>Priority: Minor
>
> I recently implemented the CoSelect framework 
> (http://www.public.asu.edu/~jtang20/publication/coselect.pdf) in Matlab for 
> internal use at my school. 
> As this framework can perform both feature selection and instance selection, 
> on social data or otherwise, I think it could be an interesting addition to 
> Spark ML/MLlib in these areas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12783) Dataset map serialization error

2016-01-15 Thread Muthu Jayakumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102351#comment-15102351
 ] 

Muthu Jayakumar edited comment on SPARK-12783 at 1/15/16 7:34 PM:
--

I tried the following, but got a similar error...

{code}
case class MyMap(map: scala.collection.immutable.Map[String, String])

case class TestCaseClass(a: String, b: String){
  def toMyMap: MyMap = {
MyMap(Map(a->b))
  }


  def toStr: String = {
a
  }
}

//main thread...
val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), 
TestCaseClass("2015-05-01", "data2"))).toDF() 
//.withColumn("swh_date_to_common_request_id_map", f1(col("_1"), col("_2"))).drop("_1").drop("_2")
  df1.as[TestCaseClass].map(_.toStr).show() //works fine
  df1.as[TestCaseClass].map(_.toMyMap).show() //error
  df1.as[TestCaseClass].map(each=> each.a -> each.b).show() //works fine
{code}

{quote}
Serialization stack:
- object not serializable (class: 
scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1, value: 
package lang)
- field (class: scala.reflect.internal.Types$ThisType, name: sym, type: 
class scala.reflect.internal.Symbols$Symbol)
- object (class scala.reflect.internal.Types$UniqueThisType, 
java.lang.type)
- field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: 
class scala.reflect.internal.Types$Type)
- object (class scala.reflect.internal.Types$ClassNoArgsTypeRef, String)
- field (class: scala.reflect.internal.Types$TypeRef, name: normalized, 
type: class scala.reflect.internal.Types$Type)
- object (class scala.reflect.internal.Types$AliasNoArgsTypeRef, String)
- field (class: 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, name: keyType$1, 
type: class scala.reflect.api.Types$TypeApi)
- object (class 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, )
- field (class: org.apache.spark.sql.catalyst.expressions.MapObjects, 
name: function, type: interface scala.Function1)
- object (class org.apache.spark.sql.catalyst.expressions.MapObjects, 
mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- 
field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
"collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType))
- field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name: 
targetObject, type: class org.apache.spark.sql.catalyst.expressions.Expression)
- object (class org.apache.spark.sql.catalyst.expressions.Invoke, 
invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
 field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
"collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
 [Ljava.lang.Object;)))
- writeObject data (class: 
scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.List$SerializationProxy, 
scala.collection.immutable.List$SerializationProxy@2660f093)
- writeReplace data (class: 
scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.$colon$colon, 
List(invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
 field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
"collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
 [Ljava.lang.Object;)), 
invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
 field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
"collector.MyMap"),valueArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
 [Ljava.lang.Object;
- field (class: org.apache.spark.sql.catalyst.expressions.StaticInvoke, 
name: arguments, type: interface scala.collection.Seq)
- object (class org.apache.spark.sql.catalyst.expressions.StaticInvoke, 
staticinvoke(class 
org.apache.spark.sql.catalyst.util.ArrayBasedMapData$,ObjectType(interface 
scala.collection.Map),toScalaMap,invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
 field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
"collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
 
[Ljava.lang.Object;)),invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
 field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
"collector.MyMap"),valueArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
 [Ljava.lang.Object;)),true))
- writeObject data (class: 
scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.List$SerializationProxy, 
scala.collection.immutable.List$SerializationProxy@72af5ac7)
- 

[jira] [Commented] (SPARK-12833) Initial import of databricks/spark-csv

2016-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102407#comment-15102407
 ] 

Apache Spark commented on SPARK-12833:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/10774

> Initial import of databricks/spark-csv
> --
>
> Key: SPARK-12833
> URL: https://issues.apache.org/jira/browse/SPARK-12833
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Hossein Falaki
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102430#comment-15102430
 ] 

Sean Owen commented on SPARK-12807:
---

Are you asking whether it's possible, for a possible explanation, or for a 
workaround?
I'm still not sure why it's a problem (now). For example, people seem to be 
running the Spark shuffle just fine with recent Hadoop.

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to try to use dynamic allocation on a Hadoop 2.6-based cluster, 
> you get to see a stack trace in the NM logs, indicating a jackson 2.x version 
> mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12667) Remove block manager's internal "external block store" API

2016-01-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-12667.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10752
[https://github.com/apache/spark/pull/10752]

> Remove block manager's internal "external block store" API
> --
>
> Key: SPARK-12667
> URL: https://issues.apache.org/jira/browse/SPARK-12667
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102416#comment-15102416
 ] 

Maciej Bryński commented on SPARK-12807:


Sean,
Maybe it's possible to compile the YARN shuffle service with a different 
version of Jackson than the one used by Spark Core?
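
To make the shading idea concrete, a hedged sketch (the sbt-assembly syntax 
below is purely illustrative; the actual network-yarn module is built with 
Maven, where the maven-shade-plugin would play the same role): relocating 
Jackson inside the shuffle jar keeps it from clashing with whatever Jackson the 
NodeManager ships.

{code}
// Illustrative build.sbt fragment: relocate the Jackson classes bundled into
// the yarn shuffle jar so the NodeManager's Jackson 2.2.3 is never picked up
// in their place. The relocation prefix is an arbitrary example.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.fasterxml.jackson.**" -> "org.sparkproject.shaded.jackson.@1").inAll
)
{code}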

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you see 
> a stack trace in the NM logs indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12842) Add Hadoop 2.7 build profile

2016-01-15 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-12842:
--

 Summary: Add Hadoop 2.7 build profile
 Key: SPARK-12842
 URL: https://issues.apache.org/jira/browse/SPARK-12842
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Josh Rosen
Assignee: Josh Rosen


We should add a Hadoop 2.7 build profile so that we can automate tests against 
Hadoop 2.7.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102441#comment-15102441
 ] 

Maciej Bryński commented on SPARK-12807:


I'm asking if it's possible.

About running the Spark shuffle: did you miss the link to 
https://issues.apache.org/jira/browse/SPARK-9439 ?
The problem started with Spark 1.6.0, because it's the first version of Spark 
where the Spark shuffle has a Jackson dependency.


> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you see 
> a stack trace in the NM logs indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12833) Initial import of databricks/spark-csv

2016-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12833.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Initial import of databricks/spark-csv
> --
>
> Key: SPARK-12833
> URL: https://issues.apache.org/jira/browse/SPARK-12833
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Hossein Falaki
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12835) StackOverflowError when aggregating over column from window function

2016-01-15 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102386#comment-15102386
 ] 

Herman van Hovell commented on SPARK-12835:
---

I can reproduce your problem with the following Scala code:
{noformat}
import java.sql.Date

import org.apache.spark.sql.expressions.Window

val df = Seq(
(Date.valueOf("2014-01-01")),
(Date.valueOf("2014-02-01")),
(Date.valueOf("2014-03-01")),
(Date.valueOf("2014-03-06")),
(Date.valueOf("2014-08-23")),
(Date.valueOf("2014-10-01"))).
map(Tuple1.apply).
toDF("ts")

// This doesn't work:
df.select(avg(datediff($"ts", lag($"ts", 1).over(Window.orderBy($"ts"))))).show

// This does work:
df.select(datediff($"ts", lag($"ts", 1).over(Window.orderBy($"ts"))).as("diff"))
  .select(avg($"diff"))
  .show
{noformat}

It seems there is a small bug in the analyzer.

> StackOverflowError when aggregating over column from window function
> 
>
> Key: SPARK-12835
> URL: https://issues.apache.org/jira/browse/SPARK-12835
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Kalle Jepsen
>
> I am encountering a StackOverflowError with a very long traceback when I try 
> to aggregate directly on a column created by a window function.
> E.g. I am trying to determine the average timespan between dates in a 
> DataFrame column by using a window function:
> {code}
> from pyspark import SparkContext
> from pyspark.sql import HiveContext, Window, functions
> from datetime import datetime
> sc = SparkContext()
> sq = HiveContext(sc)
> data = [
> [datetime(2014,1,1)],
> [datetime(2014,2,1)],
> [datetime(2014,3,1)],
> [datetime(2014,3,6)],
> [datetime(2014,8,23)],
> [datetime(2014,10,1)],
> ]
> df = sq.createDataFrame(data, schema=['ts'])
> ts = functions.col('ts')
>
> w = Window.orderBy(ts)
> diff = functions.datediff(
> ts,
> functions.lag(ts, count=1).over(w)
> )
> avg_diff = functions.avg(diff)
> {code}
> While {{df.select(diff.alias('diff')).show()}} correctly renders as
> {noformat}
> +----+
> |diff|
> +----+
> |null|
> |  31|
> |  28|
> |   5|
> | 170|
> |  39|
> +----+
> {noformat}
> doing {code}
> df.select(avg_diff).show()
> {code} throws a {{java.lang.StackOverflowError}}.
> When I say
> {code}
> df2 = df.select(diff.alias('diff'))
> df2.select(functions.avg('diff'))
> {code}
> however, there's no error.
> Am I wrong to assume that the above should work?
> I've already described the same in [this question on 
> stackoverflow.com|http://stackoverflow.com/questions/34793999/averaging-over-window-function-leads-to-stackoverflowerror].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12841) UnresolvedException with cast

2016-01-15 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-12841:


 Summary: UnresolvedException with cast
 Key: SPARK-12841
 URL: https://issues.apache.org/jira/browse/SPARK-12841
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Michael Armbrust
Assignee: Wenchen Fan
Priority: Blocker


{code}
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.where(df1.col("single").cast("string").equalTo("1"))
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12842) Add Hadoop 2.7 build profile

2016-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102424#comment-15102424
 ] 

Apache Spark commented on SPARK-12842:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10775

> Add Hadoop 2.7 build profile
> 
>
> Key: SPARK-12842
> URL: https://issues.apache.org/jira/browse/SPARK-12842
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> We should add a Hadoop 2.7 build profile so that we can automate tests 
> against Hadoop 2.7.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102441#comment-15102441
 ] 

Maciej Bryński edited comment on SPARK-12807 at 1/15/16 8:43 PM:
-

I'm asking if it's possible.

About running the Spark shuffle: did you miss the link to 
https://issues.apache.org/jira/browse/SPARK-9439 ?
The problem started with Spark 1.6.0, because it's the first version of Spark 
where the shuffle has a Jackson dependency.



was (Author: maver1ck):
I'm asking if it's possible.

About running Spark shuffle. Did you miss link to: 
https://issues.apache.org/jira/browse/SPARK-9439 ?
Problem started with Spark 1.6.0, because it's first version of Spark where 
Spark Shuffle has Jackson dependency


> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you see 
> a stack trace in the NM logs indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12843) Spark should avoid scanning all partitions when limit is set

2016-01-15 Thread JIRA
Maciej Bryński created SPARK-12843:
--

 Summary: Spark should avoid scanning all partitions when limit is 
set
 Key: SPARK-12843
 URL: https://issues.apache.org/jira/browse/SPARK-12843
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Maciej Bryński


SQL Query:
{code}
select * from table limit 100
{code}
forces Spark to scan all partitions even when enough data is available at the 
beginning of the scan.

Is it related to: [SPARK-9850] ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12843) Spark should avoid scanning all partitions when limit is set

2016-01-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-12843:
---
Description: 
SQL Query:
{code}
select * from table limit 100
{code}
forces Spark to scan all partitions even when enough data is available at the 
beginning of the scan.

This behaviour should be avoided and the scan should stop when enough data has 
been collected.

Is it related to: [SPARK-9850] ?

  was:
SQL Query:
{code}
select * from table limit 100
{code}
force Spark to scan all partition even when data are available on the beginning 
of scan.

Is it related to: [SPARK-9850] ?


> Spark should avoid scanning all partitions when limit is set
> 
>
> Key: SPARK-12843
> URL: https://issues.apache.org/jira/browse/SPARK-12843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> SQL Query:
> {code}
> select * from table limit 100
> {code}
> forces Spark to scan all partitions even when enough data is available at the 
> beginning of the scan.
> This behaviour should be avoided and the scan should stop when enough data has 
> been collected.
> Is it related to: [SPARK-9850] ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12701) Logging FileAppender should use join to ensure thread is finished

2016-01-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12701:
--
Fix Version/s: 1.6.1

> Logging FileAppender should use join to ensure thread is finished
> -
>
> Key: SPARK-12701
> URL: https://issues.apache.org/jira/browse/SPARK-12701
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
>
> Currently, FileAppender for logging uses wait/notifyAll to signal that the 
> writing thread has finished.  While I was trying to write a regression test 
> for a fix of SPARK-9844, the writing thread was not able to fully complete 
> before the process was shutdown, despite calling 
> {{FileAppender.awaitTermination}}.  Using join ensures the thread completes 
> and would simplify things a little more.
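
For reference, a minimal Scala sketch of the join-based shutdown pattern the 
description argues for ({{SimpleAppender}} and its {{writeOnce}} callback are 
hypothetical names, not the actual {{FileAppender}} code):

{code}
class SimpleAppender(writeOnce: () => Unit) {
  @volatile private var stopped = false

  private val writingThread = new Thread("appender-writing-thread") {
    override def run(): Unit = {
      while (!stopped) {
        writeOnce()
      }
    }
  }
  writingThread.setDaemon(true)
  writingThread.start()

  /** Ask the writing thread to stop and block until it has fully finished. */
  def awaitTermination(): Unit = {
    stopped = true
    // join() guarantees the thread has completed, with no wait/notifyAll
    // handshake that could return before the final writes are flushed.
    writingThread.join()
  }
}
{code}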



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10985) Avoid passing evicted blocks throughout BlockManager / CacheManager

2016-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102477#comment-15102477
 ] 

Apache Spark commented on SPARK-10985:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10776

> Avoid passing evicted blocks throughout BlockManager / CacheManager
> ---
>
> Key: SPARK-10985
> URL: https://issues.apache.org/jira/browse/SPARK-10985
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Reporter: Andrew Or
>Priority: Minor
>
> This is a minor refactoring task.
> Currently when we attempt to put a block in, we get back an array buffer of 
> blocks that are dropped in the process. We do this to propagate these blocks 
> back to our TaskContext, which will add them to its TaskMetrics so we can see 
> them in the SparkUI storage tab properly.
> Now that we have TaskContext.get, we can just use that to propagate this 
> information. This simplifies a lot of the signatures and gets rid of weird 
> return types like the following everywhere:
> {code}
> ArrayBuffer[(BlockId, BlockStatus)]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10985) Avoid passing evicted blocks throughout BlockManager / CacheManager

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10985:


Assignee: (was: Apache Spark)

> Avoid passing evicted blocks throughout BlockManager / CacheManager
> ---
>
> Key: SPARK-10985
> URL: https://issues.apache.org/jira/browse/SPARK-10985
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Reporter: Andrew Or
>Priority: Minor
>
> This is a minor refactoring task.
> Currently when we attempt to put a block in, we get back an array buffer of 
> blocks that are dropped in the process. We do this to propagate these blocks 
> back to our TaskContext, which will add them to its TaskMetrics so we can see 
> them in the SparkUI storage tab properly.
> Now that we have TaskContext.get, we can just use that to propagate this 
> information. This simplifies a lot of the signatures and gets rid of weird 
> return types like the following everywhere:
> {code}
> ArrayBuffer[(BlockId, BlockStatus)]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10985) Avoid passing evicted blocks throughout BlockManager / CacheManager

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10985:


Assignee: Apache Spark

> Avoid passing evicted blocks throughout BlockManager / CacheManager
> ---
>
> Key: SPARK-10985
> URL: https://issues.apache.org/jira/browse/SPARK-10985
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Minor
>
> This is a minor refactoring task.
> Currently when we attempt to put a block in, we get back an array buffer of 
> blocks that are dropped in the process. We do this to propagate these blocks 
> back to our TaskContext, which will add them to its TaskMetrics so we can see 
> them in the SparkUI storage tab properly.
> Now that we have TaskContext.get, we can just use that to propagate this 
> information. This simplifies a lot of the signatures and gets rid of weird 
> return types like the following everywhere:
> {code}
> ArrayBuffer[(BlockId, BlockStatus)]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12783) Dataset map serialization error

2016-01-15 Thread Muthu Jayakumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102482#comment-15102482
 ] 

Muthu Jayakumar commented on SPARK-12783:
-

Hello Kevin,

Here is what I am seeing...

from shell:
{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.

scala> case class MyMap(map: Map[String, String])
defined class MyMap

scala> :paste
// Entering paste mode (ctrl-D to finish)

case class TestCaseClass(a: String, b: String){
  def toMyMap: MyMap = {
MyMap(Map(a->b))
  }

  def toStr: String = {
a
  }
}

// Exiting paste mode, now interpreting.

defined class TestCaseClass

scala> TestCaseClass("a", "nn")
res4: TestCaseClass = TestCaseClass(a,nn)

scala>   import sqlContext.implicits._
import sqlContext.implicits._

scala> val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", 
"data1"), TestCaseClass("2015-05-01", "data2"))).toDF()
org.apache.spark.sql.AnalysisException: Unable to generate an encoder for inner 
class `TestCaseClass` without access to the scope that this class was defined 
in. Try moving this class out of its parent class.;
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$2.applyOrElse(ExpressionEncoder.scala:264)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$2.applyOrElse(ExpressionEncoder.scala:260)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:243)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:243)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:242)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:233)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.resolve(ExpressionEncoder.scala:260)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:78)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:89)
  at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:507)
  ... 52 elided
{code}

I do remember seeing the above error stack if the case class was defined 
inside the scope of an object (for example, if defined inside MyApp like in the 
example below, as it then becomes an inner class).
From code, I added an explicit import and eventually changed to use fully 
qualified class names like below...

{code}
import scala.collection.{Map => ImMap}

case class MyMap(map: ImMap[String, String])

case class TestCaseClass(a: String, b: String){
  def toMyMap: MyMap = {
MyMap(ImMap(a->b))
  }

  def toStr: String = {
a
  }
}

object MyApp extends App { 
 //Get handle to contexts...
 import sqlContext.implicits._
  val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), 
TestCaseClass("2015-05-01", "data2"))).toDF()
  df1.as[TestCaseClass].map(_.toStr).show() //works fine
  df1.as[TestCaseClass].map(_.toMyMap).show() //error
}

{code}

and

{code}
case class MyMap(map: scala.collection.Map[String, String])

case class TestCaseClass(a: String, b: String){
  def toMyMap: MyMap = {
MyMap(scala.collection.Map(a->b))
  }

  def toStr: String = {
a
  }
}

object MyApp extends App { 
 //Get handle to contexts...
 import sqlContext.implicits._
  val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), 
TestCaseClass("2015-05-01", "data2"))).toDF()
  df1.as[TestCaseClass].map(_.toStr).show() //works fine
  df1.as[TestCaseClass].map(_.toMyMap).show() //error
}

{code}

Please advise on what I may be missing. I misread the earlier comment and tried 
to use the immutable map incorrectly :(.

> Dataset map serialization error
> ---
>
> Key: SPARK-12783
> URL: https://issues.apache.org/jira/browse/SPARK-12783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Muthu Jayakumar
>Assignee: Wenchen Fan
>Priority: Critical
>
> When Dataset API is used to map to another case class, an error is thrown.
> {code}
> case class MyMap(map: Map[String, String])
> case class TestCaseClass(a: String, b: String){
>   def toMyMap: MyMap = {
> MyMap(Map(a->b))
>   }
>   def toStr: String = {
> a
>   }
> }
> //Main method section below
> import sqlContext.implicits._
> val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), 
> TestCaseClass("2015-05-01", "data2"))).toDF()
> df1.as[TestCaseClass].map(_.toStr).show() //works fine
> df1.as[TestCaseClass].map(_.toMyMap).show() //fails
> {code}
> Error message:
> {quote}
> Caused by: 

[jira] [Comment Edited] (SPARK-12783) Dataset map serialization error

2016-01-15 Thread Muthu Jayakumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102351#comment-15102351
 ] 

Muthu Jayakumar edited comment on SPARK-12783 at 1/15/16 9:09 PM:
--

I tried the following, but got a similar error...

{code}
case class MyMap(map: scala.collection.immutable.Map[String, String])

case class TestCaseClass(a: String, b: String){
  def toMyMap: MyMap = {
MyMap(Map(a->b))
  }


  def toStr: String = {
a
  }
}

//main thread...
val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), 
TestCaseClass("2015-05-01", "data2"))).toDF() 
  df1.as[TestCaseClass].map(_.toStr).show() //works fine
  df1.as[TestCaseClass].map(_.toMyMap).show() //error
  df1.as[TestCaseClass].map(each=> each.a -> each.b).show() //works fine
{code}

{quote}
Serialization stack:
- object not serializable (class: 
scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1, value: 
package lang)
- field (class: scala.reflect.internal.Types$ThisType, name: sym, type: 
class scala.reflect.internal.Symbols$Symbol)
- object (class scala.reflect.internal.Types$UniqueThisType, 
java.lang.type)
- field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: 
class scala.reflect.internal.Types$Type)
- object (class scala.reflect.internal.Types$ClassNoArgsTypeRef, String)
- field (class: scala.reflect.internal.Types$TypeRef, name: normalized, 
type: class scala.reflect.internal.Types$Type)
- object (class scala.reflect.internal.Types$AliasNoArgsTypeRef, String)
- field (class: 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, name: keyType$1, 
type: class scala.reflect.api.Types$TypeApi)
- object (class 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, )
- field (class: org.apache.spark.sql.catalyst.expressions.MapObjects, 
name: function, type: interface scala.Function1)
- object (class org.apache.spark.sql.catalyst.expressions.MapObjects, 
mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- 
field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
"collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType))
- field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name: 
targetObject, type: class org.apache.spark.sql.catalyst.expressions.Expression)
- object (class org.apache.spark.sql.catalyst.expressions.Invoke, 
invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
 field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
"collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
 [Ljava.lang.Object;)))
- writeObject data (class: 
scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.List$SerializationProxy, 
scala.collection.immutable.List$SerializationProxy@2660f093)
- writeReplace data (class: 
scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.$colon$colon, 
List(invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
 field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
"collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
 [Ljava.lang.Object;)), 
invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
 field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
"collector.MyMap"),valueArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
 [Ljava.lang.Object;
- field (class: org.apache.spark.sql.catalyst.expressions.StaticInvoke, 
name: arguments, type: interface scala.collection.Seq)
- object (class org.apache.spark.sql.catalyst.expressions.StaticInvoke, 
staticinvoke(class 
org.apache.spark.sql.catalyst.util.ArrayBasedMapData$,ObjectType(interface 
scala.collection.Map),toScalaMap,invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
 field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
"collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
 
[Ljava.lang.Object;)),invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
 field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
"collector.MyMap"),valueArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
 [Ljava.lang.Object;)),true))
- writeObject data (class: 
scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.List$SerializationProxy, 
scala.collection.immutable.List$SerializationProxy@72af5ac7)
- writeReplace data (class: 
scala.collection.immutable.List$SerializationProxy)
- object (class 

[jira] [Commented] (SPARK-12624) When schema is specified, we should treat undeclared fields as null (in Python)

2016-01-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102473#comment-15102473
 ] 

Maciej Bryński commented on SPARK-12624:


[~davies]
Isn't it related to my comment here:
https://issues.apache.org/jira/browse/SPARK-11437?focusedCommentId=15074733=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15074733

> When schema is specified, we should treat undeclared fields as null (in 
> Python)
> ---
>
> Key: SPARK-12624
> URL: https://issues.apache.org/jira/browse/SPARK-12624
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> See https://github.com/apache/spark/pull/10564
> Basically that test case should pass without the above fix and just assume b 
> is null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12624) When schema is specified, we should treat undeclared fields as null (in Python)

2016-01-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102473#comment-15102473
 ] 

Maciej Bryński edited comment on SPARK-12624 at 1/15/16 9:17 PM:
-

[~davies]
Isn't it related to my comment here:
https://issues.apache.org/jira/browse/SPARK-11437?focusedCommentId=15068627=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15068627


was (Author: maver1ck):
[~davies]
Isn't related to my comment here:
https://issues.apache.org/jira/browse/SPARK-11437?focusedCommentId=15074733=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15074733

> When schema is specified, we should treat undeclared fields as null (in 
> Python)
> ---
>
> Key: SPARK-12624
> URL: https://issues.apache.org/jira/browse/SPARK-12624
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> See https://github.com/apache/spark/pull/10564
> Basically that test case should pass without the above fix and just assume b 
> is null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12835) StackOverflowError when aggregating over column from window function

2016-01-15 Thread Kalle Jepsen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102491#comment-15102491
 ] 

Kalle Jepsen commented on SPARK-12835:
--

The [traceback|http://pastebin.com/pRRCAben] really is ridiculously long.

In my actual application I would have the window partitioned and the 
aggregation done in {{df.groupby(key).agg(avg_diff)}}. Would that still be 
problematic with regard to performance? The error is the same there though, 
that's why I've chosen the more concise minimal example above.

> StackOverflowError when aggregating over column from window function
> 
>
> Key: SPARK-12835
> URL: https://issues.apache.org/jira/browse/SPARK-12835
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Kalle Jepsen
>
> I am encountering a StackOverflowError with a very long traceback when I try 
> to aggregate directly on a column created by a window function.
> E.g. I am trying to determine the average timespan between dates in a 
> DataFrame column by using a window function:
> {code}
> from pyspark import SparkContext
> from pyspark.sql import HiveContext, Window, functions
> from datetime import datetime
> sc = SparkContext()
> sq = HiveContext(sc)
> data = [
> [datetime(2014,1,1)],
> [datetime(2014,2,1)],
> [datetime(2014,3,1)],
> [datetime(2014,3,6)],
> [datetime(2014,8,23)],
> [datetime(2014,10,1)],
> ]
> df = sq.createDataFrame(data, schema=['ts'])
> ts = functions.col('ts')
>
> w = Window.orderBy(ts)
> diff = functions.datediff(
> ts,
> functions.lag(ts, count=1).over(w)
> )
> avg_diff = functions.avg(diff)
> {code}
> While {{df.select(diff.alias('diff')).show()}} correctly renders as
> {noformat}
> +----+
> |diff|
> +----+
> |null|
> |  31|
> |  28|
> |   5|
> | 170|
> |  39|
> +----+
> {noformat}
> doing {code}
> df.select(avg_diff).show()
> {code} throws a {{java.lang.StackOverflowError}}.
> When I say
> {code}
> df2 = df.select(diff.alias('diff'))
> df2.select(functions.avg('diff'))
> {code}
> however, there's no error.
> Am I wrong to assume that the above should work?
> I've already described the same in [this question on 
> stackoverflow.com|http://stackoverflow.com/questions/34793999/averaging-over-window-function-leads-to-stackoverflowerror].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12844) Spark documentation should be more precise about the algebraic properties of functions in various transformations

2016-01-15 Thread Jimmy Lin (JIRA)
Jimmy Lin created SPARK-12844:
-

 Summary: Spark documentation should be more precise about the 
algebraic properties of functions in various transformations
 Key: SPARK-12844
 URL: https://issues.apache.org/jira/browse/SPARK-12844
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Jimmy Lin
Priority: Minor


Spark documentation should be more precise about the algebraic properties of 
functions in various transformations. The way the current documentation is 
written is potentially confusing. For example, in Spark 1.6, the scaladoc for 
reduce in RDD says:

> Reduces the elements of this RDD using the specified commutative and 
> associative binary operator.

This is precise and accurate. In the documentation of reduceByKey in 
PairRDDFunctions, on the other hand, it says:

> Merge the values for each key using an associative reduce function.

To be more precise, this function must also be commutative in order for the 
computation to be correct. Writing commutative for reduce and not reduceByKey 
gives the false impression that the function in the latter does not need to be 
commutative.

The same applies to aggregateByKey. To be precise, both seqOp and combOp need 
to be associative (mentioned) AND commutative (not mentioned) in order for the 
computation to be correct. It would be desirable to fix these inconsistencies 
throughout the documentation.
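
To make the point concrete, here is a small self-contained Scala sketch (the 
key and values are made up for illustration) contrasting an associative but 
non-commutative function, whose reduceByKey result can depend on the order in 
which shuffle outputs are merged, with a commutative and associative one:

{code}
import org.apache.spark.{SparkConf, SparkContext}

object CommutativityExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("commutativity-example").setMaster("local[4]"))

    // Four values for the same key, spread over four partitions.
    val pairs = sc.parallelize(Seq(("k", "a"), ("k", "b"), ("k", "c"), ("k", "d")), 4)

    // String concatenation is associative but NOT commutative: the per-partition
    // results are merged in whatever order they arrive, so the concatenation
    // order (and hence the result) is not guaranteed across runs.
    println(pairs.reduceByKey(_ + _).collect().toSeq)

    // Addition is associative AND commutative, so the result is always ("k", 4).
    println(pairs.mapValues(_ => 1).reduceByKey(_ + _).collect().toSeq)

    sc.stop()
  }
}
{code}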





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12845) During join Spark should pushdown predicates to both tables

2016-01-15 Thread JIRA
Maciej Bryński created SPARK-12845:
--

 Summary: During join Spark should pushdown predicates to both 
tables
 Key: SPARK-12845
 URL: https://issues.apache.org/jira/browse/SPARK-12845
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Maciej Bryński


I have the following issue.
I'm joining two tables with a where condition:
{code}
select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234
{code}
In this query the predicate is only pushed down to t1.
To get predicates on both tables I have to run the following query, which makes 
no sense:
{code}
select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234 and t2.id2 = 1234
{code}

Spark should present the same behaviour for both queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12845) During join Spark should pushdown predicates to both tables

2016-01-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-12845:
---
Description: 
I have the following issue.
I'm joining two tables with a where condition:
{code}
select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234
{code}
In this query the predicate is only pushed down to t1.
To get predicates on both tables I have to run the following query:
{code}
select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234 and t2.id2 = 1234
{code}

Spark should present the same behaviour for both queries.

  was:
I have following issue.
I'm connecting two tables with where condition
{code}
select * from t1 join t2 in t1.id1 = t2.id2 where t1.id = 1234
{code}
In this code predicate is only push down to t1.
To have predicates on both table I should run following query which have no 
sense
{code}
select * from t1 join t2 in t1.id1 = t2.id2 where t1.id = 1234 and t2.id2 = 1234
{code}

Spark should present same behaviour for both queries.


> During join Spark should pushdown predicates to both tables
> ---
>
> Key: SPARK-12845
> URL: https://issues.apache.org/jira/browse/SPARK-12845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> I have the following issue.
> I'm joining two tables with a where condition:
> {code}
> select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234
> {code}
> In this query the predicate is only pushed down to t1.
> To get predicates on both tables I have to run the following query:
> {code}
> select * from t1 join t2 on t1.id1 = t2.id2 where t1.id = 1234 and t2.id2 = 
> 1234
> {code}
> Spark should present the same behaviour for both queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12843) Spark should avoid scanning all partitions when limit is set

2016-01-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-12843:
---
Issue Type: Bug  (was: Improvement)

> Spark should avoid scanning all partitions when limit is set
> 
>
> Key: SPARK-12843
> URL: https://issues.apache.org/jira/browse/SPARK-12843
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> SQL Query:
> {code}
> select * from table limit 100
> {code}
> forces Spark to scan all partitions even when enough data is available at the 
> beginning of the scan.
> This behaviour should be avoided and the scan should stop when enough data has 
> been collected.
> Is it related to: [SPARK-9850] ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data

2016-01-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-12030:
---
Attachment: (was: t1.tar.gz)

> Incorrect results when aggregate joined data
> 
>
> Key: SPARK-12030
> URL: https://issues.apache.org/jira/browse/SPARK-12030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Assignee: Nong Li
>Priority: Blocker
> Fix For: 1.5.3, 1.6.0
>
> Attachments: spark.jpg, t2.tar.gz
>
>
> I have the following issue.
> I created 2 DataFrames from JDBC (MySQL) and joined them (t1 has a fk1 to t2):
> {code}
> t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
> t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache()
> joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
> {code}
> Important: both tables are cached, so results should be the same on every 
> query.
> Then I did some counts:
> {code}
> t1.count() -> 5900729
> t1.registerTempTable("t1")
> sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
> t2.count() -> 54298
> joined.count() -> 5900729
> {code}
> And here the magic begins: I counted the distinct id1 values from the joined table
> {code}
> joined.registerTempTable("joined")
> sqlCtx.sql("select distinct(id1) from joined").count()
> {code}
> Results vary *(they are different on every run)* between 5899000 and 
> 590 but are never equal to 5900729.
> In addition, I did more queries:
> {code}
> sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 
> 1").collect() 
> {code}
> This gives some results, but this query returns *1*
> {code}
> len(sqlCtx.sql("select * from joined where id1 = result").collect())
> {code}
> What's wrong ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data

2016-01-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-12030:
---
Attachment: (was: spark.jpg)

> Incorrect results when aggregate joined data
> 
>
> Key: SPARK-12030
> URL: https://issues.apache.org/jira/browse/SPARK-12030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Assignee: Nong Li
>Priority: Blocker
> Fix For: 1.5.3, 1.6.0
>
>
> I have the following issue.
> I created 2 DataFrames from JDBC (MySQL) and joined them (t1 has a fk1 to t2):
> {code}
> t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
> t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache()
> joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
> {code}
> Important: both tables are cached, so results should be the same on every 
> query.
> Then I did some counts:
> {code}
> t1.count() -> 5900729
> t1.registerTempTable("t1")
> sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
> t2.count() -> 54298
> joined.count() -> 5900729
> {code}
> And here the magic begins: I counted the distinct id1 values from the joined table
> {code}
> joined.registerTempTable("joined")
> sqlCtx.sql("select distinct(id1) from joined").count()
> {code}
> Results vary *(they are different on every run)* between 5899000 and 
> 590 but are never equal to 5900729.
> In addition, I did more queries:
> {code}
> sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 
> 1").collect() 
> {code}
> This gives some results, but this query returns *1*
> {code}
> len(sqlCtx.sql("select * from joined where id1 = result").collect())
> {code}
> What's wrong ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data

2016-01-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-12030:
---
Attachment: (was: t2.tar.gz)

> Incorrect results when aggregate joined data
> 
>
> Key: SPARK-12030
> URL: https://issues.apache.org/jira/browse/SPARK-12030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Assignee: Nong Li
>Priority: Blocker
> Fix For: 1.5.3, 1.6.0
>
>
> I have the following issue.
> I created 2 DataFrames from JDBC (MySQL) and joined them (t1 has a fk1 to t2):
> {code}
> t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
> t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache()
> joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
> {code}
> Important: both tables are cached, so results should be the same on every 
> query.
> Then I did some counts:
> {code}
> t1.count() -> 5900729
> t1.registerTempTable("t1")
> sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
> t2.count() -> 54298
> joined.count() -> 5900729
> {code}
> And here the magic begins: I counted the distinct id1 values from the joined table
> {code}
> joined.registerTempTable("joined")
> sqlCtx.sql("select distinct(id1) from joined").count()
> {code}
> Results vary *(they are different on every run)* between 5899000 and 
> 590 but are never equal to 5900729.
> In addition, I did more queries:
> {code}
> sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 
> 1").collect() 
> {code}
> This gives some results, but this query returns *1*
> {code}
> len(sqlCtx.sql("select * from joined where id1 = result").collect())
> {code}
> What's wrong ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12835) StackOverflowError when aggregating over column from window function

2016-01-15 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102532#comment-15102532
 ] 

Herman van Hovell commented on SPARK-12835:
---

Thanks for that.

The {{df.groupby(key).agg(avg_diff)}} is problematic. The lag window function 
doesn't have any partitioning defined, so it will move all data to a single 
thread on a single node. The {{diff}} values can also be based on dates from 
different keys.
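
For reference, a minimal Scala sketch of what a partitioned window looks like 
(the {{key}} column and sample data are hypothetical), reusing the 
select-then-aggregate workaround from the earlier comment on this ticket:

{code}
// In spark-shell (1.6.x)
import java.sql.Date
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, datediff, lag}
import sqlContext.implicits._

val df = Seq(
  ("a", Date.valueOf("2014-01-01")),
  ("a", Date.valueOf("2014-02-01")),
  ("b", Date.valueOf("2014-03-01")),
  ("b", Date.valueOf("2014-03-06"))).toDF("key", "ts")

// Partitioning the window keeps each key's rows together instead of forcing a
// global ordering onto a single partition.
val w = Window.partitionBy($"key").orderBy($"ts")

// Compute the per-row difference first, then aggregate in a second select.
val diffs = df.select($"key", datediff($"ts", lag($"ts", 1).over(w)).as("diff"))
diffs.groupBy($"key").agg(avg($"diff")).show()
{code}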

> StackOverflowError when aggregating over column from window function
> 
>
> Key: SPARK-12835
> URL: https://issues.apache.org/jira/browse/SPARK-12835
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Kalle Jepsen
>
> I am encountering a StackOverflowError with a very long traceback when I try 
> to aggregate directly on a column created by a window function.
> E.g. I am trying to determine the average timespan between dates in a 
> DataFrame column by using a window function:
> {code}
> from pyspark import SparkContext
> from pyspark.sql import HiveContext, Window, functions
> from datetime import datetime
> sc = SparkContext()
> sq = HiveContext(sc)
> data = [
> [datetime(2014,1,1)],
> [datetime(2014,2,1)],
> [datetime(2014,3,1)],
> [datetime(2014,3,6)],
> [datetime(2014,8,23)],
> [datetime(2014,10,1)],
> ]
> df = sq.createDataFrame(data, schema=['ts'])
> ts = functions.col('ts')
>
> w = Window.orderBy(ts)
> diff = functions.datediff(
> ts,
> functions.lag(ts, count=1).over(w)
> )
> avg_diff = functions.avg(diff)
> {code}
> While {{df.select(diff.alias('diff')).show()}} correctly renders as
> {noformat}
> +----+
> |diff|
> +----+
> |null|
> |  31|
> |  28|
> |   5|
> | 170|
> |  39|
> +----+
> {noformat}
> doing {code}
> df.select(avg_diff).show()
> {code} throws a {{java.lang.StackOverflowError}}.
> When I say
> {code}
> df2 = df.select(diff.alias('diff'))
> df2.select(functions.avg('diff'))
> {code}
> however, there's no error.
> Am I wrong to assume that the above should work?
> I've already described the same in [this question on 
> stackoverflow.com|http://stackoverflow.com/questions/34793999/averaging-over-window-function-leads-to-stackoverflowerror].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12149) Executor UI improvement suggestions - Color UI

2016-01-15 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-12149:
--
Assignee: Alex Bozarth

> Executor UI improvement suggestions - Color UI
> --
>
> Key: SPARK-12149
> URL: https://issues.apache.org/jira/browse/SPARK-12149
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Alex Bozarth
>Assignee: Alex Bozarth
>
> Splitting off the Color UI portion of the parent UI improvements task, 
> description copied below:
> Fill some of the cells with color in order to make it easier to absorb the 
> info, e.g.
> RED if Failed Tasks greater than 0 (maybe the more failed, the more intense 
> the red)
> GREEN if Active Tasks greater than 0 (maybe more intense the larger the 
> number)
> Possibly color code COMPLETE TASKS using various shades of blue (e.g., based 
> on the log(# completed)).
> If dark blue, then write the value in white (same for the RED and GREEN above).
> Merging another idea from SPARK-2132: 
> Color GC time red when over a percentage of task time



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12716) Executor UI improvement suggestions - Totals

2016-01-15 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-12716.
---
   Resolution: Fixed
 Assignee: Alex Bozarth
Fix Version/s: 2.0.0

> Executor UI improvement suggestions - Totals
> 
>
> Key: SPARK-12716
> URL: https://issues.apache.org/jira/browse/SPARK-12716
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Alex Bozarth
>Assignee: Alex Bozarth
> Fix For: 2.0.0
>
>
> Splitting off the Totals portion of the parent UI improvements task, 
> description copied below:
> I received some suggestions from a user for the /executors UI page to make it 
> more helpful. This gets more important when you have a really large number of 
> executors.
> ...
> Report the TOTALS in each column (do this at the TOP so no need to scroll to 
> the bottom, or print both at top and bottom).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11925) Add PySpark missing methods for ml.feature during Spark 1.6 QA

2016-01-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-11925.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 9908
[https://github.com/apache/spark/pull/9908]

> Add PySpark missing methods for ml.feature during Spark 1.6 QA
> --
>
> Key: SPARK-11925
> URL: https://issues.apache.org/jira/browse/SPARK-11925
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> Add missing PySpark methods and params for ml.feature:
> * RegexTokenizer should support setting toLowercase.
> * MinMaxScalerModel should support outputting originalMin and originalMax.
> * PCAModel should support outputting pc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12847) Remove StreamingListenerBus and post all Streaming events to the same thread as Spark events

2016-01-15 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102740#comment-15102740
 ] 

Shixiong Zhu commented on SPARK-12847:
--

Ah, I think this one should be a sub-task. Let me change it.

> Remove StreamingListenerBus and post all Streaming events to the same thread 
> as Spark events
> 
>
> Key: SPARK-12847
> URL: https://issues.apache.org/jira/browse/SPARK-12847
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> SparkListener.onOtherEvent was added in 
> https://github.com/apache/spark/pull/10061. SQLListener uses it to dispatch 
> SQL-specific events instead of creating a new, separate listener bus.
> Streaming can use a similar approach to eliminate the StreamingListenerBus. 
> Right now, the nondeterministic message ordering across two listener buses is 
> really tricky when someone implements both SparkListener and 
> StreamingListener. If we can use only one listener bus in Spark, the 
> nondeterministic ordering is eliminated and we can also remove a lot of code.
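
For context, a minimal Scala sketch of the dispatch mechanism the description 
refers to ({{MyStreamingEvent}} and {{MyListener}} are made-up names, not 
proposed classes):

{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

// Custom events posted to the main listener bus arrive through
// SparkListener.onOtherEvent and can be recovered with a type match,
// the way SQLListener handles its SQL events.
case class MyStreamingEvent(batchTime: Long) extends SparkListenerEvent

class MyListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case MyStreamingEvent(batchTime) =>
      println(s"received streaming event for batch $batchTime")
    case _ => // ignore events this listener does not care about
  }
}
{code}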



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12848) Parse number as decimal

2016-01-15 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102763#comment-15102763
 ] 

Herman van Hovell commented on SPARK-12848:
---

Assuming that we are talking about literals here, it is quite easy to change 
the parsing defaults for that.

The way it is currently done is that when we find a decimal number, {{1.23}} 
for example, we will convert it into a Double (always). When a user needs a 
Decimal, he (or she) can use a BigDecimal literal for this by tagging the 
number with {{BD}}.

[~davies] I might not be getting the point you are making, but I think we have 
covered this by using BigDecimal literals. Could you provide an example 
otherwise?
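
For illustration, a tiny Scala sketch of the distinction described above 
(assuming a spark-shell session; whether the {{BD}} suffix is accepted depends 
on the parser version in the branch under discussion):

{code}
// A plain decimal literal is parsed as a double, while the BD-suffixed
// literal asks for a BigDecimal/decimal value.
sqlContext.sql("SELECT 1.23 AS as_double, 1.23BD AS as_decimal").printSchema()
{code}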

> Parse number as decimal
> ---
>
> Key: SPARK-12848
> URL: https://issues.apache.org/jira/browse/SPARK-12848
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>
> Right now, the Hive parser parses 1.23 as a double; when it's used with 
> decimal columns, the decimal is turned into a double and precision is lost.
> We should follow what most databases do and parse 1.23 as a decimal; it will 
> be converted into a double when used with doubles.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12840) Support pass any object into codegen as reference

2016-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102629#comment-15102629
 ] 

Apache Spark commented on SPARK-12840:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/10777

> Support pass any object into codegen as reference
> -
>
> Key: SPARK-12840
> URL: https://issues.apache.org/jira/browse/SPARK-12840
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, we only support expression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12840) Support pass any object into codegen as reference

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12840:


Assignee: Apache Spark  (was: Davies Liu)

> Support pass any object into codegen as reference
> -
>
> Key: SPARK-12840
> URL: https://issues.apache.org/jira/browse/SPARK-12840
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> Right now, we only support expression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12840) Support pass any object into codegen as reference

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12840:


Assignee: Davies Liu  (was: Apache Spark)

> Support pass any object into codegen as reference
> -
>
> Key: SPARK-12840
> URL: https://issues.apache.org/jira/browse/SPARK-12840
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, we only support expression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102654#comment-15102654
 ] 

Sean Owen commented on SPARK-12807:
---

I see, it's only the shuffle, only 1.6, and it only happens to affect the 
shuffle service on YARN. Spark has otherwise been using a later Jackson version 
for a while. Shading is indeed probably the best thing for all of Spark's usages.

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you see 
> a stack trace in the NM logs indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12840) Support passing arbitrary objects (not just expressions) into code generated classes

2016-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12840:

Description: As of now, our code generator only allows passing Expression 
objects into the generated class as arguments. In order to support whole-stage 
codegen (e.g. for broadcast joins), the generated classes need to accept other 
types of objects such as hash tables.  (was: Right now, we only support 
expression.)

> Support passing arbitrary objects (not just expressions) into code generated 
> classes
> 
>
> Key: SPARK-12840
> URL: https://issues.apache.org/jira/browse/SPARK-12840
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> As of now, our code generator only allows passing Expression objects into the 
> generated class as arguments. In order to support whole-stage codegen (e.g. 
> for broadcast joins), the generated classes need to accept other types of 
> objects such as hash tables.
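
For concreteness, a small self-contained sketch of the calling convention being 
asked for. This is plain Scala rather than Spark's actual codegen machinery, 
and every name in it (HashedRelation, GeneratedProbe, and so on) is made up for 
illustration: the generated class receives an opaque Array[AnyRef] of 
references and casts each entry back at its use site, so a hash table can be 
threaded through the same way expressions are today.

{code}
// Toy illustration only; not Spark's real codegen classes.
object ReferencePassingSketch {
  // Stand-in for a non-expression object the generated code needs, e.g. a
  // broadcast hash table.
  final class HashedRelation(rows: Map[Int, String]) {
    def get(key: Int): Option[String] = rows.get(key)
  }

  // Stand-in for the class the code generator would emit and compile. In real
  // codegen this body would be generated Java source; here it is hand-written
  // just to show the constructor contract.
  final class GeneratedProbe(references: Array[AnyRef]) {
    // Reference 0 was registered as the hash table when the code was generated.
    private val relation = references(0).asInstanceOf[HashedRelation]
    def probe(key: Int): String = relation.get(key).getOrElse("<no match>")
  }

  def main(args: Array[String]): Unit = {
    val table = new HashedRelation(Map(1 -> "a", 2 -> "b"))
    // The caller collects every registered object into one array and hands it
    // to the generated class, mirroring how expressions are passed today.
    val generated = new GeneratedProbe(Array[AnyRef](table))
    println(generated.probe(1))  // a
    println(generated.probe(42)) // <no match>
  }
}
{code}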



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102883#comment-15102883
 ] 

Steve Loughran commented on SPARK-12807:


There's a PR to shade Jackson in trunk; I'm going to do a 1.6 PR too, which 
should be identical (initially, for ease of testing that the 1.6 branch is 
fixed).

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you 
> see a stack trace in the NM logs indicating a Jackson 2.x version mismatch.
> (Reported on the Spark dev list.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12704) we may repartition a relation even if it's not needed

2016-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-12704.
---
Resolution: Later

Closing as later. We will revisit this when the time comes.


> we may repartition a relation even if it's not needed
> --
>
> Key: SPARK-12704
> URL: https://issues.apache.org/jira/browse/SPARK-12704
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> The implementation of {{HashPartitioning.compatibleWith}} has been 
> sub-optimal for a while. Consider the following case:
> If {{table_a}} is hash partitioned by int column `i`, and {{table_b}} is also 
> partitioned by int column `i`, these two partitionings are logically 
> compatible. However, {{HashPartitioning.compatibleWith}} returns false for 
> this case because the {{AttributeReference}}s of column `i` in the two tables 
> have different expr ids.
> With this wrong result from {{HashPartitioning.compatibleWith}}, we go into 
> [this 
> branch|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala#L390]
>  and may add an unnecessary shuffle.
> This does not affect correctness if the join keys are exactly the same as the 
> hash partitioning keys, because there is still an opportunity to not 
> partition that child in that branch: 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala#L428
> However, if the join keys are a superset of the hash partitioning keys, for 
> example when {{table_a}} and {{table_b}} are both hash partitioned by column 
> `i` and we want to join them on columns `i, j`, we logically do not need a 
> shuffle, but in fact the two tables, which start out partitioned only by `i`, 
> are redundantly repartitioned by `i, j`.
> A quick fix is to set the expr id of {{AttributeReference}} to 0 before 
> calling {{this.semanticEquals(o)}} in {{HashPartitioning.compatibleWith}}. In 
> the long term, though, I think we need a better design than the 
> `compatibleWith`, `guarantees`, and `satisfies` mechanism, as it is quite 
> complex.
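
To make the expr-id problem and the proposed quick fix concrete, here is a 
self-contained toy model in plain Scala. These are not Spark's real classes; 
the case classes and the numbers below are made up for illustration only.

{code}
// Toy model only; not Spark's real HashPartitioning/AttributeReference.
object CompatibleWithSketch {
  final case class AttributeReference(name: String, exprId: Long)

  final case class HashPartitioning(expressions: Seq[AttributeReference], numPartitions: Int) {
    // Naive check: compatible only if the attribute lists match exactly,
    // expr ids included, which is what makes two logically identical
    // partitionings look different.
    def compatibleWith(other: HashPartitioning): Boolean =
      numPartitions == other.numPartitions && expressions == other.expressions

    // The "quick fix" from the description: zero out expr ids before
    // comparing, so attributes are compared by name only.
    def compatibleWithIgnoringExprId(other: HashPartitioning): Boolean =
      numPartitions == other.numPartitions &&
        expressions.map(_.copy(exprId = 0)) == other.expressions.map(_.copy(exprId = 0))
  }

  def main(args: Array[String]): Unit = {
    val aI = AttributeReference("i", exprId = 1) // table_a's column i
    val bI = AttributeReference("i", exprId = 2) // table_b's column i, logically the same
    val pa = HashPartitioning(Seq(aI), numPartitions = 200)
    val pb = HashPartitioning(Seq(bI), numPartitions = 200)

    println(pa.compatibleWith(pb))               // false -> an extra Exchange may be added
    println(pa.compatibleWithIgnoringExprId(pb)) // true  -> the shuffle could be avoided
  }
}
{code}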



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12851) Add the ability to understand tables bucketed by Hive

2016-01-15 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12851:
---

 Summary: Add the ability to understand tables bucketed by Hive
 Key: SPARK-12851
 URL: https://issues.apache.org/jira/browse/SPARK-12851
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


We added bucketing functionality, but we currently do not understand the 
bucketing properties of tables generated by Hive. 
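
For context, a hedged sketch of the two sides of this gap, assuming the 
2.0-era SparkSession and DataFrameWriter.bucketBy API; the table and column 
names are made up. Spark records the bucketing spec for tables it writes 
itself, while bucketing metadata written by Hive (CLUSTERED BY ... INTO n 
BUCKETS) is what this issue asks Spark to understand.

{code}
// Hedged sketch; assumes the 2.0-era SparkSession / DataFrameWriter.bucketBy
// API and Hive support on the classpath. Table and column names are made up.
import org.apache.spark.sql.SparkSession

object BucketedWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucketing-sketch")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("i", "s")

    // Spark records the bucketing spec in the metastore for tables it writes
    // itself...
    df.write
      .bucketBy(8, "i") // 8 buckets hashed on column i
      .sortBy("i")
      .saveAsTable("spark_bucketed")

    // ...whereas a table Hive created with
    //   CREATE TABLE hive_bucketed (i INT, s STRING)
    //   CLUSTERED BY (i) INTO 8 BUCKETS;
    // is the case this issue tracks: Spark does not yet read that bucketing
    // metadata, so it cannot take advantage of it during planning.
    spark.table("spark_bucketed").groupBy("i").count().explain()

    spark.stop()
  }
}
{code}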



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12852) Support create table DDL with bucketing

2016-01-15 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12852:
---

 Summary: Support create table DDL with bucketing
 Key: SPARK-12852
 URL: https://issues.apache.org/jira/browse/SPARK-12852
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


