[jira] [Updated] (SPARK-21168) KafkaRDD should always set kafka clientId.

2017-06-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21168:
--
Target Version/s:   (was: 2.0.2)

> KafkaRDD should always set kafka clientId.
> --
>
> Key: SPARK-21168
> URL: https://issues.apache.org/jira/browse/SPARK-21168
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Xingxing Di
>Priority: Trivial
>
> I found that KafkaRDD does not set the Kafka client.id in the "fetchBatch" 
> method (FetchRequestBuilder sets clientId to empty by default). Normally this 
> affects nothing, but in our case we use the clientId on the Kafka server side, 
> so we have to rebuild spark-streaming-kafka.
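
For context, a minimal sketch (spark-streaming-kafka-0-8 style; broker, topic and application names are illustrative, not the reporter's job) of where a caller would normally supply client.id. Per this report, that value is not propagated to the FetchRequestBuilder used inside KafkaRDD.fetchBatch:
{code}
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("clientIdDemo"), Seconds(5))

// "client.id" is what the brokers use to identify this consumer on the server side.
val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092",
  "client.id"            -> "my-streaming-job"
)

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))
{code}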



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18267) Distribute PySpark via Python Package Index (pypi)

2017-06-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18267:
--
Target Version/s:   (was: 2.2.0)

> Distribute PySpark via Python Package Index (pypi)
> --
>
> Key: SPARK-18267
> URL: https://issues.apache.org/jira/browse/SPARK-18267
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Project Infra, PySpark
>Reporter: Reynold Xin
>
> The goal is to distribute PySpark via pypi, so users can simply run Spark on 
> a single node via "pip install pyspark" (or "pip install apache-spark").



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21206) the window slice of Dstream is wrong

2017-06-25 Thread Fei Shao (JIRA)
Fei Shao created SPARK-21206:


 Summary: the window slice of Dstream is wrong
 Key: SPARK-21206
 URL: https://issues.apache.org/jira/browse/SPARK-21206
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.1.0
Reporter: Fei Shao


The code is:

val conf = new SparkConf().setAppName("testDstream").setMaster("local[4]")
val ssc = new StreamingContext(conf, Seconds(1))

ssc.checkpoint( "path")
val lines = ssc.socketTextStream("IP", PORT)
lines.countByValueAndWindow( Seconds(2), Seconds(8)).foreachRDD( s => {
  println( "RDD ID IS : " + s.id)
  s.foreach( e => println("data is " + e._1 + " :" + e._2))
  println()
})

The result is wrong. 
I checked the log, and it showed:
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Time 1498383086000 ms is valid
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Window time = 2000 ms
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Slide time = 8000 ms
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Zero time = 1498383078000 ms
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Current window = [1498383085000 
ms, 1498383086000 ms]
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Previous window = 
[1498383077000 ms, 1498383078000 ms]
17/06/25 17:31:26 INFO ShuffledDStream: Slicing from 1498383077000 ms to 
1498383084000 ms (aligned to 1498383077000 ms and 1498383084000 ms)
17/06/25 17:31:26 INFO ShuffledDStream: Time 1498383078000 ms is invalid as 
zeroTime is 1498383078000 ms , slideDuration is 1000 ms and difference is 0 ms
17/06/25 17:31:26 DEBUG ShuffledDStream: Time 1498383079000 ms is valid
17/06/25 17:31:26 DEBUG MappedDStream: Time 1498383079000 ms is valid

The slice time is wrong.

[BTW]: Team members, if this is a bug, please don't fix it. I would like to try to fix it myself. Thanks :)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21206) the window slice of Dstream is wrong

2017-06-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062276#comment-16062276
 ] 

Sean Owen commented on SPARK-21206:
---

I'm not clear on what you are reporting here. What's expected vs. actual? You 
always want to be explicit about that when you file a JIRA.

> the window slice of Dstream is wrong
> 
>
> Key: SPARK-21206
> URL: https://issues.apache.org/jira/browse/SPARK-21206
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Fei Shao
>
> the code is :
> val conf = new SparkConf().setAppName("testDstream").setMaster("local[4]")
> val ssc = new StreamingContext(conf, Seconds(1))
> ssc.checkpoint( "path")
> val lines = ssc.socketTextStream("IP", PORT)
> lines.countByValueAndWindow( Seconds(2), Seconds(8)).foreachRDD( s => {
>   println( "RDD ID IS : " + s.id)
>   s.foreach( e => println("data is " + e._1 + " :" + e._2))
>   println()
> })
> The result is wrong. 
> I checked the log, it showed:
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Time 1498383086000 ms is valid
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Window time = 2000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Slide time = 8000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Zero time = 1498383078000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Current window = 
> [1498383085000 ms, 1498383086000 ms]
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Previous window = 
> [1498383077000 ms, 1498383078000 ms]
> 17/06/25 17:31:26 INFO ShuffledDStream: Slicing from 1498383077000 ms to 
> 1498383084000 ms (aligned to 1498383077000 ms and 1498383084000 ms)
> 17/06/25 17:31:26 INFO ShuffledDStream: Time 1498383078000 ms is invalid as 
> zeroTime is 1498383078000 ms , slideDuration is 1000 ms and difference is 0 ms
> 17/06/25 17:31:26 DEBUG ShuffledDStream: Time 1498383079000 ms is valid
> 17/06/25 17:31:26 DEBUG MappedDStream: Time 1498383079000 ms is valid
> The slice time is wrong.
> [BTW]: Team members, if this is a bug, please don't fix it. I would like to try to fix it myself. Thanks :)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21204) RuntimeException with Set and Case Class in Spark 2.1.1

2017-06-25 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062277#comment-16062277
 ] 

Takeshi Yamamuro commented on SPARK-21204:
--

I think the query is invalid:
{code}
scala> Seq((1, 1), (1, 2)).toDF("id", "value")
  .groupBy("id")
  .agg(functions.collect_set("value") as "value")
  .printSchema

root
 |-- id: integer (nullable = false)
 |-- value: array (nullable = true)
 ||-- element: integer (containsNull = true)
{code}
Do you want to cast the array to a set?
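
For reference, a minimal sketch of the Seq-based workaround mentioned in the description (the toy data and the SparkSession name are illustrative):
{code}
import org.apache.spark.sql.functions
import spark.implicits._   // assumes a SparkSession named `spark`

// Works today: the ArrayType column decodes to Seq[Int].
val bySeq = Seq((1, 1), (1, 2)).toDF("id", "value")
  .groupBy("id")
  .agg(functions.collect_set("value") as "value")
  .as[(Int, Seq[Int])]

// What the report asks for; fails in 2.1.1 with "No Encoder found for Set[Int]":
// .as[(Int, Set[Int])]
{code}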

> RuntimeException with Set and Case Class in Spark 2.1.1
> ---
>
> Key: SPARK-21204
> URL: https://issues.apache.org/jira/browse/SPARK-21204
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.1
>Reporter: Leo Romanovsky
>
> When attempting to produce a Dataset containing a Set, such as with:
> {code:java}
> dbData
>   .groupBy("userId")
>   .agg(functions.collect_set("friendId") as "friendIds")
>   .as[(Int, Set[Int])]
> {code}
> An exception occurs. This can be avoided by casting to a Seq, but sometimes 
> it makes more logical sense to have a Set, especially when using the collect_set 
> aggregation operation. Additionally, I am unable to write this Dataset to a 
> Cassandra table containing a Set column without first converting to an RDD.
> {code:java}
> [error] Exception in thread "main" java.lang.UnsupportedOperationException: 
> No Encoder found for Set[Int]
> [error] - field (class: "scala.collection.immutable.Set", name: "_2")
> [error] - root class: "scala.Tuple2"
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:602)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:596)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:587)
> [error]   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
> [error]   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
> [error]   at scala.collection.immutable.List.foreach(List.scala:381)
> [error]   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
> [error]   at scala.collection.immutable.List.flatMap(List.scala:344)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:587)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
> [error]   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
> [error]   at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
> [error]   at 
> org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:49)
> {code}
> I think the resolution to this might be similar to adding the Map type - 
> https://github.com/apache/spark/pull/16986



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21207) ML/MLLIB Save Word2Vec Yarn Cluster

2017-06-25 Thread offvolt (JIRA)
offvolt created SPARK-21207:
---

 Summary: ML/MLLIB Save Word2Vec Yarn Cluster 
 Key: SPARK-21207
 URL: https://issues.apache.org/jira/browse/SPARK-21207
 Project: Spark
  Issue Type: Question
  Components: ML, MLlib, PySpark, YARN
Affects Versions: 2.0.1
 Environment: OS : CentOS Linux release 7.3.1611 (Core) 

Clusters :
* vendor_id : GenuineIntel
* cpu family: 6
* model : 79
* model name: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Reporter: offvolt


Hello everyone, 

I have a question about the ML and MLlib libraries for Word2Vec, because I have a 
problem saving a model on a YARN cluster.

I already work with Word2Vec (MLlib):

from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec
from pyspark.mllib.feature import Word2VecModel

sc = SparkContext()
inp = sc.textFile(pathCorpus).map(lambda row: row.split(" "))
word2vec = Word2Vec().setVectorSize(k).setNumIterations(itera)

model = word2vec.fit(inp)
model.save(sc, pathModel)

This code works well in YARN cluster mode when I use spark-submit like:
spark-submit --conf spark.driver.maxResultSize=2G --master yarn --deploy-mode 
cluster  --driver-memory 16G --executor-memory 10G --num-executors 10 
--executor-cores 4 MyCode.py

+*But I want to use the new ML library, so I do this:*+

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import explode, split
from pyspark.ml.feature import Word2Vec
from pyspark.ml.feature import Word2VecModel
import numpy as np

pathModel = "hdfs:///user/test/w2v.model"

sc = SparkContext(appName = 'Test_App')
sqlContext = SQLContext(sc)

raw_text = sqlContext.read.text(corpusPath).select(split("value", " 
")).toDF("words")

numPart = raw_text.rdd.getNumPartitions() - 1

word2Vec = Word2Vec(vectorSize= k, inputCol="words", outputCol="features", 
minCount = minCount, maxIter= itera).setNumPartitions(numPart)
model = word2Vec.fit(raw_text)

model.findSynonyms("Paris", 20).show()

model.save(pathModel)

This code works in local mode, but when I try to deploy in cluster mode (as 
previously) I have a problem: when one node of the cluster writes into the HDFS 
folder, the others cannot write inside, so at the end I have an empty folder instead 
of plenty of Parquet files as with MLlib. I don't understand, because it works with 
MLlib but not with ML, with the same config when I submit my code.

Do you have an idea how I can solve this problem?

I hope I was clear enough. 

Thanks,




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21204) RuntimeException with Set and Case Class in Spark 2.1.1

2017-06-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21204:


Assignee: (was: Apache Spark)

> RuntimeException with Set and Case Class in Spark 2.1.1
> ---
>
> Key: SPARK-21204
> URL: https://issues.apache.org/jira/browse/SPARK-21204
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.1
>Reporter: Leo Romanovsky
>
> When attempting to produce a Dataset containing a Set, such as with:
> {code:java}
> dbData
>   .groupBy("userId")
>   .agg(functions.collect_set("friendId") as "friendIds")
>   .as[(Int, Set[Int])]
> {code}
> An exception occurs. This can be avoided by casting to a Seq, but sometimes 
> it makes more logical sense to have a Set, especially when using the collect_set 
> aggregation operation. Additionally, I am unable to write this Dataset to a 
> Cassandra table containing a Set column without first converting to an RDD.
> {code:java}
> [error] Exception in thread "main" java.lang.UnsupportedOperationException: 
> No Encoder found for Set[Int]
> [error] - field (class: "scala.collection.immutable.Set", name: "_2")
> [error] - root class: "scala.Tuple2"
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:602)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:596)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:587)
> [error]   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
> [error]   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
> [error]   at scala.collection.immutable.List.foreach(List.scala:381)
> [error]   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
> [error]   at scala.collection.immutable.List.flatMap(List.scala:344)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:587)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
> [error]   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
> [error]   at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
> [error]   at 
> org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:49)
> {code}
> I think the resolution to this might be similar to adding the Map type - 
> https://github.com/apache/spark/pull/16986



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21204) RuntimeException with Set and Case Class in Spark 2.1.1

2017-06-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062283#comment-16062283
 ] 

Apache Spark commented on SPARK-21204:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/18416

> RuntimeException with Set and Case Class in Spark 2.1.1
> ---
>
> Key: SPARK-21204
> URL: https://issues.apache.org/jira/browse/SPARK-21204
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.1
>Reporter: Leo Romanovsky
>
> When attempting to produce a Dataset containing a Set, such as with:
> {code:java}
> dbData
>   .groupBy("userId")
>   .agg(functions.collect_set("friendId") as "friendIds")
>   .as[(Int, Set[Int])]
> {code}
> An exception occurs. This can be avoided by casting to a Seq, but sometimes 
> it makes more logical sense to have a Set, especially when using the collect_set 
> aggregation operation. Additionally, I am unable to write this Dataset to a 
> Cassandra table containing a Set column without first converting to an RDD.
> {code:java}
> [error] Exception in thread "main" java.lang.UnsupportedOperationException: 
> No Encoder found for Set[Int]
> [error] - field (class: "scala.collection.immutable.Set", name: "_2")
> [error] - root class: "scala.Tuple2"
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:602)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:596)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:587)
> [error]   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
> [error]   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
> [error]   at scala.collection.immutable.List.foreach(List.scala:381)
> [error]   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
> [error]   at scala.collection.immutable.List.flatMap(List.scala:344)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:587)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
> [error]   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
> [error]   at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
> [error]   at 
> org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:49)
> {code}
> I think the resolution to this might be similar to adding the Map type - 
> https://github.com/apache/spark/pull/16986



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21204) RuntimeException with Set and Case Class in Spark 2.1.1

2017-06-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21204:


Assignee: Apache Spark

> RuntimeException with Set and Case Class in Spark 2.1.1
> ---
>
> Key: SPARK-21204
> URL: https://issues.apache.org/jira/browse/SPARK-21204
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.1
>Reporter: Leo Romanovsky
>Assignee: Apache Spark
>
> When attempting to produce a Dataset containing a Set, such as with:
> {code:java}
> dbData
>   .groupBy("userId")
>   .agg(functions.collect_set("friendId") as "friendIds")
>   .as[(Int, Set[Int])]
> {code}
> An exception occurs. This can be avoided by casting to a Seq, but sometimes 
> it makes more logical sense to have a Set, especially when using the collect_set 
> aggregation operation. Additionally, I am unable to write this Dataset to a 
> Cassandra table containing a Set column without first converting to an RDD.
> {code:java}
> [error] Exception in thread "main" java.lang.UnsupportedOperationException: 
> No Encoder found for Set[Int]
> [error] - field (class: "scala.collection.immutable.Set", name: "_2")
> [error] - root class: "scala.Tuple2"
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:602)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:596)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:587)
> [error]   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
> [error]   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
> [error]   at scala.collection.immutable.List.foreach(List.scala:381)
> [error]   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
> [error]   at scala.collection.immutable.List.flatMap(List.scala:344)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:587)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
> [error]   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
> [error]   at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
> [error]   at 
> org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:49)
> {code}
> I think the resolution to this might be similar to adding the Map type - 
> https://github.com/apache/spark/pull/16986



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21206) the window slice of Dstream is wrong

2017-06-25 Thread Fei Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062284#comment-16062284
 ] 

Fei Shao commented on SPARK-21206:
--

Hi Sean Owen,

I am sorry, I did not give enough information about this issue.

For my test code:

lines.countByValueAndWindow( Seconds(2), Seconds(8)).foreachRDD( s => { 
<=== here the windowDuration is 2 seconds and the slideDuration is 8 seconds. 


===log begin 
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Current window = [1498383085000 
ms, 1498383086000 ms]
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Previous window = 
[1498383077000 ms, 1498383078000 ms]
17/06/25 17:31:26 INFO ShuffledDStream: Slicing from 1498383077000 ms to 
1498383084000 ms (aligned to 1498383077000 ms and 1498383084000 ms)  <=== here, 
the old RDDs slice from 1498383077000 to 1498383084000. That is 8 seconds. 
Actually, it should be 2 seconds.
===log end

===code in ReducedWindowedDStream.scala begin
  override def compute(validTime: Time): Option[RDD[(K, V)]] = {
val reduceF = reduceFunc
val invReduceF = invReduceFunc

val currentTime = validTime
val currentWindow = new Interval(currentTime - windowDuration + 
parent.slideDuration,
  currentTime)
val previousWindow = currentWindow - slideDuration

logDebug("Window time = " + windowDuration)
logDebug("Slide time = " + slideDuration)
logDebug("Zero time = " + zeroTime)
logDebug("Current window = " + currentWindow)
logDebug("Previous window = " + previousWindow)

//
//                      _____________________________
//                     |  previous window   _________|___________________
//                     |___________________|       current window        |  --> Time
//                                         |_____________________________|
//
// |________ _________|          |________ _________|
//          |                             |
//          V                             V
//       old RDDs                     new RDDs
//

// Get the RDDs of the reduced values in "old time steps"
val oldRDDs =
  reducedStream.slice(previousWindow.beginTime, currentWindow.beginTime - 
parent.slideDuration) <== I think this line should be 
"reducedStream.slice(previousWindow.beginTime, currentWindow.beginTime + 
windowDuration - parent.slideDuration)"
logDebug("# old RDDs = " + oldRDDs.size)

// Get the RDDs of the reduced values in "new time steps"
val newRDDs =
  reducedStream.slice(previousWindow.endTime + parent.slideDuration, 
currentWindow.endTime) <== this line should be 
"reducedStream.slice(previousWindow.endTime + windowDuration - 
parent.slideDuration,
currentWindow.endTime)"
logDebug("# new RDDs = " + newRDDs.size)

===code in ReducedWindowedDStream.scala  end
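
To make the reported numbers concrete, a small arithmetic sketch (plain Scala, values copied from the log above) of the slice boundaries the current code produces:
{code}
// All values in ms, taken from the log above.
val windowDuration = 2000L
val slideDuration  = 8000L
val parentSlide    = 1000L
val currentTime    = 1498383086000L

// currentWindow = [currentTime - windowDuration + parent.slideDuration, currentTime]
val currentWindowBegin  = currentTime - windowDuration + parentSlide   // 1498383085000
// previousWindow = currentWindow - slideDuration
val previousWindowBegin = currentWindowBegin - slideDuration           // 1498383077000

// oldRDDs slice as computed today:
// [previousWindow.beginTime, currentWindow.beginTime - parent.slideDuration]
val oldSliceFrom = previousWindowBegin                // 1498383077000
val oldSliceTo   = currentWindowBegin - parentSlide   // 1498383084000
// => exactly the "Slicing from 1498383077000 ms to 1498383084000 ms" seen in the
//    log, i.e. an 8-batch span, while the window itself is only 2 seconds wide.
{code}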



> the window slice of Dstream is wrong
> 
>
> Key: SPARK-21206
> URL: https://issues.apache.org/jira/browse/SPARK-21206
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Fei Shao
>
> the code is :
> val conf = new SparkConf().setAppName("testDstream").setMaster("local[4]")
> val ssc = new StreamingContext(conf, Seconds(1))
> ssc.checkpoint( "path")
> val lines = ssc.socketTextStream("IP", PORT)
> lines.countByValueAndWindow( Seconds(2), Seconds(8)).foreachRDD( s => {
>   println( "RDD ID IS : " + s.id)
>   s.foreach( e => println("data is " + e._1 + " :" + e._2))
>   println()
> })
> The result is wrong. 
> I checked the log, it showed:
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Time 1498383086000 ms is valid
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Window time = 2000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Slide time = 8000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Zero time = 1498383078000 ms
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Current window = 
> [1498383085000 ms, 1498383086000 ms]
> 17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Previous window = 
> [1498383077000 ms, 1498383078000 ms]
> 17/06/25 17:31:26 INFO ShuffledDStream: Slicing from 1498383077000 ms to 
> 1498383084000 ms (aligned to 1498383077000 ms and 1498383084000 ms)
> 17/06/25 17:31:26 INFO ShuffledDStream: Time 1498383078000 ms is invalid as 
> zeroTime is 1498383078000 ms , slideDuration is 1000 ms and difference is 0 ms
> 17/06/25 17:31:26 DEBUG ShuffledDStream: Time 1498383079000 ms is valid
> 17/06/25 17:31:26 DEBUG MappedDStream: Time 1498383079000 ms is valid
> The slice time is wrong.
> [BTW]: Team members, if this is a bug, please don't fix it. I would like to try to fix it myself. Thanks :)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-21206) the window slice of Dstream is wrong

2017-06-25 Thread Fei Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062284#comment-16062284
 ] 

Fei Shao edited comment on SPARK-21206 at 6/25/17 10:57 AM:


Hi Sean Owen,

I am sorry, I did not give enough information about this issue.

For my test code:

lines.countByValueAndWindow( Seconds(2), Seconds(8)).foreachRDD( s => { 
<=== here the windowDuration is 2 seconds and the slideDuration is 8 seconds. 


===log begin 
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Current window = [1498383085000 
ms, 1498383086000 ms]
17/06/25 17:31:26 DEBUG ReducedWindowedDStream: Previous window = 
[1498383077000 ms, 1498383078000 ms]
17/06/25 17:31:26 INFO ShuffledDStream: Slicing from 1498383077000 ms to 
1498383084000 ms (aligned to 1498383077000 ms and 1498383084000 ms)  <=== here, 
the old RDDs slice from 1498383077000 to 1498383084000. That is 8 seconds. 
Actually, it should be 2 seconds.
===log end

===code in ReducedWindowedDStream.scala begin
  override def compute(validTime: Time): Option[RDD[(K, V)]] = {
val reduceF = reduceFunc
val invReduceF = invReduceFunc

val currentTime = validTime
val currentWindow = new Interval(currentTime - windowDuration + 
parent.slideDuration,
  currentTime)
val previousWindow = currentWindow - slideDuration

logDebug("Window time = " + windowDuration)
logDebug("Slide time = " + slideDuration)
logDebug("Zero time = " + zeroTime)
logDebug("Current window = " + currentWindow)
logDebug("Previous window = " + previousWindow)

//
//                      _____________________________
//                     |  previous window   _________|___________________
//                     |___________________|       current window        |  --> Time
//                                         |_____________________________|
//
// |________ _________|          |________ _________|
//          |                             |
//          V                             V
//       old RDDs                     new RDDs
//

// Get the RDDs of the reduced values in "old time steps"
val oldRDDs =
  reducedStream.slice(previousWindow.beginTime, currentWindow.beginTime - 
parent.slideDuration) <== I think this line should be 
"reducedStream.slice(previousWindow.beginTime, currentWindow.beginTime + 
windowDuration - parent.slideDuration)"
logDebug("# old RDDs = " + oldRDDs.size)

// Get the RDDs of the reduced values in "new time steps"
val newRDDs =
  reducedStream.slice(previousWindow.endTime + parent.slideDuration, 
currentWindow.endTime) <== this line should be 
"reducedStream.slice(previousWindow.endTime + windowDuration - 
parent.slideDuration,
currentWindow.endTime)"
logDebug("# new RDDs = " + newRDDs.size)

===code in ReducedWindowedDStream.scala  end





[jira] [Resolved] (SPARK-16649) Push partition predicates down into metastore for OptimizeMetadataOnlyQuery

2017-06-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-16649.
--
Resolution: Cannot Reproduce

I am resolving this per 
https://github.com/apache/spark/pull/14285#issuecomment-309171604 but I don't 
know which JIRA fixed it. Please fix my change to {{Resolution}} if anyone knows.

> Push partition predicates down into metastore for OptimizeMetadataOnlyQuery
> ---
>
> Key: SPARK-16649
> URL: https://issues.apache.org/jira/browse/SPARK-16649
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> SPARK-6910 has supported for  pushing partition predicates down into the Hive 
> metastore for table scan. So it also should push partition predicates down 
> into metastore for OptimizeMetadataOnlyQuery.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16946) saveAsTable[append] with different number of columns should throw Exception

2017-06-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-16946.
--
Resolution: Cannot Reproduce

I am resolving this per 
https://github.com/apache/spark/pull/14535#issuecomment-309930981 but I don't 
know which JIRA fixed it. Please fix my change to Resolution if anyone knows.

> saveAsTable[append] with different number of columns should throw Exception
> ---
>
> Key: SPARK-16946
> URL: https://issues.apache.org/jira/browse/SPARK-16946
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Huaxin Gao
>Priority: Minor
>
> In HiveContext, if saveAsTable[append] is given a different number of columns, 
> Spark will throw an exception. 
> e.g.
> {code}
> test("saveAsTable[append]: too many columns") {
>   withTable("saveAsTable_too_many_columns") {
> Seq((1, 2)).toDF("i", 
> "j").write.saveAsTable("saveAsTable_too_many_columns")
> val e = intercept[AnalysisException] {
>   Seq((3, 4, 5)).toDF("i", "j", 
> "k").write.mode("append").saveAsTable("saveAsTable_too_many_columns")
> }
> assert(e.getMessage.contains("doesn't match"))
>   }
> }
> {code}
> However, in SparkSession or SQLContext, if we use the above code example, the 
> extra column in the appended data is removed silently without any warning 
> or exception. The table becomes
> i  j
> 3  4
> 1  2
> We may want to follow the HiveContext behavior and throw an exception.
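
A minimal sketch of the SparkSession path described above (the table name is illustrative and a SparkSession named spark with Hive support is assumed); per the report, the extra column k is dropped silently instead of raising an exception:
{code}
import spark.implicits._   // assumes a SparkSession named `spark`

Seq((1, 2)).toDF("i", "j").write.saveAsTable("append_mismatch_demo")
Seq((3, 4, 5)).toDF("i", "j", "k")
  .write.mode("append").saveAsTable("append_mismatch_demo")   // no exception is thrown here

spark.table("append_mismatch_demo").show()   // per the report, only columns i and j remain
{code}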



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21207) ML/MLLIB Save Word2Vec Yarn Cluster

2017-06-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21207.
---
  Resolution: Invalid
Target Version/s:   (was: 2.0.1)

Questions go on the mailing list, like user@
http://spark.apache.org/contributing.html

> ML/MLLIB Save Word2Vec Yarn Cluster 
> 
>
> Key: SPARK-21207
> URL: https://issues.apache.org/jira/browse/SPARK-21207
> Project: Spark
>  Issue Type: Question
>  Components: ML, MLlib, PySpark, YARN
>Affects Versions: 2.0.1
> Environment: OS : CentOS Linux release 7.3.1611 (Core) 
> Clusters :
> * vendor_id   : GenuineIntel
> * cpu family  : 6
> * model   : 79
> * model name  : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
>Reporter: offvolt
>
> Hello everyone, 
> I have a question about the ML and MLlib libraries for Word2Vec, because I have a 
> problem saving a model on a YARN cluster.
> I already work with Word2Vec (MLlib):
> from pyspark import SparkContext
> from pyspark.mllib.feature import Word2Vec
> from pyspark.mllib.feature import Word2VecModel
> sc = SparkContext()
> inp = sc.textFile(pathCorpus).map(lambda row: row.split(" "))
> word2vec = Word2Vec().setVectorSize(k).setNumIterations(itera)
> model = word2vec.fit(inp)
> model.save(sc, pathModel)
> This code works well in YARN cluster mode when I use spark-submit like: 
> spark-submit --conf spark.driver.maxResultSize=2G --master yarn --deploy-mode 
> cluster  --driver-memory 16G --executor-memory 10G --num-executors 10 
> --executor-cores 4 MyCode.py
> +*But I want to use the new ML library, so I do this:*+
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> from pyspark.sql.functions import explode, split
> from pyspark.ml.feature import Word2Vec
> from pyspark.ml.feature import Word2VecModel
> import numpy as np
> pathModel = "hdfs:///user/test/w2v.model"
> sc = SparkContext(appName = 'Test_App')
> sqlContext = SQLContext(sc)
> raw_text = sqlContext.read.text(corpusPath).select(split("value", " 
> ")).toDF("words")
> numPart = raw_text.rdd.getNumPartitions() - 1
> word2Vec = Word2Vec(vectorSize= k, inputCol="words", outputCol="features", 
> minCount = minCount, maxIter= itera).setNumPartitions(numPart)
> model = word2Vec.fit(raw_text)
> model.findSynonyms("Paris", 20).show()
> model.save(pathModel)
> This code works in local mode, but when I try to deploy in cluster mode (as 
> previously) I have a problem: when one node of the cluster writes into the HDFS 
> folder, the others cannot write inside, so at the end I have an empty folder 
> instead of plenty of Parquet files as with MLlib. I don't understand, because it 
> works with MLlib but not with ML, with the same config when I submit my code.
> Do you have an idea how I can solve this problem?
> I hope I was clear enough. 
> Thanks,



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11057) SQL: corr and cov for many columns

2017-06-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11057.
---
Resolution: Won't Fix

> SQL: corr and cov for many columns
> --
>
> Key: SPARK-11057
> URL: https://issues.apache.org/jira/browse/SPARK-11057
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Narine Kokhlikyan
>
> Hi there,
> As we know, R has the option to calculate the correlation and covariance for 
> all columns of a dataframe or between columns of two dataframes.
> If we look at the Apache Commons Math package, we can see that they have that too. 
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> In case we have as input only one DataFrame:
> --
> for correlation:
> cor[i,j] = cor[j,i]
> and for the main diagonal we can have 1s.
> -
> for covariance: 
> cov[i,j] = cov[j,i]
> and for main diagonal: we can compute the variance for that specific column:
> See:
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> Let me know what you think.
> I'm working on this and will make a pull request soon.
> Thanks,
> Narine
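
For comparison, a minimal sketch of how a full correlation matrix has to be assembled today with the existing pairwise DataFrame API (the {{df}} variable and column list are illustrative):
{code}
import org.apache.spark.sql.DataFrame

// Pairwise Pearson correlation via df.stat.corr, assembled into a full symmetric matrix.
def corrMatrix(df: DataFrame, cols: Seq[String]): Array[Array[Double]] =
  cols.map { c1 =>
    cols.map { c2 =>
      if (c1 == c2) 1.0 else df.stat.corr(c1, c2)
    }.toArray
  }.toArray
{code}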



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10796) The Stage taskSets may are all removed while stage still have pending partitions after having lost some executors

2017-06-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10796.
---
Resolution: Won't Fix

> The Stage taskSets may are all removed while stage still have pending 
> partitions after having lost some executors
> -
>
> Key: SPARK-10796
> URL: https://issues.apache.org/jira/browse/SPARK-10796
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: SuYan
>Priority: Minor
>
> Description:
> 1. A running ShuffleMapStage may have multiple TaskSets: one active TaskSet, 
> multiple zombie TaskSets, and multiple removed TaskSets.
> A running ShuffleMapStage succeeds only once all of its partitions have been 
> processed successfully, i.e., each task's MapStatus has been added to outputLocs.
> The MapStatus entries of a running ShuffleMapStage may have been produced by 
> RemovedTaskSet1... / Zombie TaskSet1 / Zombie TaskSet2 / ... / Active TaskSetN, so 
> there is a chance that some output is held only by a removed or zombie TaskSet.
> If an executor is lost, it can happen that some of the MapStatus entries related 
> to the lost executor were produced by a zombie TaskSet.
> In the current logic, the solution to that lost-MapStatus problem is that each 
> TaskSet re-runs the tasks that had succeeded on the lost executor: they are 
> re-added to the TaskSet's pendingTasks, 
> and their partitions are re-added to the Stage's pendingPartitions.
> But this is useless if the lost MapStatus belongs only to a zombie/removed 
> TaskSet: since it is a zombie, its pendingTasks will never be scheduled.
> The only condition for resubmitting a stage is that some task throws 
> FetchFailedException, but the lost executor may not invalidate any MapStatus 
> of a parent stage for one of the running stages, 
> and it can happen that the running stage lost a MapStatus belonging only to a 
> zombie task or removed TaskSet.
> So if all zombie TaskSets have finished their running tasks and the active 
> TaskSet has processed all of its pending tasks, they will all be removed by 
> TaskSchedulerImpl while the running stage's pending partitions are still 
> non-empty. The stage then hangs.
> {code}
>  test("Resubmit stage while lost partition in ZombieTasksets or 
> RemovedTaskSets") {
> val firstRDD = new MyRDD(sc, 3, Nil)
> val firstShuffleDep = new ShuffleDependency(firstRDD, new 
> HashPartitioner(3))
> val firstShuffleId = firstShuffleDep.shuffleId
> val shuffleMapRdd = new MyRDD(sc, 3, List(firstShuffleDep))
> val shuffleDep = new ShuffleDependency(shuffleMapRdd, new 
> HashPartitioner(3))
> val reduceRdd = new MyRDD(sc, 1, List(shuffleDep))
> submit(reduceRdd, Array(0))
> // things start out smoothly, stage 0 completes with no issues
> complete(taskSets(0), Seq(
>   (Success, makeMapStatus("hostB", shuffleMapRdd.partitions.length)),
>   (Success, makeMapStatus("hostB", shuffleMapRdd.partitions.length)),
>   (Success, makeMapStatus("hostA", shuffleMapRdd.partitions.length))
> ))
> // then start running stage 1
> runEvent(makeCompletionEvent(
>   taskSets(1).tasks(0),
>   Success,
>   makeMapStatus("hostD", shuffleMapRdd.partitions.length)))
> // simulate make stage 1 resubmit, notice for stage1.0
> // partitionId=1 already finished in hostD, so if we resubmit stage1,
> // stage 1.1 only resubmit tasks for partitionId = 0,2
> runEvent(makeCompletionEvent(
>   taskSets(1).tasks(1),
>   FetchFailed(null, firstShuffleId, 2, 1, "Fetch failed"), null))
> scheduler.resubmitFailedStages()
> val stage1Resubmit1 = taskSets(2)
> assert(stage1Resubmit1.stageId == 1)
> assert(stage1Resubmit1.tasks.size == 2)
> // now exec-hostD lost, so the output loc of stage1 partitionId=1 will 
> lost.
> runEvent(ExecutorLost("exec-hostD"))
> runEvent(makeCompletionEvent(taskSets(1).tasks(0), Resubmitted, null))
> // let stage1Resubmit1 complete
> complete(taskSets(2), Seq(
>   (Success, makeMapStatus("hostB", shuffleMapRdd.partitions.length)),
>   (Success, makeMapStatus("hostB", shuffleMapRdd.partitions.length))
> ))
> // and let we complete tasksets1.0's active running Tasks
> runEvent(makeCompletionEvent(
>   taskSets(1).tasks(1),
>   Success,
>   makeMapStatus("hostD", shuffleMapRdd.partitions.length)))
> runEvent(makeCompletionEvent(
>   taskSets(1).tasks(2),
>   Success,
>   makeMapStatus("hostD", shuffleMapRdd.partitions.length)))
> // Now all runningTasksets for stage1 was all completed. 
> assert(scheduler.runningStages.head.pendingPartitions.head == 0)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-12948) Consider reducing size of broadcasts in OrcRelation

2017-06-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12948.
---
Resolution: Won't Fix

> Consider reducing size of broadcasts in OrcRelation
> ---
>
> Key: SPARK-12948
> URL: https://issues.apache.org/jira/browse/SPARK-12948
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
> Attachments: SPARK-12948_cpuProf.png, 
> SPARK-12948.mem.prof.snapshot.png
>
>
> The size of the broadcast data in OrcRelation was significantly higher when 
> running a query with a large number of partitions (e.g., TPC-DS). Consider reducing the 
> size of the broadcasted data in OrcRelation, as it has an impact on the job 
> runtime.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15763) Add DELETE FILE command support in spark

2017-06-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15763.
---
Resolution: Won't Fix

> Add DELETE FILE command support in spark
> 
>
> Key: SPARK-15763
> URL: https://issues.apache.org/jira/browse/SPARK-15763
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: kevin yu
>
> Currently Spark SQL supports "ADD FILE/JAR", but not "DELETE FILE/JAR". I am 
> adding support for "DELETE FILE" from the Spark context. Hive supports the 
> "ADD/DELETE/LIST FILE/JAR" 
> commands: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli
> I will submit DELETE JAR in another JIRA.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19869) move table related ddl from ddl.scala to tables.scala

2017-06-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19869.
---
Resolution: Won't Fix

> move table related ddl from ddl.scala to tables.scala
> -
>
> Key: SPARK-19869
> URL: https://issues.apache.org/jira/browse/SPARK-19869
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Priority: Minor
>
> move table related ddl from ddl.scala to tables.scala



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20413) New Optimizer Hint to prevent collapsing of adjacent projections

2017-06-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20413.
---
Resolution: Won't Fix

> New Optimizer Hint to prevent collapsing of adjacent projections
> 
>
> Key: SPARK-20413
> URL: https://issues.apache.org/jira/browse/SPARK-20413
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>
> I am proposing that a new optimizer hint called NO_COLLAPSE be introduced. 
> This hint is essentially identical to Oracle's NO_MERGE hint. 
> Let me first give an example of why I am proposing this. 
> {noformat}
> df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) 
> df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"])) 
> df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), 
> df2["ua"].browser_version.alias("c2")) 
> df3.explain(True) 
> == Parsed Logical Plan == 
> 'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS 
> c2#91] 
> +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] 
>+- LogicalRDD [id#80L, user_agent#81] 
> == Analyzed Logical Plan == 
> c1: string, c2: string 
> Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS c2#91] 
> +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] 
>+- LogicalRDD [id#80L, user_agent#81] 
> == Optimized Logical Plan == 
> Project [UDF(user_agent#81).device_form_factor AS c1#90, 
> UDF(user_agent#81).browser_version AS c2#91] 
> +- LogicalRDD [id#80L, user_agent#81] 
> == Physical Plan == 
> *Project [UDF(user_agent#81).device_form_factor AS c1#90, 
> UDF(user_agent#81).browser_version AS c2#91] 
> +- Scan ExistingRDD[id#80L,user_agent#81] 
> {noformat}
> user_agent_details is a user-defined function that returns a struct. As can 
> be seen from the generated query plan, the function is being executed 
> multiple times which could lead to performance issues. This is due to the 
> CollapseProject optimizer rule that collapses adjacent projections. 
> I'm proposing a hint that will prevent the optimizer from collapsing adjacent 
> projections. A new function called 'no_collapse' would be introduced for this 
> purpose. Consider the following example and generated query plan. 
> {noformat}
> df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) 
> df2 = F.no_collapse(df1.withColumn("ua", 
> user_agent_details(df1["user_agent"]))) 
> df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), 
> df2["ua"].browser_version.alias("c2")) 
> df3.explain(True) 
> == Parsed Logical Plan == 
> 'Project [ua#69[device_form_factor] AS c1#75, ua#69[browser_version] AS 
> c2#76] 
> +- NoCollapseHint 
>+- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] 
>   +- LogicalRDD [id#64L, user_agent#65] 
> == Analyzed Logical Plan == 
> c1: string, c2: string 
> Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] 
> +- NoCollapseHint 
>+- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] 
>   +- LogicalRDD [id#64L, user_agent#65] 
> == Optimized Logical Plan == 
> Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] 
> +- NoCollapseHint 
>+- Project [UDF(user_agent#65) AS ua#69] 
>   +- LogicalRDD [id#64L, user_agent#65] 
> == Physical Plan == 
> *Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] 
> +- *Project [UDF(user_agent#65) AS ua#69] 
>+- Scan ExistingRDD[id#64L,user_agent#65] 
> {noformat}
> As can be seen from the query plan, the user-defined function is now 
> evaluated once per row.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20115) Fix DAGScheduler to recompute all the lost shuffle blocks when external shuffle service is unavailable

2017-06-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20115.
---
Resolution: Won't Fix

> Fix DAGScheduler to recompute all the lost shuffle blocks when external 
> shuffle service is unavailable
> --
>
> Key: SPARK-20115
> URL: https://issues.apache.org/jira/browse/SPARK-20115
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core, YARN
>Affects Versions: 2.0.2, 2.1.0
> Environment: Spark on Yarn with external shuffle service enabled, 
> running on AWS EMR cluster.
>Reporter: Udit Mehrotra
>
> The Spark’s DAGScheduler currently does not recompute all the lost shuffle 
> blocks on a host when a FetchFailed exception occurs, while fetching shuffle 
> blocks from another executor with external shuffle service enabled. Instead 
> it only recomputes the lost shuffle blocks computed by the executor for which 
> the FetchFailed exception occurred. This works fine for Internal shuffle 
> scenario, where the executors serve their own shuffle blocks and hence only 
> the shuffle blocks for that executor should be considered lost. However, when 
> External Shuffle Service is being used, a FetchFailed exception would mean 
> that the external shuffle service running on that host has become 
> unavailable. This in turn is sufficient to assume that all the shuffle blocks 
> which were managed by the Shuffle service on that host are lost. Therefore, 
> just recomputing the shuffle blocks associated with the particular Executor 
> for which FetchFailed exception occurred is not sufficient. We need to 
> recompute all the shuffle blocks, managed by that service because there could 
> be multiple executors running on that host.
>  
> Since not all the shuffle blocks (for all the executors on the host) are 
> recomputed, this causes future attempts of the reduce stage to fail as well 
> because the new tasks scheduled still keep trying to reach the old location 
> of the shuffle blocks (which were not recomputed) and keep throwing further 
> FetchFailed exceptions. This ultimately causes the job to fail, after the 
> reduce stage has been retried 4 times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21208) Ability to "setLocalProperty" from sc, in sparkR

2017-06-25 Thread Karuppayya (JIRA)
Karuppayya created SPARK-21208:
--

 Summary: Ability to "setLocalProperty" from sc, in sparkR
 Key: SPARK-21208
 URL: https://issues.apache.org/jira/browse/SPARK-21208
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.1
Reporter: Karuppayya


I checked the API 
[documentation|https://spark.apache.org/docs/latest/api/R/index.html] for 
SparkR and was not able to find a way to *setLocalProperty* on sc.
We need the ability to *setLocalProperty* on the SparkContext (similar to what is 
available for PySpark and Scala).
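
For reference, the Scala equivalent being asked for (a sketch; the property name below is only an example):
{code}
// Tag all jobs submitted from this thread, e.g. for the fair scheduler.
sc.setLocalProperty("spark.scheduler.pool", "reporting")
// ... run jobs ...
sc.setLocalProperty("spark.scheduler.pool", null)   // clear the property again
{code}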



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2017-06-25 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062368#comment-16062368
 ] 

Yuming Wang commented on SPARK-21067:
-

Have you enabled [HDFS 
Federation|http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/Federation.html]?
 Try adding this configuration to {{hive-site.xml}}:
{code:xml}
<property>
  <name>hive.exec.stagingdir</name>
  <value>/user/hive/warehouse/staging/.hive-staging</value>
</property>
{code}


> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021) which state that the 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}"
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment but since 2.0+, we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
> at 
> org.apache.spark.sql.execution.QueryExecution

[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster

2017-06-25 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062370#comment-16062370
 ] 

Yuming Wang commented on SPARK-21063:
-

Maybe you should do this before submitting your Spark application:
{code}
export HADOOP_CONF_DIR=/path/to/remote/hadoop/conf
export HIVE_CONF_DIR=/path/to/remote/hive/conf
{code}


> Spark return an empty result from remote hadoop cluster
> ---
>
> Key: SPARK-21063
> URL: https://issues.apache.org/jira/browse/SPARK-21063
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Peter Bykov
>
> Spark returns an empty result when querying a remote hadoop cluster.
> All firewall settings have been removed.
> Querying via JDBC works properly using the hive-jdbc driver, version 1.1.1.
> Code snippet is:
> {code:java}
> val spark = SparkSession.builder
> .appName("RemoteSparkTest")
> .master("local")
> .getOrCreate()
> val df = spark.read
>   .option("url", "jdbc:hive2://remote.hive.local:1/default")
>   .option("user", "user")
>   .option("password", "pass")
>   .option("dbtable", "test_table")
>   .option("driver", "org.apache.hive.jdbc.HiveDriver")
>   .format("jdbc")
>   .load()
>  
> df.show()
> {code}
> Result:
> {noformat}
> +---+
> |test_table.test_col|
> +---+
> +---+
> {noformat}
> All manipulations like: 
> {code:java}
> df.select(*).show()
> {code}
> return an empty result too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21208) Ability to "setLocalProperty" from sc, in sparkR

2017-06-25 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062396#comment-16062396
 ] 

Felix Cheung commented on SPARK-21208:
--

+1

> Ability to "setLocalProperty" from sc, in sparkR
> 
>
> Key: SPARK-21208
> URL: https://issues.apache.org/jira/browse/SPARK-21208
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Karuppayya
>
> Checked the API 
> [documentation|https://spark.apache.org/docs/latest/api/R/index.html] for 
> sparkR.
> Was not able to find a way to *setLocalProperty* on sc.
> Need ability to *setLocalProperty* on sparkContext (similar to what is available 
> for pyspark, scala)
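
For reference, this is roughly what the existing Scala call looks like (PySpark mirrors it); the property name below is only an illustrative example, not something taken from this ticket:

{code}
// Scala: set a thread-local property on the SparkContext; an R wrapper would expose the same call.
// "spark.scheduler.pool" is just an example property name.
sc.setLocalProperty("spark.scheduler.pool", "production")

// read it back on the same thread
val pool = sc.getLocalProperty("spark.scheduler.pool")
{code}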



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR

2017-06-25 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-21093.
--
  Resolution: Fixed
Assignee: Hyukjin Kwon
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> Multiple gapply execution occasionally failed in SparkR 
> 
>
> Key: SPARK-21093
> URL: https://issues.apache.org/jira/browse/SPARK-21093
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
> Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 2.3.0
>
>
> On CentOS 7.2.1511 with R 3.4.0/3.3.0, multiple executions of {{gapply}} 
> occasionally fail as below:
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.3.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> Error in handleErrors(returnStatus, conn) :
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 
> in stage 14.0 failed 1 times, most recent failure: Lost task 98.0 in stage 
> 14.0 (TID 1305, localhost, executor driver): org.apache.spark.SparkException: 
> R computation failed with
> at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:432)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:414)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.a
> ...
> *** buffer overflow detected ***: /usr/lib64/R/bin/exec/R terminated
> === Backtrace: =
> /lib64/libc.so.6(__fortify_fail+0x37)[0x7fe699b3f597]
> /lib64/libc.so.6(+0x10c750)[0x7fe699b3d750]
> /lib64/libc.so.6(+0x10e507)[0x7fe699b3f507]
> /usr/lib64/R/modules//internet.so(+0x6015)[0x7fe689bb7015]
> /usr/lib64/R/modules//internet.so(+0xe81e)[0x7fe689bbf81e]
> /usr/lib64/R/lib/libR.so(+0xbd1b6)[0x7fe69c54a1b6]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x354)[0x7fe69c5ad2f4]
> /usr/lib64/R/lib/libR.so(+0x123f8e)[0x7fe69c5b0f8e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x589)[0x7fe69c5ad529]
> /usr/lib64/R/lib/libR.so(+0x1254ce)[0x7fe69c5b24ce]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x817)[0x7fe69c5ad7b7]
> /usr/lib64/R/lib/libR.so(+0x1256d1)[0x7fe69c5b26d1]
> /usr/lib64/R/lib/libR.so(+0x1552e9)[0x7fe69c5e22e9]
> /usr/lib64/R/lib/libR.so(+0x11062a)[0x7fe69c59d62a]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7f

[jira] [Commented] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR

2017-06-25 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062399#comment-16062399
 ] 

Felix Cheung commented on SPARK-21093:
--

Since this is a very core change to SparkR, we agree it might not be best to 
push it to the 2.2 branch right now, as we are so close to releasing 2.2.0; let it 
cook for a bit first
(though the RC sign-off might still run into this issue then).

> Multiple gapply execution occasionally failed in SparkR 
> 
>
> Key: SPARK-21093
> URL: https://issues.apache.org/jira/browse/SPARK-21093
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
> Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 2.3.0
>
>
> On CentOS 7.2.1511 with R 3.4.0/3.3.0, multiple executions of {{gapply}} 
> occasionally fail as below:
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.3.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> Error in handleErrors(returnStatus, conn) :
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 
> in stage 14.0 failed 1 times, most recent failure: Lost task 98.0 in stage 
> 14.0 (TID 1305, localhost, executor driver): org.apache.spark.SparkException: 
> R computation failed with
> at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:432)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:414)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.a
> ...
> *** buffer overflow detected ***: /usr/lib64/R/bin/exec/R terminated
> === Backtrace: =
> /lib64/libc.so.6(__fortify_fail+0x37)[0x7fe699b3f597]
> /lib64/libc.so.6(+0x10c750)[0x7fe699b3d750]
> /lib64/libc.so.6(+0x10e507)[0x7fe699b3f507]
> /usr/lib64/R/modules//internet.so(+0x6015)[0x7fe689bb7015]
> /usr/lib64/R/modules//internet.so(+0xe81e)[0x7fe689bbf81e]
> /usr/lib64/R/lib/libR.so(+0xbd1b6)[0x7fe69c54a1b6]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x354)[0x7fe69c5ad2f4]
> /usr/lib64/R/lib/libR.so(+0x123f8e)[0x7fe69c5b0f8e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x589)[0x7fe69c5ad529]
> /usr/lib64/R/lib/libR.so(+0x1254ce)[0x7fe69c5b24ce]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x817)[0x7fe69c5ad7b7]
> /usr/lib64/R/lib/libR.so(+0x1256d1)[0x7fe69c5b26d1]
> /usr/lib64/R/lib/libR.so(+0x1552e9)[0x7fe69c5e22e9]
> /usr/lib64/R/lib/libR.so(+0x11062a)[0x7fe69c59d62a]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x

[jira] [Comment Edited] (SPARK-12806) Support SQL expressions extracting values from VectorUDT

2017-06-25 Thread Franklyn Dsouza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061718#comment-16061718
 ] 

Franklyn Dsouza edited comment on SPARK-12806 at 6/25/17 11:32 PM:
---

[~mengxr] could you please take a look? We have had to jump through a lot of 
hoops to work around this.


was (Author: franklyndsouza):
[~mengxr] could you please take a look ? we have had to jumpy through a lot of 
hoops to work around this.

> Support SQL expressions extracting values from VectorUDT
> 
>
> Key: SPARK-12806
> URL: https://issues.apache.org/jira/browse/SPARK-12806
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, SQL
>Affects Versions: 1.6.0
>Reporter: Feynman Liang
>
> Use cases exist where a specific index within a {{VectorUDT}} column of a 
> {{DataFrame}} is required. For example, we may be interested in extracting a 
> specific class probability from the {{probabilityCol}} of a 
> {{LogisticRegression}} to compute losses. However, if {{probability}} is a 
> column of {{df}} with type {{VectorUDT}}, the following code fails:
> {code}
> df.select("probability.0")
> AnalysisException: u"Can't extract value from probability"
> {code}
> thrown from 
> {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala}}.
> {{VectorUDT}} essentially wraps a {{StructType}}, hence one would expect it 
> to support value extraction Expressions in an analogous way.
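
Until such expressions are supported, a commonly used workaround is a small UDF. A minimal sketch, assuming Spark 2.x with the {{org.apache.spark.ml.linalg.Vector}} type and a DataFrame {{df}} that has a {{VectorUDT}} column named {{probability}}:

{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{lit, udf}

// Extract element i from a VectorUDT column via a UDF (a workaround, not the proposed SQL expression)
val vectorElement = udf((v: Vector, i: Int) => v(i))

// e.g. the probability of class 1
val withP1 = df.withColumn("p1", vectorElement(df("probability"), lit(1)))
{code}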



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21208) Ability to "setLocalProperty" from sc, in sparkR

2017-06-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062460#comment-16062460
 ] 

Hyukjin Kwon commented on SPARK-21208:
--

[~karup1990], are you working on this? I sent a similar PR with this before and 
I think I could work on this if you are work.

> Ability to "setLocalProperty" from sc, in sparkR
> 
>
> Key: SPARK-21208
> URL: https://issues.apache.org/jira/browse/SPARK-21208
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Karuppayya
>
> Checked the API 
> [documentation|https://spark.apache.org/docs/latest/api/R/index.html] for 
> sparkR.
> Was not able to find a way to *setLocalProperty* on sc.
> Need ability to *setLocalProperty* on sparkContext (similar to what is available 
> for pyspark, scala)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21208) Ability to "setLocalProperty" from sc, in sparkR

2017-06-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062460#comment-16062460
 ] 

Hyukjin Kwon edited comment on SPARK-21208 at 6/26/17 12:24 AM:


[~karup1990], are you working on this? I sent a similar PR with this before and 
I think I could work on this if you are not.


was (Author: hyukjin.kwon):
[~karup1990], are you working on this? I sent a similar PR with this before and 
I think I could work on this if you are work.

> Ability to "setLocalProperty" from sc, in sparkR
> 
>
> Key: SPARK-21208
> URL: https://issues.apache.org/jira/browse/SPARK-21208
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Karuppayya
>
> Checked the API 
> [documentation|https://spark.apache.org/docs/latest/api/R/index.html] for 
> sparkR.
> Was not able to find a way to *setLocalProperty* on sc.
> Need ability to *setLocalProperty* on sparkContext (similar to what is available 
> for pyspark, scala)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21204) RuntimeException with Set and Case Class in Spark 2.1.1

2017-06-25 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062488#comment-16062488
 ] 

Liang-Chi Hsieh commented on SPARK-21204:
-

[~maropu] I think it's more that users want to use Set as the domain object of 
the Dataset. Currently Spark SQL encoders don't support Set.
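
Until then, the Seq-based workaround mentioned in the description looks roughly like this. A minimal sketch, assuming a SparkSession named {{spark}} and that {{dbData}} has integer {{userId}} and {{friendId}} columns:

{code}
import org.apache.spark.sql.functions.collect_set
import spark.implicits._

// Keep the Dataset typed with Seq, which the built-in encoders do support
val ds = dbData
  .groupBy("userId")
  .agg(collect_set("friendId") as "friendIds")
  .as[(Int, Seq[Int])]

// Convert to Set only outside the Dataset, e.g. after collecting to the driver
val friendSets: Map[Int, Set[Int]] =
  ds.collect().map { case (id, ids) => id -> ids.toSet }.toMap
{code}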

> RuntimeException with Set and Case Class in Spark 2.1.1
> ---
>
> Key: SPARK-21204
> URL: https://issues.apache.org/jira/browse/SPARK-21204
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.1
>Reporter: Leo Romanovsky
>
> When attempting to produce a Dataset containing a Set, such as with:
> {code:java}
> dbData
>   .groupBy("userId")
>   .agg(functions.collect_set("friendId") as "friendIds")
>   .as[(Int, Set[Int])]
> {code}
> An exception occurs. This can be avoided by casting to a Seq, but sometimes 
> it makes more logical sense to have a Set, especially when using the collect_set 
> aggregation operation. Additionally, I am unable to write this Dataset to a 
> Cassandra table containing a Set column without first converting to an RDD.
> {code:java}
> [error] Exception in thread "main" java.lang.UnsupportedOperationException: 
> No Encoder found for Set[Int]
> [error] - field (class: "scala.collection.immutable.Set", name: "_2")
> [error] - root class: "scala.Tuple2"
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:602)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:596)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:587)
> [error]   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
> [error]   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
> [error]   at scala.collection.immutable.List.foreach(List.scala:381)
> [error]   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
> [error]   at scala.collection.immutable.List.flatMap(List.scala:344)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:587)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
> [error]   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
> [error]   at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
> [error]   at 
> org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:49)
> {code}
> I think the resolution to this might be similar to adding the Map type - 
> https://github.com/apache/spark/pull/16986



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21198) SparkSession catalog is terribly slow

2017-06-25 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062489#comment-16062489
 ] 

Liang-Chi Hsieh commented on SPARK-21198:
-

[~revolucion09] Any update?

Because it involves a public API, I'd prefer to be more careful. Until we make 
sure this issue is caused by {{CatalogImpl.listTables}}, I will hold the PR.

> SparkSession catalog is terribly slow
> -
>
> Key: SPARK-21198
> URL: https://issues.apache.org/jira/browse/SPARK-21198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that goes 
> through Hive data availability.
> In spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and 
> sqlContext.isCached() to go through Hive metastore information.
> Once migrated to spark 2.x we switched over to SparkSession.catalog instead, but 
> it turns out that both listDatabases() and listTables() take between 5 and 20 
> minutes, depending on the database, to return results, using operations such as 
> the following one:
> spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect
> which made the program unbearably slow to return a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I am 
> assuming this is going to be deprecated anytime soon?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21204) RuntimeException with Set and Case Class in Spark 2.1.1

2017-06-25 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062496#comment-16062496
 ] 

Takeshi Yamamuro commented on SPARK-21204:
--

ok, I'll check your pr later. Thanks!

> RuntimeException with Set and Case Class in Spark 2.1.1
> ---
>
> Key: SPARK-21204
> URL: https://issues.apache.org/jira/browse/SPARK-21204
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.1
>Reporter: Leo Romanovsky
>
> When attempting to produce a Dataset containing a Set, such as with:
> {code:java}
> dbData
>   .groupBy("userId")
>   .agg(functions.collect_set("friendId") as "friendIds")
>   .as[(Int, Set[Int])]
> {code}
> An exception occurs. This can be avoided by casting to a Seq, but sometimes 
> it makes more logical sense to have a Set, especially when using the collect_set 
> aggregation operation. Additionally, I am unable to write this Dataset to a 
> Cassandra table containing a Set column without first converting to an RDD.
> {code:java}
> [error] Exception in thread "main" java.lang.UnsupportedOperationException: 
> No Encoder found for Set[Int]
> [error] - field (class: "scala.collection.immutable.Set", name: "_2")
> [error] - root class: "scala.Tuple2"
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:602)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:596)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:587)
> [error]   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
> [error]   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
> [error]   at scala.collection.immutable.List.foreach(List.scala:381)
> [error]   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
> [error]   at scala.collection.immutable.List.flatMap(List.scala:344)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:587)
> [error]   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
> [error]   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
> [error]   at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
> [error]   at 
> org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:49)
> {code}
> I think the resolution to this might be similar to adding the Map type - 
> https://github.com/apache/spark/pull/16986



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21198) SparkSession catalog is terribly slow

2017-06-25 Thread Saif Addin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062498#comment-16062498
 ] 

Saif Addin commented on SPARK-21198:


[~viirya]
I'll take another look on Monday, but I am positive that the catalog retrieval 
is causing the delay. I'll also share a piece of the code showing how I do it, just 
in case. For now I am using spark.sqlContext.tableNames again.

> SparkSession catalog is terribly slow
> -
>
> Key: SPARK-21198
> URL: https://issues.apache.org/jira/browse/SPARK-21198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that goes 
> through Hive data availability.
> In spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and 
> sqlContext.isCached() to go through Hive metastore information.
> Once migrated to spark 2.x we switched over to SparkSession.catalog instead, but 
> it turns out that both listDatabases() and listTables() take between 5 and 20 
> minutes, depending on the database, to return results, using operations such as 
> the following one:
> spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect
> which made the program unbearably slow to return a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I am 
> assuming this is going to be deprecated anytime soon?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21198) SparkSession catalog is terribly slow

2017-06-25 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062499#comment-16062499
 ] 

Liang-Chi Hsieh commented on SPARK-21198:
-

Thanks, [~revolucion09]. Can you measure how long 
{{CatalogImpl.listTables}} takes to finish on your database, and how many 
tables it contains? That would make the case more convincing.
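
Something along these lines would give a rough number (a sketch, assuming a SparkSession named {{spark}}; replace the database name as needed):

{code}
// Crude wall-clock timing of the public Catalog API, which goes through CatalogImpl.listTables
val db = "default"
val start = System.nanoTime()
val tables = spark.catalog.listTables(db).collect()
val seconds = (System.nanoTime() - start) / 1e9
println(s"listTables($db): ${tables.length} tables in $seconds s")
{code}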

> SparkSession catalog is terribly slow
> -
>
> Key: SPARK-21198
> URL: https://issues.apache.org/jira/browse/SPARK-21198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that goes 
> through Hive data availability.
> In spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and 
> sqlContext.isCached() to go through Hive metastore information.
> Once migrated to spark 2.x we switched over to SparkSession.catalog instead, but 
> it turns out that both listDatabases() and listTables() take between 5 and 20 
> minutes, depending on the database, to return results, using operations such as 
> the following one:
> spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect
> which made the program unbearably slow to return a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I am 
> assuming this is going to be deprecated anytime soon?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21208) Ability to "setLocalProperty" from sc, in sparkR

2017-06-25 Thread Karuppayya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062523#comment-16062523
 ] 

Karuppayya commented on SPARK-21208:


[~hyukjin.kwon] I am currently not working on this.
Thanks for letting me know.
Can you please point me to the JIRA/PR?

> Ability to "setLocalProperty" from sc, in sparkR
> 
>
> Key: SPARK-21208
> URL: https://issues.apache.org/jira/browse/SPARK-21208
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Karuppayya
>
> Checked the API 
> [documentation|https://spark.apache.org/docs/latest/api/R/index.html] for 
> sparkR.
> Was not able to find a way to *setLocalProperty* on sc.
> Need ability to *setLocalProperty* on sparkContext (similar to what is available 
> for pyspark, scala)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21208) Ability to "setLocalProperty" from sc, in sparkR

2017-06-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062525#comment-16062525
 ] 

Hyukjin Kwon commented on SPARK-21208:
--

Here - SPARK-21149.

> Ability to "setLocalProperty" from sc, in sparkR
> 
>
> Key: SPARK-21208
> URL: https://issues.apache.org/jira/browse/SPARK-21208
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Karuppayya
>
> Checked the API 
> [documentation|https://spark.apache.org/docs/latest/api/R/index.html] for 
> sparkR.
> Was not able to find a way to *setLocalProperty* on sc.
> Need ability to *setLocalProperty* on sparkContext (similar to what is available 
> for pyspark, scala)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19866) Add local version of Word2Vec findSynonyms for spark.ml: Python API

2017-06-25 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-19866:
---

Assignee: Xin Ren

> Add local version of Word2Vec findSynonyms for spark.ml: Python API
> ---
>
> Key: SPARK-19866
> URL: https://issues.apache.org/jira/browse/SPARK-19866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Xin Ren
>Priority: Minor
>
> Add Python API for findSynonymsArray matching Scala API in linked JIRA.
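
For context, the Scala side that the Python API should mirror is used roughly like this. A sketch only; {{trainingDf}} is an assumed DataFrame with an array-of-strings column named "text", and the method name comes from the linked JIRA:

{code}
import org.apache.spark.ml.feature.Word2Vec

val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("vec")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(trainingDf)

// findSynonymsArray computes (word, similarity) pairs locally on the driver,
// unlike the DataFrame-returning findSynonyms
val synonyms = model.findSynonymsArray("spark", 5)
{code}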



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-06-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19104:


Assignee: (was: Apache Spark)

>  CompileException with Map and Case Class in Spark 2.1.0
> 
>
> Key: SPARK-19104
> URL: https://issues.apache.org/jira/browse/SPARK-19104
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Nils Grabbert
>
> The following code will run with Spark 2.0.2 but not with Spark 2.1.0:
> {code}
> case class InnerData(name: String, value: Int)
> case class Data(id: Int, param: Map[String, InnerData])
> val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 
> 100))))
> val ds   = spark.createDataset(data)
> {code}
> Exception:
> {code}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 63, Column 46: Expression 
> "ExternalMapToCatalyst_value_isNull1" is not an rvalue 
>   at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
>   at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
>  
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
>   at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984)
>  
>   at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
>   at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
>   at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
>   at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>  
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>  
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>  
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) 
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
>  
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
>  
>   at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229) 
>   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:196) 
>   at org.codehaus.common

[jira] [Commented] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-06-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062528#comment-16062528
 ] 

Apache Spark commented on SPARK-19104:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/18418

>  CompileException with Map and Case Class in Spark 2.1.0
> 
>
> Key: SPARK-19104
> URL: https://issues.apache.org/jira/browse/SPARK-19104
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Nils Grabbert
>
> The following code will run with Spark 2.0.2 but not with Spark 2.1.0:
> {code}
> case class InnerData(name: String, value: Int)
> case class Data(id: Int, param: Map[String, InnerData])
> val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 
> 100))))
> val ds   = spark.createDataset(data)
> {code}
> Exception:
> {code}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 63, Column 46: Expression 
> "ExternalMapToCatalyst_value_isNull1" is not an rvalue 
>   at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
>   at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
>  
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
>   at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984)
>  
>   at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
>   at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
>   at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
>   at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>  
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>  
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>  
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) 
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
>  
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
>  
>   at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluat

[jira] [Assigned] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-06-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19104:


Assignee: Apache Spark

>  CompileException with Map and Case Class in Spark 2.1.0
> 
>
> Key: SPARK-19104
> URL: https://issues.apache.org/jira/browse/SPARK-19104
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Nils Grabbert
>Assignee: Apache Spark
>
> The following code will run with Spark 2.0.2 but not with Spark 2.1.0:
> {code}
> case class InnerData(name: String, value: Int)
> case class Data(id: Int, param: Map[String, InnerData])
> val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 
> 100))))
> val ds   = spark.createDataset(data)
> {code}
> Exception:
> {code}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 63, Column 46: Expression 
> "ExternalMapToCatalyst_value_isNull1" is not an rvalue 
>   at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
>   at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
>  
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
>   at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984)
>  
>   at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
>   at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
>   at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
>   at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>  
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>  
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>  
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) 
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
>  
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
>  
>   at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229) 
>   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:196) 
>

[jira] [Commented] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-06-25 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062529#comment-16062529
 ] 

Liang-Chi Hsieh commented on SPARK-19104:
-

Just found this issue. I proposed a PR to fix it. There's RC5, so I'm not sure 
if this can be in 2.2.0.

>  CompileException with Map and Case Class in Spark 2.1.0
> 
>
> Key: SPARK-19104
> URL: https://issues.apache.org/jira/browse/SPARK-19104
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Nils Grabbert
>
> The following code will run with Spark 2.0.2 but not with Spark 2.1.0:
> {code}
> case class InnerData(name: String, value: Int)
> case class Data(id: Int, param: Map[String, InnerData])
> val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 
> 100))))
> val ds   = spark.createDataset(data)
> {code}
> Exception:
> {code}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 63, Column 46: Expression 
> "ExternalMapToCatalyst_value_isNull1" is not an rvalue 
>   at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
>   at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
>  
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
>   at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984)
>  
>   at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
>   at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
>   at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
>   at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>  
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>  
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>  
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) 
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
>  
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
>  
>   at org.codehaus.janino.ClassBodyEvaluator.cook(Class

[jira] [Comment Edited] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-06-25 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062529#comment-16062529
 ] 

Liang-Chi Hsieh edited comment on SPARK-19104 at 6/26/17 4:37 AM:
--

Just found this issue. I proposed a PR to fix it. There's RC5 of 2.2, so I'm 
not sure if this can be in 2.2.0.


was (Author: viirya):
Just found this issue. I proposed a PR to fix it. There's RC5, so I'm not sure 
if this can be in 2.2.0.

>  CompileException with Map and Case Class in Spark 2.1.0
> 
>
> Key: SPARK-19104
> URL: https://issues.apache.org/jira/browse/SPARK-19104
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Nils Grabbert
>
> The following code will run with Spark 2.0.2 but not with Spark 2.1.0:
> {code}
> case class InnerData(name: String, value: Int)
> case class Data(id: Int, param: Map[String, InnerData])
> val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 
> 100))))
> val ds   = spark.createDataset(data)
> {code}
> Exception:
> {code}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 63, Column 46: Expression 
> "ExternalMapToCatalyst_value_isNull1" is not an rvalue 
>   at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
>   at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
>  
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
>   at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984)
>  
>   at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
>   at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
>   at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
>   at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>  
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>  
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>  
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) 
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoad

[jira] [Commented] (SPARK-20213) DataFrameWriter operations do not show up in SQL tab

2017-06-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062534#comment-16062534
 ] 

Apache Spark commented on SPARK-20213:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/18419

> DataFrameWriter operations do not show up in SQL tab
> 
>
> Key: SPARK-20213
> URL: https://issues.apache.org/jira/browse/SPARK-20213
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Ryan Blue
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
> Attachments: Screen Shot 2017-05-03 at 5.00.19 PM.png
>
>
> In 1.6.1, {{DataFrame}} writes started via {{DataFrameWriter}} actions like 
> {{insertInto}} would show up in the SQL tab. In 2.0.0 and later, they no 
> longer do. The problem is that 2.0.0 and later no longer wrap execution with 
> {{SQLExecution.withNewExecutionId}}, which emits 
> {{SparkListenerSQLExecutionStart}}.
> Here are the relevant parts of the stack traces:
> {code:title=Spark 1.6.1}
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56)
> org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56)
>  => holding 
> Monitor(org.apache.spark.sql.hive.HiveContext$QueryExecution@424773807})
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:196)
> {code}
> {code:title=Spark 2.0.0}
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>  => holding Monitor(org.apache.spark.sql.execution.QueryExecution@490977924})
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:301)
> {code}
> I think this was introduced by 
> [54d23599|https://github.com/apache/spark/commit/54d23599]. The fix should be 
> to add withNewExecutionId to 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L610
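
For illustration, the kind of user call affected is simply a {{DataFrameWriter}} action; under 1.6.1 the job it launches appears in the SQL tab, under 2.0+ it does not (sketch, assuming an existing DataFrame {{df}} and Hive table {{some_table}}):

{code}
// Runs through DataFrameWriter.insertInto -> QueryExecution.toRdd; in 2.0.x this path
// is no longer wrapped in SQLExecution.withNewExecutionId, so no SQL-tab entry is emitted.
df.write.insertInto("some_table")
{code}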



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org