[jira] [Assigned] (SPARK-20670) Simplify FPGrowth transform
[ https://issues.apache.org/jira/browse/SPARK-20670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20670: Assignee: (was: Apache Spark) > Simplify FPGrowth transform > --- > > Key: SPARK-20670 > URL: https://issues.apache.org/jira/browse/SPARK-20670 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Priority: Minor > > As suggested by [~srowen] in https://github.com/apache/spark/pull/17130, the > transform code in FPGrowthModel can be simplified. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20670) Simplify FPGrowth transform
[ https://issues.apache.org/jira/browse/SPARK-20670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20670: Assignee: Apache Spark > Simplify FPGrowth transform > --- > > Key: SPARK-20670 > URL: https://issues.apache.org/jira/browse/SPARK-20670 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Assignee: Apache Spark >Priority: Minor > > As suggested by [~srowen] in https://github.com/apache/spark/pull/17130, the > transform code in FPGrowthModel can be simplified.
[jira] [Commented] (SPARK-20670) Simplify FPGrowth transform
[ https://issues.apache.org/jira/browse/SPARK-20670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002098#comment-16002098 ] Apache Spark commented on SPARK-20670: -- User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/17912 > Simplify FPGrowth transform > --- > > Key: SPARK-20670 > URL: https://issues.apache.org/jira/browse/SPARK-20670 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Priority: Minor > > As suggested by [~srowen] in https://github.com/apache/spark/pull/17130, the > transform code in FPGrowthModel can be simplified.
[jira] [Created] (SPARK-20671) Processing multiple kafka topics with single spark streaming context hangs on batchSubmitted.
amit kumar created SPARK-20671:
--
Summary: Processing multiple kafka topics with single spark streaming context hangs on batchSubmitted.
Key: SPARK-20671
URL: https://issues.apache.org/jira/browse/SPARK-20671
Project: Spark
Issue Type: Bug
Components: DStreams
Affects Versions: 2.0.0
Environment: Ubuntu
Reporter: amit kumar

{code}
object SparkMain extends App {
  System.setProperty("spark.cassandra.connection.host", "127.0.0.1")
  val conf = new SparkConf().setMaster("local[2]").setAppName("kafkaspark")
    .set("spark.streaming.concurrentJobs", "4")
  val sc = new SparkContext(conf)
  val ssc = new StreamingContext(sc, Seconds(5))
  val sqlContext = new SQLContext(sc)
  val host = "localhost:2181"
  val topicList = List("test", "fb")

  topicList.foreach { topic =>
    val lines = KafkaUtils.createStream(ssc, host, topic, Map(topic -> 1)).map(_._2)
    //configureStream(topic, lines)
    lines.foreachRDD(rdd =>
      rdd.map(test(_)).saveToCassandra("test", "rawdata", SomeColumns("key")))
  }

  ssc.addStreamingListener(new StreamingListener {
    override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
      System.out.println("Batch completed, Total delay :" + batchCompleted.batchInfo.totalDelay.get.toString + " ms")
    }
    override def onReceiverStarted(receiverStarted: StreamingListenerReceiverStarted): Unit = {
      println("inside onReceiverStarted")
    }
    override def onReceiverError(receiverError: StreamingListenerReceiverError): Unit = {
      println("inside onReceiverError")
    }
    override def onReceiverStopped(receiverStopped: StreamingListenerReceiverStopped): Unit = {
      println("inside onReceiverStopped")
    }
    override def onBatchSubmitted(batchSubmitted: StreamingListenerBatchSubmitted): Unit = {
      println("inside onBatchSubmitted")
    }
    override def onBatchStarted(batchStarted: StreamingListenerBatchStarted): Unit = {
      println("inside onBatchStarted")
    }
  })

  ssc.start()
  println("===")
  ssc.awaitTermination()
}

case class test(key: String)
{code}

If I put any one of the topics at a time then each topic works. But when the topic list has more than one topic, after getting the DStream from the kafka topic, it keeps printing "inside onBatchSubmitted". Thanks in advance.
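A likely explanation, not confirmed in the issue itself: the code above creates one receiver per topic, and under `setMaster("local[2]")` two receivers can occupy both local cores, leaving none for batch processing, so batches are submitted but never started. A minimal sketch of the usual workaround is to pass all topics to a single receiver in one map (the `createStream` call and the `"kafkaspark-group"` group id shown in the comment are assumptions for illustration):

```scala
// Sketch: build one topic -> partition-count map for a single receiver,
// instead of calling KafkaUtils.createStream once per topic. With Spark's
// receiver-based Kafka API this map would then be passed as:
//   KafkaUtils.createStream(ssc, host, "kafkaspark-group", topicMap)
val topicList = List("test", "fb")
val topicMap: Map[String, Int] = topicList.map(topic => topic -> 1).toMap
```

Alternatively, increasing the core count (e.g. `local[4]`) leaves cores free for processing alongside multiple receivers.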
[jira] [Assigned] (SPARK-20668) Modify ScalaUDF to handle nullability.
[ https://issues.apache.org/jira/browse/SPARK-20668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20668: Assignee: Apache Spark > Modify ScalaUDF to handle nullability. > -- > > Key: SPARK-20668 > URL: https://issues.apache.org/jira/browse/SPARK-20668 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark > > When registering a Scala UDF, we can know whether the UDF will return a nullable > value or not. {{ScalaUDF}} and related classes should handle the nullability.
[jira] [Assigned] (SPARK-20668) Modify ScalaUDF to handle nullability.
[ https://issues.apache.org/jira/browse/SPARK-20668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20668: Assignee: (was: Apache Spark) > Modify ScalaUDF to handle nullability. > -- > > Key: SPARK-20668 > URL: https://issues.apache.org/jira/browse/SPARK-20668 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin > > When registering a Scala UDF, we can know whether the UDF will return a nullable > value or not. {{ScalaUDF}} and related classes should handle the nullability.
[jira] [Commented] (SPARK-20668) Modify ScalaUDF to handle nullability.
[ https://issues.apache.org/jira/browse/SPARK-20668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002090#comment-16002090 ] Apache Spark commented on SPARK-20668: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/17911 > Modify ScalaUDF to handle nullability. > -- > > Key: SPARK-20668 > URL: https://issues.apache.org/jira/browse/SPARK-20668 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin > > When registering a Scala UDF, we can know whether the UDF will return a nullable > value or not. {{ScalaUDF}} and related classes should handle the nullability.
[jira] [Created] (SPARK-20670) Simplify FPGrowth transform
yuhao yang created SPARK-20670: -- Summary: Simplify FPGrowth transform Key: SPARK-20670 URL: https://issues.apache.org/jira/browse/SPARK-20670 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0 Reporter: yuhao yang Priority: Minor As suggested by [~srowen] in https://github.com/apache/spark/pull/17130, the transform code in FPGrowthModel can be simplified.
[jira] [Assigned] (SPARK-20669) LogisticRegression family should be case insensitive
[ https://issues.apache.org/jira/browse/SPARK-20669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20669: Assignee: Apache Spark > LogisticRegression family should be case insensitive > - > > Key: SPARK-20669 > URL: https://issues.apache.org/jira/browse/SPARK-20669 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Trivial > > {{LogisticRegression}} family should be case insensitive
[jira] [Assigned] (SPARK-20669) LogisticRegression family should be case insensitive
[ https://issues.apache.org/jira/browse/SPARK-20669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20669: Assignee: (was: Apache Spark) > LogisticRegression family should be case insensitive > - > > Key: SPARK-20669 > URL: https://issues.apache.org/jira/browse/SPARK-20669 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: zhengruifeng >Priority: Trivial > > {{LogisticRegression}} family should be case insensitive
[jira] [Commented] (SPARK-20669) LogisticRegression family should be case insensitive
[ https://issues.apache.org/jira/browse/SPARK-20669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002086#comment-16002086 ] Apache Spark commented on SPARK-20669: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/17910 > LogisticRegression family should be case insensitive > - > > Key: SPARK-20669 > URL: https://issues.apache.org/jira/browse/SPARK-20669 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: zhengruifeng >Priority: Trivial > > {{LogisticRegression}} family should be case insensitive
[jira] [Created] (SPARK-20669) LogisticRegression family should be case insensitive
zhengruifeng created SPARK-20669: Summary: LogisticRegression family should be case insensitive Key: SPARK-20669 URL: https://issues.apache.org/jira/browse/SPARK-20669 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0 Reporter: zhengruifeng Priority: Trivial {{LogisticRegression}} family should be case insensitive
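The shape of the requested change can be sketched in plain Scala: normalize the `family` value before validating it, so that casing no longer matters. This is an illustrative sketch, not the actual patch; the supported values mirror LogisticRegression's documented families ("auto", "binomial", "multinomial").

```scala
// Sketch: accept "binomial", "Binomial", "BINOMIAL", etc. by lower-casing
// the user-supplied family before checking it against the supported set.
val supportedFamilies = Set("auto", "binomial", "multinomial")

def validateFamily(family: String): String = {
  val normalized = family.toLowerCase
  require(supportedFamilies.contains(normalized),
    s"Unsupported family: $family")
  normalized
}
```

In ML params this kind of check typically lives in the param's validator, so an invalid value fails fast at `set` time rather than at `fit` time.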
[jira] [Commented] (SPARK-20661) SparkR tableNames() test fails
[ https://issues.apache.org/jira/browse/SPARK-20661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002082#comment-16002082 ] Apache Spark commented on SPARK-20661: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/17909 > SparkR tableNames() test fails > -- > > Key: SPARK-20661 > URL: https://issues.apache.org/jira/browse/SPARK-20661 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki >Assignee: Hossein Falaki > Labels: test > Fix For: 2.2.0 > > > Due to prior state created by other test cases, testing {{tableNames()}} is > failing in master. > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2846/console
[jira] [Created] (SPARK-20668) Modify ScalaUDF to handle nullability.
Takuya Ueshin created SPARK-20668: - Summary: Modify ScalaUDF to handle nullability. Key: SPARK-20668 URL: https://issues.apache.org/jira/browse/SPARK-20668 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Takuya Ueshin When registering a Scala UDF, we can know whether the UDF will return a nullable value or not. {{ScalaUDF}} and related classes should handle the nullability.
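The core observation can be illustrated in plain Scala, independent of Spark: the UDF's Scala return type already tells us at registration time whether the result can ever be null. This is a sketch of the idea only; `resultNullable` is a hypothetical helper, not part of the actual patch.

```scala
import scala.reflect.runtime.universe._

// Sketch: a value type (Int, Double, Boolean, ...) can never be null,
// while any reference type (String, Option[_], case classes, ...) can be.
// A ScalaUDF-like wrapper could carry this flag instead of always
// assuming nullable = true.
def resultNullable[T: TypeTag]: Boolean = !(typeOf[T] <:< typeOf[AnyVal])
```

With such a flag, a UDF declared as `(Int, Int) => Int` could be marked non-nullable in the plan, which lets the optimizer skip null checks it would otherwise insert.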
[jira] [Assigned] (SPARK-20667) Cleanup the cataloged metadata after completing the package of sql/core and sql/hive
[ https://issues.apache.org/jira/browse/SPARK-20667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20667: Assignee: Apache Spark (was: Xiao Li) > Cleanup the cataloged metadata after completing the package of sql/core and > sql/hive > > > Key: SPARK-20667 > URL: https://issues.apache.org/jira/browse/SPARK-20667 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Apache Spark > > So far, we do not drop all the cataloged tables after each package. > Sometimes, we might hit strange test case errors because the previous test > suite did not drop the tables/functions/database. At least, we can first > clean up the environment when completing the package of sql/core and sql/hive.
[jira] [Assigned] (SPARK-20667) Cleanup the cataloged metadata after completing the package of sql/core and sql/hive
[ https://issues.apache.org/jira/browse/SPARK-20667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20667: Assignee: Xiao Li (was: Apache Spark) > Cleanup the cataloged metadata after completing the package of sql/core and > sql/hive > > > Key: SPARK-20667 > URL: https://issues.apache.org/jira/browse/SPARK-20667 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > So far, we do not drop all the cataloged tables after each package. > Sometimes, we might hit strange test case errors because the previous test > suite did not drop the tables/functions/database. At least, we can first > clean up the environment when completing the package of sql/core and sql/hive.
[jira] [Commented] (SPARK-20667) Cleanup the cataloged metadata after completing the package of sql/core and sql/hive
[ https://issues.apache.org/jira/browse/SPARK-20667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002035#comment-16002035 ] Apache Spark commented on SPARK-20667: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/17908 > Cleanup the cataloged metadata after completing the package of sql/core and > sql/hive > > > Key: SPARK-20667 > URL: https://issues.apache.org/jira/browse/SPARK-20667 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > So far, we do not drop all the cataloged tables after each package. > Sometimes, we might hit strange test case errors because the previous test > suite did not drop the tables/functions/database. At least, we can first > clean up the environment when completing the package of sql/core and sql/hive.
[jira] [Updated] (SPARK-20667) Cleanup the cataloged metadata after completing the package of sql/core and sql/hive
[ https://issues.apache.org/jira/browse/SPARK-20667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20667: Description: So far, we do not drop all the cataloged tables after each package. Sometimes, we might hit strange test case errors because the previous test suite did not drop the tables/functions/database. At least, we can first clean up the environment when completing the package of sql/core and sql/hive. (was: So far, we do not drop all the cataloged tables after each package. Sometimes, we might hit strange test case errors due to the previous test suite. At least, we can first clean up the environment when completing the package of sql/core and sql/hive) > Cleanup the cataloged metadata after completing the package of sql/core and > sql/hive > > > Key: SPARK-20667 > URL: https://issues.apache.org/jira/browse/SPARK-20667 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > So far, we do not drop all the cataloged tables after each package. > Sometimes, we might hit strange test case errors because the previous test > suite did not drop the tables/functions/database. At least, we can first > clean up the environment when completing the package of sql/core and sql/hive.
[jira] [Updated] (SPARK-20667) Cleanup the cataloged metadata after completing the package of sql/core and sql/hive
[ https://issues.apache.org/jira/browse/SPARK-20667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20667: Description: So far, we do not drop all the cataloged tables after each package. Sometimes, we might hit strange test case errors due to the previous test suite. At least, we can first clean up the environment when completing the package of sql/core and sql/hive (was: So far, we did not drop all the cataloged tables after each package. Sometimes, we might hit strange test case errors due to the previous test suite. At least, we can first clean up the environment when completing the package of sql/core and sql/hive) > Cleanup the cataloged metadata after completing the package of sql/core and > sql/hive > > > Key: SPARK-20667 > URL: https://issues.apache.org/jira/browse/SPARK-20667 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > So far, we do not drop all the cataloged tables after each package. > Sometimes, we might hit strange test case errors due to the previous test > suite. At least, we can first clean up the environment when completing the > package of sql/core and sql/hive
[jira] [Created] (SPARK-20667) Cleanup the cataloged metadata after completing the package of sql/core and sql/hive
Xiao Li created SPARK-20667: --- Summary: Cleanup the cataloged metadata after completing the package of sql/core and sql/hive Key: SPARK-20667 URL: https://issues.apache.org/jira/browse/SPARK-20667 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.2.0 Reporter: Xiao Li Assignee: Xiao Li So far, we did not drop all the cataloged tables after each package. Sometimes, we might hit strange test case errors due to the previous test suite. At least, we can first clean up the environment when completing the package of sql/core and sql/hive
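The cleanup idea can be sketched in plain Scala, independent of Spark's catalog API: a shared-session test harness records every table a suite creates and drops them all once the package finishes, so state never leaks into the next suite. `CatalogCleanup` is a hypothetical helper for illustration, not the actual patch.

```scala
// Sketch: track catalog objects created by test suites and clear them in
// one pass at the end of the package. In Spark this cleanup step would
// issue `DROP TABLE IF EXISTS ...` (and the equivalents for functions and
// databases) per recorded entry; here we just return what would be dropped.
object CatalogCleanup {
  private val createdTables = scala.collection.mutable.Set.empty[String]

  def register(table: String): Unit = createdTables += table

  def cleanupAll(): Seq[String] = {
    val dropped = createdTables.toSeq.sorted
    createdTables.clear()
    dropped
  }
}
```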
[jira] [Updated] (SPARK-20666) Flaky test - SparkListenerBus randomly failing on Windows
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20666: - Summary: Flaky test - SparkListenerBus randomly failing on Windows (was: Flaky test - random ml test failure on Windows) > Flaky test - SparkListenerBus randomly failing on Windows > - > > Key: SPARK-20666 > URL: https://issues.apache.org/jira/browse/SPARK-20666 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core, SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung > > seeing quite a bit of this on AppVeyor, aka Windows only > {code} > Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 159454 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261) > at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88) > at > org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216) > at > 
org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77) > 1 > MLlib recommendation algorithms: Spark package found in SPARK_HOME: > C:\projects\spark\bin\.. 
> {code} > {code} > java.lang.IllegalStateException: SparkContext has been shutdown > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.collect(RDD.scala:935) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) > at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2907) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2906) >
[jira] [Updated] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError on Windows
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20666: - Summary: Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError on Windows (was: Flaky test - SparkListenerBus randomly failing on Windows) > Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError > on Windows > -- > > Key: SPARK-20666 > URL: https://issues.apache.org/jira/browse/SPARK-20666 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core, SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung > > seeing quite a bit of this on AppVeyor, aka Windows only, always only when > running ML tests, it seems > {code} > Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 159454 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261) > at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88) > at > org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at 
scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77) > 1 > MLlib recommendation algorithms: Spark package found in SPARK_HOME: > C:\projects\spark\bin\.. 
> {code} > {code} > java.lang.IllegalStateException: SparkContext has been shutdown > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.collect(RDD.scala:935) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) > at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2907) > at >
[jira] [Updated] (SPARK-20666) Flaky test - SparkListenerBus randomly failing on Windows
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20666: - Description: seeing quite a bit of this on AppVeyor, aka Windows only, always only when running ML tests, it seems {code} Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: Attempted to access garbage collected accumulator 159454 at org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265) at org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261) at scala.Option.map(Option.scala:146) at org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261) at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88) at org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67) at org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) at org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216) at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36) 
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268) at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77) 1 MLlib recommendation algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. {code} {code} java.lang.IllegalStateException: SparkContext has been shutdown at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.collect(RDD.scala:935) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923) at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2907) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) at 
org.apache.spark.sql.Dataset.withAction(Dataset.scala:2906) at org.apache.spark.sql.Dataset.collect(Dataset.scala:2474) at org.apache.spark.sql.api.r.SQLUtils$.dfToCols(SQLUtils.scala:173) at org.apache.spark.sql.api.r.SQLUtils.dfToCols(SQLUtils.scala) at sun.reflect.GeneratedMethodAccessor104.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at
[jira] [Updated] (SPARK-20666) Flaky test - SparkListenerBus randomly failing on Windows
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20666: - Component/s: Spark Core > Flaky test - SparkListenerBus randomly failing on Windows > - > > Key: SPARK-20666 > URL: https://issues.apache.org/jira/browse/SPARK-20666 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core, SparkR >Affects Versions: 2.3.0 >Reporter: Felix Cheung > > seeing quite a bit of this on AppVeyor, aka Windows only > {code} > Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 159454 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261) > at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88) > at > org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > 
org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77) > 1 > MLlib recommendation algorithms: Spark package found in SPARK_HOME: > C:\projects\spark\bin\.. 
> {code} > {code} > java.lang.IllegalStateException: SparkContext has been shutdown > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.collect(RDD.scala:935) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) > at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2907) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2906) > at org.apache.spark.sql.Dataset.collect(Dataset.scala:2474) > at
[jira] [Updated] (SPARK-20666) Flaky test - random ml test failure on Windows
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20666: - Description: seeing quite a bit of this on AppVeyor, aka Windows only {code} Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: Attempted to access garbage collected accumulator 159454 at org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265) at org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261) at scala.Option.map(Option.scala:146) at org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261) at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88) at org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67) at org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) at org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216) at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36) at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268) at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77) 1 MLlib recommendation algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. {code} {code} java.lang.IllegalStateException: SparkContext has been shutdown at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.collect(RDD.scala:935) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923) at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2907) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) at 
org.apache.spark.sql.Dataset.withAction(Dataset.scala:2906) at org.apache.spark.sql.Dataset.collect(Dataset.scala:2474) at org.apache.spark.sql.api.r.SQLUtils$.dfToCols(SQLUtils.scala:173) at org.apache.spark.sql.api.r.SQLUtils.dfToCols(SQLUtils.scala) at sun.reflect.GeneratedMethodAccessor104.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:167) at
[jira] [Created] (SPARK-20666) Flaky test - random ml test failure on Windows
Felix Cheung created SPARK-20666: Summary: Flaky test - random ml test failure on Windows Key: SPARK-20666 URL: https://issues.apache.org/jira/browse/SPARK-20666 Project: Spark Issue Type: Bug Components: ML, SparkR Affects Versions: 2.3.0 Reporter: Felix Cheung -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20666) Flaky test - random ml test failure on Windows
[ https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20666: - Description: seeing quite a bit of this on AppVeyor, aka Windows only {code} java.lang.IllegalStateException: SparkContext has been shutdown at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.collect(RDD.scala:935) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923) at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474) at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2907) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2906) at org.apache.spark.sql.Dataset.collect(Dataset.scala:2474) at org.apache.spark.sql.api.r.SQLUtils$.dfToCols(SQLUtils.scala:173) at org.apache.spark.sql.api.r.SQLUtils.dfToCols(SQLUtils.scala) at sun.reflect.GeneratedMethodAccessor104.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:167) at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:108) at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:40) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336) at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
[jira] [Commented] (SPARK-19581) running NaiveBayes model with 0 features can crash the executor with D rorreGEMV
[ https://issues.apache.org/jira/browse/SPARK-19581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001987#comment-16001987 ] Yan Facai (颜发才) commented on SPARK-19581: - [~barrybecker4] Could you give a code sample? > running NaiveBayes model with 0 features can crash the executor with D > rorreGEMV > > > Key: SPARK-19581 > URL: https://issues.apache.org/jira/browse/SPARK-19581 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 > Environment: Spark development or standalone mode on Windows or Linux. >Reporter: Barry Becker >Priority: Minor > > The severity of this bug is high (because nothing should cause Spark to crash > like this) but the priority may be low (because there is an easy workaround). > In our application, a user can select features and a target to run the > NaiveBayes inducer. If columns have too many values or all one value, they > will be removed before we call the inducer to create the model. As a result, > there are some cases where all the features may get removed. When this > happens, executors will crash and get restarted (if on a cluster) or Spark > will crash and need to be manually restarted (if in development mode). > It looks like NaiveBayes uses BLAS, and BLAS does not handle this case well > when it is encountered. It emits this vague error: > ** On entry to DGEMV parameter number 6 had an illegal value > and terminates. > My code looks like this: > {code} >val predictions = model.transform(testData) // Make predictions > // figure out how many were correctly predicted > val numCorrect = predictions.filter(new Column(actualTarget) === new > Column(PREDICTION_LABEL_COLUMN)).count() > val numIncorrect = testRowCount - numCorrect > {code} > The failure is at the line that does the count, but it is not the count that > causes the problem, it is the model.transform step (where the model contains > the NaiveBayes classifier). 
> Here is the stack trace (in development mode): > {code} > [2017-02-13 06:28:39,946] TRACE evidence.EvidenceVizModel$ [] > [akka://JobServer/user/context-supervisor/sql-context] - done making > predictions in 232 > ** On entry to DGEMV parameter number 6 had an illegal value > ** On entry to DGEMV parameter number 6 had an illegal value > ** On entry to DGEMV parameter number 6 had an illegal value > [2017-02-13 06:28:40,506] ERROR .scheduler.LiveListenerBus [] > [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has > already stopped! Dropping event SparkListenerSQLExecutionEnd(9,1486996120505) > [2017-02-13 06:28:40,506] ERROR .scheduler.LiveListenerBus [] > [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has > already stopped! Dropping event > SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@1f6c4a29) > [2017-02-13 06:28:40,508] ERROR .scheduler.LiveListenerBus [] > [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has > already stopped! 
Dropping event > SparkListenerJobEnd(12,1486996120507,JobFailed(org.apache.spark.SparkException: > Job 12 cancelled because SparkContext was shut down)) > [2017-02-13 06:28:40,509] ERROR .jobserver.JobManagerActor [] > [akka://JobServer/user/context-supervisor/sql-context] - Got Throwable > org.apache.spark.SparkException: Job 12 cancelled because SparkContext was > shut down > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:808) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:806) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:78) > at > org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:806) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1668) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83) > at > org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1587) > at > org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1826) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1825) > at > org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:581) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at >
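The easy workaround the reporter mentions is to check that at least one feature column survives filtering before calling transform, so that a zero-dimension GEMV never reaches native BLAS. A minimal sketch of that guard, using NumPy as a stand-in (the function name and arguments are hypothetical, not Spark's API):

```python
import numpy as np

def safe_transform(theta, features):
    # Hypothetical guard: refuse to score rows when every feature column
    # has been filtered out, instead of letting a zero-dimension GEMV call
    # reach native BLAS, which aborts the process with the DGEMV error above.
    if features.ndim != 2 or features.shape[1] == 0:
        raise ValueError("no feature columns left; cannot apply NaiveBayes model")
    # NaiveBayes scoring is essentially a matrix product of the feature rows
    # with the per-class log-probability matrix theta.
    return features @ theta.T
```

Raising a normal exception here surfaces the configuration problem to the caller instead of crashing the executor.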
[jira] [Assigned] (SPARK-7856) Scalable PCA implementation for tall and fat matrices
[ https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7856: --- Assignee: Apache Spark > Scalable PCA implementation for tall and fat matrices > - > > Key: SPARK-7856 > URL: https://issues.apache.org/jira/browse/SPARK-7856 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Tarek Elgamal >Assignee: Apache Spark > > Currently the PCA implementation has a limitation of fitting d^2 > covariance/Gramian matrix entries in memory (d is the number of > columns/dimensions of the matrix). We often need only the largest k principal > components. To make PCA really scalable, I suggest an implementation where > the memory usage is proportional to the number of principal components k rather than > the full dimensionality d. > I suggest adopting the solution described in this paper published in > SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). > The paper offers an implementation for Probabilistic PCA (PPCA) which has > lower memory and time complexity and could potentially scale to tall and fat > matrices rather than only the tall and skinny matrices supported by the > current PCA implementation. > Probabilistic PCA could potentially be added to the set of algorithms > supported by MLlib; it does not necessarily replace the old PCA > implementation. > A PPCA implementation is included in Matlab's Statistics and Machine Learning > Toolbox (http://www.mathworks.com/help/stats/ppca.html) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7856) Scalable PCA implementation for tall and fat matrices
[ https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7856: --- Assignee: (was: Apache Spark) > Scalable PCA implementation for tall and fat matrices > - > > Key: SPARK-7856 > URL: https://issues.apache.org/jira/browse/SPARK-7856 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Tarek Elgamal > > Currently the PCA implementation has a limitation of fitting d^2 > covariance/Gramian matrix entries in memory (d is the number of > columns/dimensions of the matrix). We often need only the largest k principal > components. To make PCA really scalable, I suggest an implementation where > the memory usage is proportional to the number of principal components k rather than > the full dimensionality d. > I suggest adopting the solution described in this paper published in > SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). > The paper offers an implementation for Probabilistic PCA (PPCA) which has > lower memory and time complexity and could potentially scale to tall and fat > matrices rather than only the tall and skinny matrices supported by the > current PCA implementation. > Probabilistic PCA could potentially be added to the set of algorithms > supported by MLlib; it does not necessarily replace the old PCA > implementation. > A PPCA implementation is included in Matlab's Statistics and Machine Learning > Toolbox (http://www.mathworks.com/help/stats/ppca.html) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7856) Scalable PCA implementation for tall and fat matrices
[ https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001976#comment-16001976 ] Apache Spark commented on SPARK-7856: - User 'ghoto' has created a pull request for this issue: https://github.com/apache/spark/pull/17907 > Scalable PCA implementation for tall and fat matrices > - > > Key: SPARK-7856 > URL: https://issues.apache.org/jira/browse/SPARK-7856 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Tarek Elgamal > > Currently the PCA implementation has a limitation of fitting d^2 > covariance/Gramian matrix entries in memory (d is the number of > columns/dimensions of the matrix). We often need only the largest k principal > components. To make PCA really scalable, I suggest an implementation where > the memory usage is proportional to the number of principal components k rather than > the full dimensionality d. > I suggest adopting the solution described in this paper published in > SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). > The paper offers an implementation for Probabilistic PCA (PPCA) which has > lower memory and time complexity and could potentially scale to tall and fat > matrices rather than only the tall and skinny matrices supported by the > current PCA implementation. > Probabilistic PCA could potentially be added to the set of algorithms > supported by MLlib; it does not necessarily replace the old PCA > implementation. > A PPCA implementation is included in Matlab's Statistics and Machine Learning > Toolbox (http://www.mathworks.com/help/stats/ppca.html) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
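The memory bottleneck described in the issue comes from materializing the d x d covariance matrix. The covariance-free alternative (take the thin SVD of the centered data directly) can be sketched outside Spark with NumPy; this illustrates the idea only, not the proposed MLlib implementation:

```python
import numpy as np

def pca_via_svd(X, k):
    # Center the data, then take the thin SVD of (X - mu) directly. This
    # never forms the d x d covariance/Gramian matrix, which is the memory
    # bottleneck for matrices with many columns ("fat" matrices).
    mu = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    components = Vt[:k]                          # top-k principal directions
    explained_variance = (s[:k] ** 2) / (X.shape[0] - 1)
    return components, explained_variance
```

The singular values of the centered matrix satisfy sigma_i^2 / (n - 1) = lambda_i, the eigenvalues of the sample covariance, so the result matches a covariance-based PCA without the d^2 memory cost.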
[jira] [Assigned] (SPARK-20665) Spark-sql, "Bround" function return NULL
[ https://issues.apache.org/jira/browse/SPARK-20665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20665: Assignee: (was: Apache Spark) > Spark-sql, "Bround" function return NULL > > > Key: SPARK-20665 > URL: https://issues.apache.org/jira/browse/SPARK-20665 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: liuxian > > >select bround(12.3, 2); > >NULL > For this case, the expected result is 12.3, but it is null -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20665) Spark-sql, "Bround" function return NULL
[ https://issues.apache.org/jira/browse/SPARK-20665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20665: Assignee: Apache Spark > Spark-sql, "Bround" function return NULL > > > Key: SPARK-20665 > URL: https://issues.apache.org/jira/browse/SPARK-20665 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: liuxian >Assignee: Apache Spark > > >select bround(12.3, 2); > >NULL > For this case, the expected result is 12.3, but it is null -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20665) Spark-sql, "Bround" function return NULL
[ https://issues.apache.org/jira/browse/SPARK-20665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001970#comment-16001970 ] Apache Spark commented on SPARK-20665: -- User '10110346' has created a pull request for this issue: https://github.com/apache/spark/pull/17906 > Spark-sql, "Bround" function return NULL > > > Key: SPARK-20665 > URL: https://issues.apache.org/jira/browse/SPARK-20665 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: liuxian > > >select bround(12.3, 2); > >NULL > For this case, the expected result is 12.3, but it is null -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
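For reference, bround is documented as HALF_EVEN ("banker's") rounding, so bround(12.3, 2) should indeed return 12.3 rather than NULL. A minimal sketch of the expected semantics in plain Python (an illustration of the rounding mode, not Spark's implementation):

```python
from decimal import Decimal, ROUND_HALF_EVEN

def bround_ref(value, scale=0):
    # HALF_EVEN rounds ties to the nearest even digit: 2.5 -> 2, 3.5 -> 4.
    # Non-tie values, like 12.3 at scale 2, are unchanged.
    quantum = Decimal(1).scaleb(-scale)
    return float(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_EVEN))
```

Going through Decimal avoids binary floating-point artifacts when forming the tie-breaking digit.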
[jira] [Commented] (SPARK-20661) SparkR tableNames() test fails
[ https://issues.apache.org/jira/browse/SPARK-20661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001966#comment-16001966 ] Apache Spark commented on SPARK-20661: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/17905 > SparkR tableNames() test fails > -- > > Key: SPARK-20661 > URL: https://issues.apache.org/jira/browse/SPARK-20661 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki >Assignee: Hossein Falaki > Labels: test > Fix For: 2.2.0 > > > Due to prior state created by other test cases, testing {{tableNames()}} is > failing in master. > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2846/console -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20665) Spark-sql, "Bround" function return NULL
[ https://issues.apache.org/jira/browse/SPARK-20665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-20665: Description: >select bround(12.3, 2); >NULL For this case, the expected result is 12.3, but it is null was: >select bround(12.3, 2); >NULL For this case, we expected the result is 12.3, but it is null > Spark-sql, "Bround" function return NULL > > > Key: SPARK-20665 > URL: https://issues.apache.org/jira/browse/SPARK-20665 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: liuxian > > >select bround(12.3, 2); > >NULL > For this case, the expected result is 12.3, but it is null -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20665) Spark-sql, "Bround" function return NULL
[ https://issues.apache.org/jira/browse/SPARK-20665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-20665: Description: >select bround(12.3, 2); >NULL For this case, we expected the result is 12.3, but it is null was: >select bround(12.3, 2); >NULL For this case, we expected de result is 12.3, but it is null > Spark-sql, "Bround" function return NULL > > > Key: SPARK-20665 > URL: https://issues.apache.org/jira/browse/SPARK-20665 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: liuxian > > >select bround(12.3, 2); > >NULL > For this case, we expected the result is 12.3, but it is null -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20665) Spark-sql, "Bround" function return NULL
liuxian created SPARK-20665: --- Summary: Spark-sql, "Bround" function return NULL Key: SPARK-20665 URL: https://issues.apache.org/jira/browse/SPARK-20665 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: liuxian >select bround(12.3, 2); >NULL For this case, we expected de result is 12.3, but it is null -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001924#comment-16001924 ] sandflee commented on SPARK-18278: -- As a Spark user, what benefits would we get, besides being able to co-run Docker apps and Spark apps? > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core >Reporter: Erik Erlandson > Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executor lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
[ https://issues.apache.org/jira/browse/SPARK-20658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001917#comment-16001917 ] Saisai Shao commented on SPARK-20658: - It mainly depends on YARN to measure the failure validity interval and to decide what counts as an AM failure; Spark just proxies this parameter to YARN. So if there's any unexpected behavior, I think we should investigate the YARN side to see the actual behavior. > spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect > > > Key: SPARK-20658 > URL: https://issues.apache.org/jira/browse/SPARK-20658 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Paul Jones >Priority: Minor > > I'm running a job in YARN cluster mode using > `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both > spark-default.conf and in my spark-submit command. (This flag shows up in the > environment tab of the Spark history server, so it seems that it's specified > correctly). > However, I just had a job die with four AM failures (three of the four > failures were over an hour apart). So, I'm confused as to what could be going > on. I haven't figured out the cause of the individual failures, so is it > possible that we always count certain types of failures? E.g. do jobs that are > killed due to memory issues always count? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
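As the comment notes, Spark only forwards this setting; YARN's ResourceManager does the actual bookkeeping of AM failures within the sliding window. A hedged sketch of how the two related settings are typically passed (the interval value, attempt count, and job file name are examples only):

```shell
# Both are Spark-on-YARN configs that Spark hands through to YARN;
# the failure counting itself happens in the ResourceManager.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.maxAppAttempts=4 \
  --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
  my_job.py
```

One thing worth checking on the YARN side: as far as I know, the attempt-failure validity interval relies on YARN support introduced in Hadoop 2.6, so an older cluster would silently ignore it.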
[jira] [Comment Edited] (SPARK-7856) Scalable PCA implementation for tall and fat matrices
[ https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001551#comment-16001551 ] Ignacio Bermudez Corrales edited comment on SPARK-7856 at 5/9/17 1:59 AM: -- Apart from implementing Probabilistic PCA (which in my view is a different algorithm worth implementing as a separate algorithm), there are some issues with the current (2.3) implementation of RowMatrix.computePrincipalComponentsAndExplainedVariance that affect PCA training. In my opinion the big problem with the current implementation is line 387 of RowMatrix.scala, which causes OutOfMemory exceptions for these kinds of matrices, as it computes the covariance as a local Breeze dense matrix. val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]] The implementation materializes a dense local Breeze covariance matrix, which is not needed to compute either the principal components or the explained variance. In particular, RowMatrix provides a more optimized SVD decomposition, so the principal components and variance can be derived from that decomposition by computing (X - µ).computeSVD(k, false, 0). This leads to a more scalable implementation of PCA for tall and fat matrices. If this ticket is for the implementation of PPCA, that should be specified in the title. was (Author: elghoto): Apart from implementing Probabilistic PCA (which in my view is a different algorithm worth implementing as a separate algorithm), there are some issues with the current (2.3) implementation of RowMatrix.computePrincipalComponentsAndExplainedVariance that affect PCA training. In my opinion the big problem with the current implementation is line 387 of RowMatrix.scala, which causes OutOfMemory exceptions for these kinds of matrices, as it computes the covariance as a local Breeze dense matrix. 
val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]] The implementation materializes a dense local Breeze covariance matrix, which is not needed to compute either the principal components or the explained variance. In particular, RowMatrix provides a more optimized SVD decomposition, so the principal components and variance can be derived from that decomposition by computing (X - µ).computeSVD(k, false, 0). > Scalable PCA implementation for tall and fat matrices > - > > Key: SPARK-7856 > URL: https://issues.apache.org/jira/browse/SPARK-7856 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Tarek Elgamal > > Currently the PCA implementation has a limitation of fitting d^2 > covariance/Gramian matrix entries in memory (d is the number of > columns/dimensions of the matrix). We often need only the largest k principal > components. To make PCA really scalable, I suggest an implementation where > the memory usage is proportional to the number of principal components k rather than > the full dimensionality d. > I suggest adopting the solution described in this paper published in > SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). > The paper offers an implementation of Probabilistic PCA (PPCA), which has > lower memory and time complexity and could potentially scale to tall and fat > matrices rather than only the tall and skinny matrices supported by the > current PCA implementation. > Probabilistic PCA could potentially be added to the set of algorithms > supported by MLlib, and it does not necessarily replace the old PCA > implementation. > A PPCA implementation is included in Matlab's Statistics and Machine Learning > Toolbox (http://www.mathworks.com/help/stats/ppca.html) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7856) Scalable PCA implementation for tall and fat matrices
[ https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001551#comment-16001551 ] Ignacio Bermudez Corrales edited comment on SPARK-7856 at 5/9/17 2:00 AM: -- Apart from implementing Probabilistic PCA (which in my view is a different algorithm worth implementing separately from PCA), there are some issues with the current (2.3) implementation of RowMatrix.computePrincipalComponentsAndExplainedVariance that affect PCA training. In my opinion the big problem with the current implementation is line 387 of RowMatrix.scala, which causes OutOfMemory exceptions for these kinds of matrices, as it computes the covariance as a local Breeze dense matrix. val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]] The implementation materializes a dense local Breeze covariance matrix, which is not needed to compute either the principal components or the explained variance. In particular, RowMatrix provides a more optimized SVD decomposition, so the principal components and variance can be derived from that decomposition by computing (X - µ).computeSVD(k, false, 0). This leads to a more scalable implementation of PCA for tall and fat matrices. If this ticket is for the implementation of PPCA, that should be specified in the title. was (Author: elghoto): Apart from implementing Probabilistic PCA (which in my view is a different algorithm worth implementing as a separate algorithm), there are some issues with the current (2.3) implementation of RowMatrix.computePrincipalComponentsAndExplainedVariance that affect PCA training. In my opinion the big problem with the current implementation is line 387 of RowMatrix.scala, which causes OutOfMemory exceptions for these kinds of matrices, as it computes the covariance as a local Breeze dense matrix. 
val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]] The implementation materializes a dense local Breeze covariance matrix, which is not needed to compute either the principal components or the explained variance. In particular, RowMatrix provides a more optimized SVD decomposition, so the principal components and variance can be derived from that decomposition by computing (X - µ).computeSVD(k, false, 0). This leads to a more scalable implementation of PCA for tall and fat matrices. If this ticket is for the implementation of PPCA, that should be specified in the title. > Scalable PCA implementation for tall and fat matrices > - > > Key: SPARK-7856 > URL: https://issues.apache.org/jira/browse/SPARK-7856 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Tarek Elgamal > > Currently the PCA implementation has a limitation of fitting d^2 > covariance/Gramian matrix entries in memory (d is the number of > columns/dimensions of the matrix). We often need only the largest k principal > components. To make PCA really scalable, I suggest an implementation where > the memory usage is proportional to the number of principal components k rather than > the full dimensionality d. > I suggest adopting the solution described in this paper published in > SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). > The paper offers an implementation of Probabilistic PCA (PPCA), which has > lower memory and time complexity and could potentially scale to tall and fat > matrices rather than only the tall and skinny matrices supported by the > current PCA implementation. > Probabilistic PCA could potentially be added to the set of algorithms > supported by MLlib, and it does not necessarily replace the old PCA > implementation. 
> A PPCA implementation is included in Matlab's Statistics and Machine Learning > Toolbox (http://www.mathworks.com/help/stats/ppca.html) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
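The comment's claim can be checked numerically: the top-k principal components and explained-variance ratios obtained from the covariance eigendecomposition (the d x d matrix RowMatrix materializes today) match those obtained from an SVD of the mean-centered data, which never forms the covariance matrix at all. A NumPy sketch with toy data (dimensions and variable names are mine, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))  # toy tall matrix; d is kept tiny only for the demo
k = 2

# Covariance route (what line 387 of RowMatrix.scala effectively does):
# materialize a dense d x d covariance matrix, then eigendecompose it.
C = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
pcs_cov = eigvecs[:, order[:k]]
ratio_cov = eigvals[order[:k]] / eigvals.sum()

# SVD route (the comment's suggestion): SVD of the mean-centered data;
# no covariance matrix is ever formed.
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs_svd = Vt[:k].T
ratio_svd = s[:k] ** 2 / (s ** 2).sum()

# Same components (up to sign) and same explained-variance ratios.
assert np.allclose(np.abs(pcs_cov.T @ pcs_svd), np.eye(k), atol=1e-8)
assert np.allclose(ratio_cov, ratio_svd)
```

The (n-1) normalization cancels in the variance ratios, which is why the two routes agree exactly; the SVD route's memory footprint is driven by k and the row count rather than d^2.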
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Description: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng kofeng 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive insert overwrite that specific location successfully. 
hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; hive> select * from kofeng.partitioned_table; OK 123 kofeng 20170507 $ hadoop fs -ls /user/kofeng/partitioned_table/20170507 -rwxr-xr-x 3 kofeng kofeng 338 2017-05-08 17:26 /user/kofeng/partitioned_table/20170507/00_0 {code} was: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng kofeng 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location 
'/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; hive> select * from kofeng.partitioned_table; OK 123 kofeng 20170507 $ hadoop fs -ls /user/kofeng/partitioned_table/20170507 -rwxr-xr-x 3 kofeng kofeng 338 2017-05-08 17:26 /user/kofeng/partitioned_table/20170507/00_0 {code} > Data missing after insert overwrite table partition which is created on > specific location > - > > Key: SPARK-20663 > URL: https://issues.apache.org/jira/browse/SPARK-20663 > Project:
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Description: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng kofeng 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table 
partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; hive> select * from kofeng.partitioned_table; OK 123 kofeng 20170507 $ hadoop fs -ls /user/kofeng/partitioned_table/20170507 -rwxr-xr-x 3 kofeng kofeng 338 2017-05-08 17:26 /user/kofeng/partitioned_table/20170507/00_0 {code} was: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng kofeng 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the 
specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; Loading data to table kofeng.partitioned_table partition (dt=20170507) Moved: '/user/kofeng/partitioned_table/dt=20170507/00_0' to trash at: /user/kofeng/.Trash/Current Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, numRows=1, totalSize=338, rawDataSize=2] MapReduce Jobs Launched: Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 HDFS Write: 577 SUCCESS Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 HDFS Write:
[jira] [Created] (SPARK-20664) Remove stale applications from SHS listing
Marcelo Vanzin created SPARK-20664: -- Summary: Remove stale applications from SHS listing Key: SPARK-20664 URL: https://issues.apache.org/jira/browse/SPARK-20664 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 2.3.0 Reporter: Marcelo Vanzin See spec in parent issue (SPARK-18085) for more details. This task is actually not explicit in the spec, and it's also an issue with the current SHS. But having the SHS persist listing data makes it worse. Basically, the SHS currently does not detect when files are deleted from the event log directory manually; so those applications are still listed, and trying to see their UI will either show the UI (if it's loaded) or an error (if it's not). With the new SHS, that also means that data is leaked in the disk stores used to persist listing and UI data, making the problem worse. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
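The detection step described above amounts to reconciling the persisted listing against what is actually in the event log directory. A small Python sketch of that reconciliation (the dict-shaped listing and the helper name are illustrative assumptions, not the actual SHS data model):

```python
import os

def prune_stale_listing(listing, event_log_dir):
    """Drop applications whose event log was deleted from the log directory.

    listing: dict mapping app_id -> event-log file name (illustrative shape).
    Returns the pruned ids; in the real SHS this is also the point where the
    persisted listing/UI disk stores for those apps would need to be evicted.
    """
    present = set(os.listdir(event_log_dir)) if os.path.isdir(event_log_dir) else set()
    stale = [app_id for app_id, log in listing.items() if log not in present]
    for app_id in stale:
        del listing[app_id]
    return stale
```

Run on each listing refresh, this makes manually deleted logs disappear from the listing instead of lingering until someone clicks through to a broken UI.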
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Description: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng kofeng 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table 
partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; Loading data to table kofeng.partitioned_table partition (dt=20170507) Moved: '/user/kofeng/partitioned_table/dt=20170507/00_0' to trash at: /user/kofeng/.Trash/Current Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, numRows=1, totalSize=338, rawDataSize=2] MapReduce Jobs Launched: Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 HDFS Write: 577 SUCCESS Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 HDFS Write: 338 SUCCESS hive> select * from kofeng.partitioned_table; OK 123 kofeng 20170507 $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507 -rwxr-xr-x 3 kofeng kofeng 338 2017-05-08 17:26 /user/kofeng/partitioned_table/dt=20170507/00_0 {code} was: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. 
{code:title=partition_table_insert_overwrite.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id,
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Priority: Major (was: Minor) > Data missing after insert overwrite table partition which is created on > specific location > - > > Key: SPARK-20663 > URL: https://issues.apache.org/jira/browse/SPARK-20663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: kobefeng > Labels: easyfix > > Use spark sql to create partition table first, and alter table by adding > partition on specific location, then insert overwrite into this partition by > selection, which will cause data missing compared with HIVE. > {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} > -- create partition table first > $ hadoop fs -mkdir /user/kofeng/partitioned_table > $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql > spark-sql> create table kofeng.partitioned_table( > > id bigint, > > name string, > > dt string > > ) using parquet options ('compression'='snappy', > 'path'='/user/kofeng/partitioned_table') > > partitioned by (dt); > -- add partition with specific location > spark-sql> alter table kofeng.partitioned_table add if not exists > partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; > $ hadoop fs -ls /user/kofeng/partitioned_table > drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 > /user/kofeng/partitioned_table/20170507 > -- insert overwrite this partition, and the specific location folder gone, > data is missing, job is success by attaching _SUCCESS > spark-sql> insert overwrite table kofeng.partitioned_table > partition(dt='20170507') select 123 as id, "kofeng" as name; > $ hadoop fs -ls /user/kofeng/partitioned_table > -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 > /user/kofeng/partitioned_table/_SUCCESS > > > -- Then drop this partition and use hive to add partition and insert > overwrite this partition data, then verify: > spark-sql> alter table 
kofeng.partitioned_table drop if exists > partition(dt='20170507'); > hive> alter table kofeng.partitioned_table add if not exists > partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; > OK > -- could see hive also drop the specific location but data is preserved on > auto-created partition folder > hive> insert overwrite table kofeng.partitioned_table > partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; > Loading data to table kofeng.partitioned_table partition (dt=20170507) > Moved: '/user/kofeng/partitioned_table/dt=20170507/00_0' to trash > at: /user/kofeng/.Trash/Current > Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, > numRows=1, totalSize=338, rawDataSize=2] > MapReduce Jobs Launched: > Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 > HDFS Write: 577 SUCCESS > Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 > HDFS Write: 338 SUCCESS > hive> select * from kofeng.partitioned_table; > OK > 123 kofeng 20170507 > $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507 > -rwxr-xr-x 3 kofeng hdmi-technology338 2017-05-08 17:26 > /user/kofeng/partitioned_table/dt=20170507/00_0 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Affects Version/s: (was: 2.1.1) > Data missing after insert overwrite table partition which is created on > specific location > - > > Key: SPARK-20663 > URL: https://issues.apache.org/jira/browse/SPARK-20663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: kobefeng >Priority: Minor > Labels: easyfix > > Use spark sql to create partition table first, and alter table by adding > partition on specific location, then insert overwrite into this partition by > selection, which will cause data missing compared with HIVE. > {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} > -- create partition table first > $ hadoop fs -mkdir /user/kofeng/partitioned_table > $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql > spark-sql> create table kofeng.partitioned_table( > > id bigint, > > name string, > > dt string > > ) using parquet options ('compression'='snappy', > 'path'='/user/kofeng/partitioned_table') > > partitioned by (dt); > -- add partition with specific location > spark-sql> alter table kofeng.partitioned_table add if not exists > partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; > $ hadoop fs -ls /user/kofeng/partitioned_table > drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 > /user/kofeng/partitioned_table/20170507 > -- insert overwrite this partition, and the specific location folder gone, > data is missing, job is success by attaching _SUCCESS > spark-sql> insert overwrite table kofeng.partitioned_table > partition(dt='20170507') select 123 as id, "kofeng" as name; > $ hadoop fs -ls /user/kofeng/partitioned_table > -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 > /user/kofeng/partitioned_table/_SUCCESS > > > -- Then drop this partition and use hive to add partition and insert > overwrite this partition data, then verify: > 
spark-sql> alter table kofeng.partitioned_table drop if exists > partition(dt='20170507'); > hive> alter table kofeng.partitioned_table add if not exists > partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; > OK > -- could see hive also drop the specific location but data is preserved on > auto-created partition folder > hive> insert overwrite table kofeng.partitioned_table > partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; > Loading data to table kofeng.partitioned_table partition (dt=20170507) > Moved: '/user/kofeng/partitioned_table/dt=20170507/00_0' to trash > at: /user/kofeng/.Trash/Current > Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, > numRows=1, totalSize=338, rawDataSize=2] > MapReduce Jobs Launched: > Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 > HDFS Write: 577 SUCCESS > Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 > HDFS Write: 338 SUCCESS > hive> select * from kofeng.partitioned_table; > OK > 123 kofeng 20170507 > $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507 > -rwxr-xr-x 3 kofeng hdmi-technology338 2017-05-08 17:26 > /user/kofeng/partitioned_table/dt=20170507/00_0 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Labels: easyfix (was: ) > Data missing after insert overwrite table partition which is created on > specific location > - > > Key: SPARK-20663 > URL: https://issues.apache.org/jira/browse/SPARK-20663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1 >Reporter: kobefeng >Priority: Minor > Labels: easyfix > > Use spark sql to create partition table first, and alter table by adding > partition on specific location, then insert overwrite into this partition by > selection, which will cause data missing compared with HIVE. > {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} > -- create partition table first > $ hadoop fs -mkdir /user/kofeng/partitioned_table > $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql > spark-sql> create table kofeng.partitioned_table( > > id bigint, > > name string, > > dt string > > ) using parquet options ('compression'='snappy', > 'path'='/user/kofeng/partitioned_table') > > partitioned by (dt); > -- add partition with specific location > spark-sql> alter table kofeng.partitioned_table add if not exists > partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; > $ hadoop fs -ls /user/kofeng/partitioned_table > drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 > /user/kofeng/partitioned_table/20170507 > -- insert overwrite this partition, and the specific location folder gone, > data is missing, job is success by attaching _SUCCESS > spark-sql> insert overwrite table kofeng.partitioned_table > partition(dt='20170507') select 123 as id, "kofeng" as name; > $ hadoop fs -ls /user/kofeng/partitioned_table > -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 > /user/kofeng/partitioned_table/_SUCCESS > > > -- Then drop this partition and use hive to add partition and insert > overwrite this partition data, then verify: > 
spark-sql> alter table kofeng.partitioned_table drop if exists > partition(dt='20170507'); > hive> alter table kofeng.partitioned_table add if not exists > partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; > OK > -- could see hive also drop the specific location but data is preserved on > auto-created partition folder > hive> insert overwrite table kofeng.partitioned_table > partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; > Loading data to table kofeng.partitioned_table partition (dt=20170507) > Moved: '/user/kofeng/partitioned_table/dt=20170507/00_0' to trash > at: /user/kofeng/.Trash/Current > Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, > numRows=1, totalSize=338, rawDataSize=2] > MapReduce Jobs Launched: > Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 > HDFS Write: 577 SUCCESS > Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 > HDFS Write: 338 SUCCESS > hive> select * from kofeng.partitioned_table; > OK > 123 kofeng 20170507 > $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507 > -rwxr-xr-x 3 kofeng hdmi-technology338 2017-05-08 17:26 > /user/kofeng/partitioned_table/dt=20170507/00_0 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Description: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table 
kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; Loading data to table kofeng.partitioned_table partition (dt=20170507) Moved: 'hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table/dt=20170507/00_0' to trash at: hdfs://ares-lvs-nn-ha/user/kofeng/.Trash/Current Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, numRows=1, totalSize=338, rawDataSize=2] MapReduce Jobs Launched: Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 HDFS Write: 577 SUCCESS Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 HDFS Write: 338 SUCCESS hive> select * from kofeng.partitioned_table; OK 123 kofeng 20170507 $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507 -rwxr-xr-x 3 kofeng hdmi-technology338 2017-05-08 17:26 /user/kofeng/partitioned_table/dt=20170507/00_0 {code} was: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. 
{code:title=Bar.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Description: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table 
partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; Loading data to table kofeng.partitioned_table partition (dt=20170507) Moved: '/user/kofeng/partitioned_table/dt=20170507/00_0' to trash at: /user/kofeng/.Trash/Current Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, numRows=1, totalSize=338, rawDataSize=2] MapReduce Jobs Launched: Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 HDFS Write: 577 SUCCESS Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 HDFS Write: 338 SUCCESS hive> select * from kofeng.partitioned_table; OK 123 kofeng 20170507 $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507 -rwxr-xr-x 3 kofeng hdmi-technology338 2017-05-08 17:26 /user/kofeng/partitioned_table/dt=20170507/00_0 {code} was: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. 
{code:title=partition_table_insert_overwrite.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Description: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. {code:title=Bar.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table 
partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; Loading data to table kofeng.partitioned_table partition (dt=20170507) Moved: 'hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table/dt=20170507/00_0' to trash at: hdfs://ares-lvs-nn-ha/user/kofeng/.Trash/Current Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, numRows=1, totalSize=338, rawDataSize=2] MapReduce Jobs Launched: Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 HDFS Write: 577 SUCCESS Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 HDFS Write: 338 SUCCESS hive> select * from kofeng.partitioned_table; OK 123 kofeng 20170507 $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507 -rwxr-xr-x 3 kofeng hdmi-technology338 2017-05-08 17:26 /user/kofeng/partitioned_table/dt=20170507/00_0 {code} was: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. bq. 
-- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Description: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. bq. -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" 
as name from kofeng.test; Loading data to table kofeng.partitioned_table partition (dt=20170507) Moved: 'hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table/dt=20170507/00_0' to trash at: hdfs://ares-lvs-nn-ha/user/kofeng/.Trash/Current Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, numRows=1, totalSize=338, rawDataSize=2] MapReduce Jobs Launched: Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 HDFS Write: 577 SUCCESS Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 HDFS Write: 338 SUCCESS hive> select * from kofeng.partitioned_table; OK 123 kofeng 20170507 $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507 -rwxr-xr-x 3 kofeng hdmi-technology338 2017-05-08 17:26 /user/kofeng/partitioned_table/dt=20170507/00_0 was:Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. > Data missing after insert overwrite table partition which is created on > specific location > - > > Key: SPARK-20663 > URL: https://issues.apache.org/jira/browse/SPARK-20663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1 >Reporter: kobefeng >Priority: Minor > > Use spark sql to create partition table first, and alter table by adding > partition on specific location, then insert overwrite into this partition by > selection, which will cause data missing compared with HIVE. > bq. 
> -- create partition table first > $ hadoop fs -mkdir /user/kofeng/partitioned_table > $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql > spark-sql> create table kofeng.partitioned_table( > > id bigint, > > name string, > > dt string > > ) using parquet options ('compression'='snappy', > 'path'='hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table') > > partitioned by (dt); > -- add partition with specific location > spark-sql> alter table kofeng.partitioned_table add if not exists > partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; > $ hadoop fs -ls /user/kofeng/partitioned_table > drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 > /user/kofeng/partitioned_table/20170507 > -- insert overwrite this partition, and the specific location folder gone, > data is missing, job is success by attaching _SUCCESS > spark-sql> insert overwrite table kofeng.partitioned_table > partition(dt='20170507') select 123 as id, "kofeng" as name; > $ hadoop fs -ls
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Docs Text: (was: -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng hdmi-technology 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; Loading data to table kofeng.partitioned_table partition (dt=20170507) Moved: 'hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table/dt=20170507/00_0' to trash at: 
hdfs://ares-lvs-nn-ha/user/kofeng/.Trash/Current Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, numRows=1, totalSize=338, rawDataSize=2] MapReduce Jobs Launched: Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 HDFS Write: 577 SUCCESS Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 HDFS Write: 338 SUCCESS hive> select * from kofeng.partitioned_table; OK 123 kofeng 20170507 $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507 -rwxr-xr-x 3 kofeng hdmi-technology338 2017-05-08 17:26 /user/kofeng/partitioned_table/dt=20170507/00_0) > Data missing after insert overwrite table partition which is created on > specific location > - > > Key: SPARK-20663 > URL: https://issues.apache.org/jira/browse/SPARK-20663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.1.1 >Reporter: kobefeng >Priority: Minor > > Use spark sql to create partition table first, and alter table by adding > partition on specific location, then insert overwrite into this partition by > selection, which will cause data missing compared with HIVE. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
kobefeng created SPARK-20663: Summary: Data missing after insert overwrite table partition which is created on specific location Key: SPARK-20663 URL: https://issues.apache.org/jira/browse/SPARK-20663 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1, 2.1.0 Reporter: kobefeng Priority: Minor Using Spark SQL, create a partitioned table first, add a partition at a specific location via ALTER TABLE, then INSERT OVERWRITE into this partition from a SELECT; this causes data to go missing, unlike in HIVE. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20638) Optimize the CartesianRDD to reduce repeatedly data fetching
[ https://issues.apache.org/jira/browse/SPARK-20638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20638: Assignee: (was: Apache Spark) > Optimize the CartesianRDD to reduce repeatedly data fetching > > > Key: SPARK-20638 > URL: https://issues.apache.org/jira/browse/SPARK-20638 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Teng Jiang > > In CartesianRDD, group each iterator into multiple groups. Thus in the second > iteration, the data will be fetched (num of data)/groupSize times, rather > than (num of data) times. > The test results are: > Test Environment : 3 workers, each has 10 cores, 30G memory, 1 executor > Test data : users : 480,189, each is a 10-dim vector, and items : 17770, each > is a 10-dim vector. > With default CartesianRDD, cartesian time is 2420.7s. > With this proposal, cartesian time is 45.3s > 50x faster than the original method. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20638) Optimize the CartesianRDD to reduce repeatedly data fetching
[ https://issues.apache.org/jira/browse/SPARK-20638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20638: Assignee: Apache Spark > Optimize the CartesianRDD to reduce repeatedly data fetching > > > Key: SPARK-20638 > URL: https://issues.apache.org/jira/browse/SPARK-20638 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Teng Jiang >Assignee: Apache Spark > > In CartesianRDD, group each iterator into multiple groups. Thus in the second > iteration, the data will be fetched (num of data)/groupSize times, rather > than (num of data) times. > The test results are: > Test Environment : 3 workers, each has 10 cores, 30G memory, 1 executor > Test data : users : 480,189, each is a 10-dim vector, and items : 17770, each > is a 10-dim vector. > With default CartesianRDD, cartesian time is 2420.7s. > With this proposal, cartesian time is 45.3s > 50x faster than the original method. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
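The grouping idea in the issue description can be illustrated outside Spark. The sketch below is NOT Spark's actual CartesianRDD code; it simulates the "fetch" of the second RDD's partition (each full iteration of `RightSide` models one expensive remote fetch) and shows that buffering the outer iterator into groups cuts the number of fetches from N to ceil(N / groupSize). All names here are illustrative, not Spark APIs.

```python
class RightSide:
    """Stand-in for the second RDD's partition data; each full
    iteration models one (expensive) remote fetch."""
    def __init__(self, data):
        self.data = list(data)
        self.fetches = 0

    def __iter__(self):
        self.fetches += 1
        return iter(self.data)


def cartesian_naive(left, right):
    # One fetch of `right` per element of `left`: N fetches total.
    for a in left:
        for b in right:
            yield (a, b)


def cartesian_grouped(left, right, group_size):
    # Buffer `group_size` left elements, then fetch `right` once per
    # group: ceil(N / group_size) fetches total.
    left = list(left)
    for i in range(0, len(left), group_size):
        group = left[i:i + group_size]
        for b in right:
            for a in group:
                yield (a, b)
```

With the reported workload (480,189 users crossed with 17,770 items), a group size of, say, 100 would fetch the items roughly 4,802 times instead of 480,189 times, which is consistent with the order-of-magnitude speedup quoted above.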
[jira] [Commented] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
[ https://issues.apache.org/jira/browse/SPARK-20658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001811#comment-16001811 ] Marcelo Vanzin commented on SPARK-20658: Ok, so it's not an issue with old YARN jars being used. Will need to take a closer look, probably later this week... > spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect > > > Key: SPARK-20658 > URL: https://issues.apache.org/jira/browse/SPARK-20658 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Paul Jones >Priority: Minor > > I'm running a job in YARN cluster mode using > `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both > spark-default.conf and in my spark-submit command. (This flag shows up in the > environment tab of spark history server, so it seems that it's specified > correctly). > However, I just had a job die with four AM failures (three of the four > failures were over an hour apart). So, I'm confused as to what could be going > on. I haven't figured out the cause of the individual failures, so is it > possible that we always count certain types of failures? E.g. jobs that are > killed due to memory issues always count? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
[ https://issues.apache.org/jira/browse/SPARK-20658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001803#comment-16001803 ] Paul Jones commented on SPARK-20658: The jars are versioned 2.7.3. Finally finished grepping through the logs. I didn't find that error message. The closest I found was: {noformat} applications/hadoop-yarn/yarn-yarn-resourcemanager-ip-10-0-15-75.log.2017-04-28-03.gz:2017-04-28 03:37:33,051 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl (IPC Server handler 34 on 8032): The attemptFailuresValidityInterval for the application: application_1493122281436_0016 is 360. {noformat} > spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect > > > Key: SPARK-20658 > URL: https://issues.apache.org/jira/browse/SPARK-20658 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Paul Jones >Priority: Minor > > I'm running a job in YARN cluster mode using > `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both > spark-default.conf and in my spark-submit command. (This flag shows up in the > environment tab of spark history server, so it seems that it's specified > correctly). > However, I just had a job die with four AM failures (three of the four > failures were over an hour apart). So, I'm confused as to what could be going > on. I haven't figured out the cause of the individual failures, so is it > possible that we always count certain types of failures? E.g. jobs that are > killed due to memory issues always count? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
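For anyone reproducing this, the setting under discussion is normally supplied either in spark-defaults.conf or on the spark-submit command line. The command below is a sketch (master, resources, and the application jar `app.jar` are placeholders, not values from this issue):

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
  app.jar
```

One caveat worth keeping in mind: this interval is forwarded to YARN and only takes effect on YARN versions that support attempt-failure validity intervals (Hadoop 2.6+), which is presumably why the version of the YARN jars was checked in the comments above.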
[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA
[ https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001778#comment-16001778 ] Mingliang Liu commented on SPARK-20608: --- Quick question [~charliechen]: I think [~vanzin] is suggesting that we simply use the logical HDFS namespace instead of specific NNs. Say {{dfs.nameservices=mycluster}}; then {{hdfs://mycluster}} is what you need as the NN endpoint, instead of the specific namenodes (e.g. with {{dfs.ha.namenodes.mycluster=nn1,nn2}}, rather than using {{hdfs://nn1}} and {{hdfs://nn2}} directly). > Standby namenodes should be allowed to included in > yarn.spark.access.namenodes to support HDFS HA > - > > Key: SPARK-20608 > URL: https://issues.apache.org/jira/browse/SPARK-20608 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, YARN >Affects Versions: 2.0.1, 2.1.0 >Reporter: Yuechen Chen >Priority: Minor > Original Estimate: 672h > Remaining Estimate: 672h > > If one Spark Application needs to access remote namenodes, > yarn.spark.access.namenodes should only be configured in spark-submit > scripts, and the Spark Client (on YARN) would fetch HDFS credentials periodically. > If one Hadoop cluster is configured for HA, there would be one active namenode > and at least one standby namenode. > However, if yarn.spark.access.namenodes includes both active and standby > namenodes, the Spark Application will fail because the standby > namenode cannot be accessed by Spark, due to org.apache.hadoop.ipc.StandbyException. > I think it won't cause any bad effect to configure standby namenodes in > yarn.spark.access.namenodes, and my Spark Application can then sustain > the failover of the Hadoop namenode. > HA Examples: > Spark-submit script: > yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02 > Spark Application Codes: > dataframe.write.parquet(getActiveNameNode(...) 
+ hdfsPath)
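The switch from concrete namenode URIs to the logical nameservice suggested above can be sketched as a config fragment. The nameservice name {{mycluster}}, the RPC port 8020, and the property spelling {{yarn.spark.access.namenodes}} follow the thread; treat the exact keys as assumptions to verify against your Hadoop and Spark versions:

```properties
# hdfs-site.xml equivalents, shown as key=value for brevity (HDFS HA setup)
dfs.nameservices=mycluster
dfs.ha.namenodes.mycluster=nn1,nn2
dfs.namenode.rpc-address.mycluster.nn1=namenode01:8020
dfs.namenode.rpc-address.mycluster.nn2=namenode02:8020

# Spark side: reference the logical namespace once, not each namenode,
# so failover is handled by the HDFS client rather than application code
yarn.spark.access.namenodes=hdfs://mycluster
```

With this, {{dataframe.write.parquet("hdfs://mycluster" + hdfsPath)}} would need no {{getActiveNameNode}} helper, since the HA client resolves the active NN itself.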
[jira] [Reopened] (SPARK-1449) Please delete old releases from mirroring system
[ https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebb reopened SPARK-1449: - The mirror system includes the following: spark-1.6.2 spark-1.6.3 spark-2.0.1 spark-2.0.2 spark-2.1.0 spark-2.1.1 At least half of these are clearly superseded versions which should please be deleted. > Please delete old releases from mirroring system > > > Key: SPARK-1449 > URL: https://issues.apache.org/jira/browse/SPARK-1449 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 0.8.1, 0.9.1, 0.9.2, 1.0.0, 1.0.1 >Reporter: Sebb >Assignee: Patrick Wendell > Fix For: 0.8.1, 0.9.1, 0.9.2, 1.0.0, 1.0.1 > > > To reduce the load on the ASF mirrors, projects are required to delete old > releases [1] > Please can you remove all non-current releases? > Thanks! > [Note that older releases are always available from the ASF archive server] > Any links to older releases on download pages should first be adjusted to > point to the archive server. > [1] http://www.apache.org/dev/release.html#when-to-archive -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20600) KafkaRelation should be pretty printed in web UI (Details for Query)
[ https://issues.apache.org/jira/browse/SPARK-20600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001680#comment-16001680 ] Shixiong Zhu edited comment on SPARK-20600 at 5/8/17 10:38 PM: --- [~jlaskowski] Hope you can do it soon. Then we can put it into 2.2.0 if RC2 fails. was (Author: zsxwing): [~jlaskowski] Hope you want do it soon. Then we can put it into 2.2.0 if RC2 fails. > KafkaRelation should be pretty printed in web UI (Details for Query) > > > Key: SPARK-20600 > URL: https://issues.apache.org/jira/browse/SPARK-20600 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Trivial > Attachments: kafka-source-scan-webui.png > > > Executing the following batch query gives the default stringified/internal > name of {{KafkaRelation}} in web UI (under Details for Query), i.e. > http://localhost:4040/SQL/execution/?id=3 (<-- change the {{id}}). See the > attachment. > {code} > spark. > read. > format("kafka"). > option("subscribe", "topic1"). > option("kafka.bootstrap.servers", "localhost:9092"). > load. > select('value cast "string"). > write. > csv("fromkafka.csv") > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20600) KafkaRelation should be pretty printed in web UI (Details for Query)
[ https://issues.apache.org/jira/browse/SPARK-20600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001680#comment-16001680 ] Shixiong Zhu commented on SPARK-20600: -- [~jlaskowski] Hope you want do it soon. Then we can put it into 2.2.0 if RC2 fails. > KafkaRelation should be pretty printed in web UI (Details for Query) > > > Key: SPARK-20600 > URL: https://issues.apache.org/jira/browse/SPARK-20600 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Trivial > Attachments: kafka-source-scan-webui.png > > > Executing the following batch query gives the default stringified/internal > name of {{KafkaRelation}} in web UI (under Details for Query), i.e. > http://localhost:4040/SQL/execution/?id=3 (<-- change the {{id}}). See the > attachment. > {code} > spark. > read. > format("kafka"). > option("subscribe", "topic1"). > option("kafka.bootstrap.servers", "localhost:9092"). > load. > select('value cast "string"). > write. > csv("fromkafka.csv") > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
[ https://issues.apache.org/jira/browse/SPARK-20658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001638#comment-16001638 ] Marcelo Vanzin commented on SPARK-20658: That's different... what version of Hadoop libraries is part of the Spark build? Generally there will be Hadoop jars in {{$SPARK_HOME/jars}}. Those are the ones that matter. (Alternatively, if you found - or did not find - the log message I mentioned in your logs, that would have answered these questions already.) > spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect > > > Key: SPARK-20658 > URL: https://issues.apache.org/jira/browse/SPARK-20658 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Paul Jones >Priority: Minor > > I'm running a job in YARN cluster mode using > `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both > spark-default.conf and in my spark-submit command. (This flag shows up in the > environment tab of spark history server, so it seems that it's specified > correctly). > However, I just had a job die with with four AM failures (three of the four > failures were over an hour apart). So, I'm confused as to what could be going > on. I haven't figured out the cause of the individual failures, so is it > possible that we always count certain types of failures? E.g. jobs that are > killed due to memory issues always count? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
[ https://issues.apache.org/jira/browse/SPARK-20658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001638#comment-16001638 ] Marcelo Vanzin edited comment on SPARK-20658 at 5/8/17 10:02 PM: - That's different... what version of Hadoop libraries is part of the Spark build? Generally these will be Hadoop jars in {{$SPARK_HOME/jars}}. Those are the ones that matter. (Alternatively, if you found - or did not find - the log message I mentioned in your logs, that would have answered these questions already.) was (Author: vanzin): That's different... what version of Hadoop libraries is part of the Spark build? Generally there will be Hadoop jars in {{$SPARK_HOME/jars}}. Those are the ones that matter. (Alternatively, if you found - or did not find - the log message I mentioned in your logs, that would have answered these questions already.) > spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect > > > Key: SPARK-20658 > URL: https://issues.apache.org/jira/browse/SPARK-20658 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Paul Jones >Priority: Minor > > I'm running a job in YARN cluster mode using > `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both > spark-default.conf and in my spark-submit command. (This flag shows up in the > environment tab of spark history server, so it seems that it's specified > correctly). > However, I just had a job die with with four AM failures (three of the four > failures were over an hour apart). So, I'm confused as to what could be going > on. I haven't figured out the cause of the individual failures, so is it > possible that we always count certain types of failures? E.g. jobs that are > killed due to memory issues always count? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19268) File does not exist: /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta
[ https://issues.apache.org/jira/browse/SPARK-19268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001631#comment-16001631 ] Shixiong Zhu commented on SPARK-19268: -- [~skrishna] could you provide your codes, or the output of "dataset.explain(true)", please? Perhaps there is another bug in aggregation. > File does not exist: > /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta > -- > > Key: SPARK-19268 > URL: https://issues.apache.org/jira/browse/SPARK-19268 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 > Environment: - hadoop2.7 > - Java 7 >Reporter: liyan >Assignee: Shixiong Zhu >Priority: Critical > Fix For: 2.1.1, 2.2.0 > > > bq. ./run-example sql.streaming.JavaStructuredKafkaWordCount > 192.168.3.110:9092 subscribe topic03 > when i run the spark example raises the following error: > {quote} > Exception in thread "main" 17/01/17 14:13:41 DEBUG ContextCleaner: Got > cleaning task CleanBroadcast(4) > org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to > stage failure: Task 2 in stage 9.0 failed 1 times, most recent failure: Lost > task 2.0 in stage 9.0 (TID 46, localhost, executor driver): > java.lang.IllegalStateException: Error reading delta file > /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta of > HDFSStateStoreProvider[id = (op=0, part=2), dir = > /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2]: > /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta does > not exist > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:354) > at > 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:306) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:303) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:303) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:302) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:302) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:220) > at > org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:151) > at > org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.FileNotFoundException: File does not exist: > /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71) > at >
[jira] [Commented] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
[ https://issues.apache.org/jira/browse/SPARK-20658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001628#comment-16001628 ] Paul Jones commented on SPARK-20658: Ah... This is using Amazon's version of Hadoop 2.7.3 {noformat} $ hadoop version Hadoop 2.7.3-amzn-1 Subversion g...@aws157git.com:/pkg/Aws157BigTop -r 30eccced8ce8c483445f0aa3175ce725831ff06b Compiled by ec2-user on 2017-02-17T17:59Z Compiled with protoc 2.5.0 From source with checksum 1833aada17b94cfb94ad40ccd02d3df8 This command was run using /usr/lib/hadoop/hadoop-common-2.7.3-amzn-1.jar {noformat} > spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect > > > Key: SPARK-20658 > URL: https://issues.apache.org/jira/browse/SPARK-20658 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Paul Jones >Priority: Minor > > I'm running a job in YARN cluster mode using > `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both > spark-default.conf and in my spark-submit command. (This flag shows up in the > environment tab of the Spark history server, so it seems that it's specified > correctly). > However, I just had a job die with four AM failures (three of the four > failures were over an hour apart). So, I'm confused as to what could be going > on. I haven't figured out the cause of the individual failures, so is it > possible that we always count certain types of failures? E.g. jobs that are > killed due to memory issues always count?
[jira] [Commented] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
[ https://issues.apache.org/jira/browse/SPARK-20658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001623#comment-16001623 ] Marcelo Vanzin commented on SPARK-20658: That does not say which package you used (i.e. which version of Hadoop is packaged with your Spark build). > spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect > > > Key: SPARK-20658 > URL: https://issues.apache.org/jira/browse/SPARK-20658 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Paul Jones >Priority: Minor > > I'm running a job in YARN cluster mode using > `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both > spark-default.conf and in my spark-submit command. (This flag shows up in the > environment tab of spark history server, so it seems that it's specified > correctly). > However, I just had a job die with with four AM failures (three of the four > failures were over an hour apart). So, I'm confused as to what could be going > on. I haven't figured out the cause of the individual failures, so is it > possible that we always count certain types of failures? E.g. jobs that are > killed due to memory issues always count? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20661) SparkR tableNames() test fails
[ https://issues.apache.org/jira/browse/SPARK-20661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reassigned SPARK-20661: Assignee: Hossein Falaki > SparkR tableNames() test fails > -- > > Key: SPARK-20661 > URL: https://issues.apache.org/jira/browse/SPARK-20661 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki >Assignee: Hossein Falaki > Labels: test > Fix For: 2.2.0 > > > Due to prior state created by other test cases, testing {{tableNames()}} is > failing in master. > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2846/console -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
[ https://issues.apache.org/jira/browse/SPARK-20658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001612#comment-16001612 ] Paul Jones commented on SPARK-20658: {noformat} $ spark-submit --version version 2.1.0 Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_121 Branch HEAD Compiled by user ec2-user on 2017-02-17T19:03:33Z Revision 30eccced8ce8c483445f0aa3175ce725831ff06b Url g...@aws157git.com:/pkg/Aws157BigTop {noformat} > spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect > > > Key: SPARK-20658 > URL: https://issues.apache.org/jira/browse/SPARK-20658 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Paul Jones >Priority: Minor > > I'm running a job in YARN cluster mode using > `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both > spark-default.conf and in my spark-submit command. (This flag shows up in the > environment tab of spark history server, so it seems that it's specified > correctly). > However, I just had a job die with with four AM failures (three of the four > failures were over an hour apart). So, I'm confused as to what could be going > on. I haven't figured out the cause of the individual failures, so is it > possible that we always count certain types of failures? E.g. jobs that are > killed due to memory issues always count? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20661) SparkR tableNames() test fails
[ https://issues.apache.org/jira/browse/SPARK-20661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-20661. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17903 [https://github.com/apache/spark/pull/17903] > SparkR tableNames() test fails > -- > > Key: SPARK-20661 > URL: https://issues.apache.org/jira/browse/SPARK-20661 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > Labels: test > Fix For: 2.2.0 > > > Due to prior state created by other test cases, testing {{tableNames()}} is > failing in master. > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2846/console -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20630) Thread Dump link available in Executors tab irrespective of spark.ui.threadDumpsEnabled
[ https://issues.apache.org/jira/browse/SPARK-20630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20630: Assignee: (was: Apache Spark) > Thread Dump link available in Executors tab irrespective of > spark.ui.threadDumpsEnabled > --- > > Key: SPARK-20630 > URL: https://issues.apache.org/jira/browse/SPARK-20630 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Minor > Attachments: spark-webui-executors-threadDump.png > > > Irrespective of {{spark.ui.threadDumpsEnabled}} property web UI's Executors > page displays *Thread Dump* column with an active link (that does nothing > though). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20630) Thread Dump link available in Executors tab irrespective of spark.ui.threadDumpsEnabled
[ https://issues.apache.org/jira/browse/SPARK-20630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20630: Assignee: Apache Spark > Thread Dump link available in Executors tab irrespective of > spark.ui.threadDumpsEnabled > --- > > Key: SPARK-20630 > URL: https://issues.apache.org/jira/browse/SPARK-20630 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Assignee: Apache Spark >Priority: Minor > Attachments: spark-webui-executors-threadDump.png > > > Irrespective of {{spark.ui.threadDumpsEnabled}} property web UI's Executors > page displays *Thread Dump* column with an active link (that does nothing > though). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20630) Thread Dump link available in Executors tab irrespective of spark.ui.threadDumpsEnabled
[ https://issues.apache.org/jira/browse/SPARK-20630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001609#comment-16001609 ] Apache Spark commented on SPARK-20630: -- User 'ajbozarth' has created a pull request for this issue: https://github.com/apache/spark/pull/17904 > Thread Dump link available in Executors tab irrespective of > spark.ui.threadDumpsEnabled > --- > > Key: SPARK-20630 > URL: https://issues.apache.org/jira/browse/SPARK-20630 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Minor > Attachments: spark-webui-executors-threadDump.png > > > Irrespective of {{spark.ui.threadDumpsEnabled}} property web UI's Executors > page displays *Thread Dump* column with an active link (that does nothing > though). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20580) Allow RDD cache with unserializable objects
[ https://issues.apache.org/jira/browse/SPARK-20580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001607#comment-16001607 ] Fernando Pereira commented on SPARK-20580: -- I understand that at some point it will be better to fully implement serialization of our objects. To be more precise in our use case, the objects are instances of Python extension types (implemented in Cython). Apparently by default they will serialize and deserialize with their basic structures, except not non-python data, like buffers, and therefore the "deserialized" objects are not valid. My discussion here started since I found counter-intuitive that in some situations cache() may lead to the program to beak, I was looking for confirmation whether any operation following a map() will induce data deserialization (instead of trying to use the previous RDD data). Any chance this behavior changes? Thanks > Allow RDD cache with unserializable objects > --- > > Key: SPARK-20580 > URL: https://issues.apache.org/jira/browse/SPARK-20580 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Fernando Pereira >Priority: Minor > > In my current scenario we load complex Python objects in the worker nodes > that are not completely serializable. We then apply map certain operations to > the RDD which at some point we collect. In this basic usage all works well. > However, if we cache() the RDD (which defaults to memory) suddenly it fails > to execute the transformations after the caching step. Apparently caching > serializes the RDD data and deserializes it whenever more transformations are > required. > It would be nice to avoid serialization of the objects if they are to be > cached to memory, and keep the original object -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
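The confirmation above is that cache() does serialize and deserialize RDD data, so extension-type objects holding non-Python state (buffers) come back invalid. A hedged, Spark-free sketch of the fix direction — teaching such objects how to rebuild themselves via the pickle protocol; the class and its fields are hypothetical, not from the reporter's code:

```python
import pickle

class Handle:
    """Wraps a resource that pickle cannot copy directly.

    Without __reduce__, pickle would either fail or silently drop the
    non-Python buffer, which matches the "deserialized objects are not
    valid" symptom described in this thread.
    """
    def __init__(self, path):
        self.path = path
        # Stand-in for external (e.g. Cython-allocated) state.
        self._buf = bytearray(b"live-buffer")

    def __reduce__(self):
        # Rebuild from constructor args instead of copying raw state,
        # so the restored object re-acquires its buffer.
        return (Handle, (self.path,))

h = Handle("/tmp/data")
restored = pickle.loads(pickle.dumps(h))
assert restored.path == h.path
assert restored._buf == h._buf  # rebuilt by __init__, not copied
```

For Cython extension types the same hook (`__reduce__` or `__getstate__`/`__setstate__`) makes the objects survive the serialize/deserialize round trip that a memory cache() implies.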
[jira] [Resolved] (SPARK-20605) Deprecate not used AM and executor port configuration
[ https://issues.apache.org/jira/browse/SPARK-20605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-20605. Resolution: Fixed Assignee: Saisai Shao Fix Version/s: 2.3.0 > Deprecate not used AM and executor port configuration > - > > Key: SPARK-20605 > URL: https://issues.apache.org/jira/browse/SPARK-20605 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core, YARN >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.3.0 > > > After SPARK-10997, client mode Netty RpcEnv doesn't require to bind a port to > start server, so port configurations are not used any more, here propose to > remove these two configurations: "spark.executor.port" and "spark.am.port". -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20662) Block jobs that have greater than a configured number of tasks
Xuefu Zhang created SPARK-20662: --- Summary: Block jobs that have greater than a configured number of tasks Key: SPARK-20662 URL: https://issues.apache.org/jira/browse/SPARK-20662 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.0.0, 1.6.0 Reporter: Xuefu Zhang In a shared cluster, it's desirable for an admin to block large Spark jobs. While there might not be a single metric defining the size of a job, the number of tasks is usually a good indicator. Thus, it would be useful for the Spark scheduler to block a job whose number of tasks reaches a configured limit. By default, the limit could be infinite, to retain the existing behavior. MapReduce has mapreduce.job.max.map and mapreduce.job.max.reduce, which block an MR job at job submission time. The proposed configuration is spark.job.max.tasks with a default value of -1 (infinite).
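The proposed semantics can be sketched as a tiny submission-time guard. This is an illustration of the behavior being requested, not Spark scheduler code; the function name and error message are made up, and only the -1-means-unlimited convention comes from the proposal:

```python
# Hypothetical guard mirroring the proposed spark.job.max.tasks semantics:
# -1 (the proposed default) means unlimited, matching existing behavior.
def check_job_size(num_tasks: int, max_tasks: int = -1) -> None:
    if max_tasks >= 0 and num_tasks > max_tasks:
        raise RuntimeError(
            f"Job rejected: {num_tasks} tasks exceeds "
            f"spark.job.max.tasks={max_tasks}")

check_job_size(10_000)               # no limit configured: accepted
check_job_size(500, max_tasks=1000)  # under the limit: accepted
try:
    check_job_size(5000, max_tasks=1000)
except RuntimeError as e:
    print(e)  # the oversized job is blocked at submission time
```

Like mapreduce.job.max.map, the key design point is failing fast at submission rather than killing the job after resources have been consumed.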
[jira] [Commented] (SPARK-20500) ML, Graph 2.2 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-20500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001564#comment-16001564 ] Joseph K. Bradley commented on SPARK-20500: --- I'll take this one. > ML, Graph 2.2 QA: API: Binary incompatible changes > -- > > Key: SPARK-20500 > URL: https://issues.apache.org/jira/browse/SPARK-20500 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20500) ML, Graph 2.2 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-20500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-20500: - Assignee: Joseph K. Bradley > ML, Graph 2.2 QA: API: Binary incompatible changes > -- > > Key: SPARK-20500 > URL: https://issues.apache.org/jira/browse/SPARK-20500 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7856) Scalable PCA implementation for tall and fat matrices
[ https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001551#comment-16001551 ] Ignacio Bermudez Corrales edited comment on SPARK-7856 at 5/8/17 9:13 PM: -- Apart from implementing Probabilistic PCA (which in my view is a different algorithm worth implementing as a separate algorithm), there are some issues with the current (2.3) implementation of RowMatrix.computePrincipalComponentsAndExplainedVariance that affect the PCA training. In my opinion the Big problem with the current implementation is the line 387 of RowMatrix.scala, which causes OutOfMemory exceptions for this kind of matrices, as it computes the covariance as a local breeze dense matrix. val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]] The implementation computes a dense covariance local breeze matrix, which is not needed for the computation of the principal components nor explained variance. In particular, RowMatrix provides a more optimized SVD decomposition. Therefore, principal components and variance can be derived from this implementation of the decomposition, by computing the (X - µ).computeSVD( k, false, 0). was (Author: elghoto): Apart from implementing Probabilistic PCA (which in my view is a different algorithm worth implementing as a separate algorithm), there are some issues with the current (2.11) implementation of RowMatrix.computePrincipalComponentsAndExplainedVariance that affect the PCA training. In my opinion the Big problem with the current implementation is the line 387 of RowMatrix.scala, which causes OutOfMemory exceptions for this kind of matrices, as it computes the covariance as a local breeze dense matrix. val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]] The implementation computes a dense covariance local breeze matrix, which is not needed for the computation of the principal components nor explained variance. 
In particular, RowMatrix provides a more optimized SVD decomposition. Therefore, principal components and variance can be derived from this implementation of the decomposition, by computing the (X - µ).computeSVD(k, false, 0). > Scalable PCA implementation for tall and fat matrices > - > > Key: SPARK-7856 > URL: https://issues.apache.org/jira/browse/SPARK-7856 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Tarek Elgamal > > Currently the PCA implementation has a limitation of fitting d^2 > covariance/Gramian matrix entries in memory (d is the number of > columns/dimensions of the matrix). We often need only the largest k principal > components. To make PCA really scalable, I suggest an implementation where > the memory usage is proportional to the principal components k rather than > the full dimensionality d. > I suggest adopting the solution described in this paper that is published in > SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). > The paper offers an implementation for Probabilistic PCA (PPCA) which has > less memory and time complexity and could potentially scale to tall and fat > matrices rather than tall and skinny matrices that are supported by the > current PCA implementation. > Probabilistic PCA could be potentially added to the set of algorithms > supported by MLlib and it does not necessarily replace the old PCA > implementation. > PPCA implementation is adopted in Matlab's Statistics and Machine Learning > Toolbox (http://www.mathworks.com/help/stats/ppca.html) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
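The comment's claim, that the principal components and explained variance can be recovered from an SVD of the centered matrix without ever materializing the d x d covariance, can be sketched outside Spark with NumPy. This is an illustration of the linear algebra only, not the RowMatrix or Breeze API:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))          # tall matrix: n rows, d columns
Xc = X - X.mean(axis=0)                # the (X - mu) centering step

# Route 1: eigendecomposition of the d x d covariance (what the current
# implementation materializes locally; infeasible when d is large).
cov = Xc.T @ Xc / (X.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]      # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered matrix -- the covariance is never formed.
# Singular values relate to covariance eigenvalues by s^2 / (n - 1).
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = s ** 2 / (X.shape[0] - 1)

assert np.allclose(eigvals, svd_vals)
# Principal directions agree up to sign.
assert np.allclose(np.abs(eigvecs), np.abs(Vt.T), atol=1e-8)
```

Keeping only the top k singular triplets (as `computeSVD(k, ...)` does) bounds memory by k rather than d^2, which is the point of the comment.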
[jira] [Commented] (SPARK-7856) Scalable PCA implementation for tall and fat matrices
[ https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001551#comment-16001551 ] Ignacio Bermudez Corrales commented on SPARK-7856: -- Apart from implementing Probabilistic PCA (which in my view is a different algorithm worth implementing as a separate algorithm), there are some issues with the current (2.11) implementation of RowMatrix.computePrincipalComponentsAndExplainedVariance that affect the PCA training. In my opinion the Big problem with the current implementation is the line 387 of RowMatrix.scala, which causes OutOfMemory exceptions for this kind of matrices, as it computes the covariance as a local breeze dense matrix. val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]] The implementation computes a dense covariance local breeze matrix, which is not needed for the computation of the principal components nor explained variance. In particular, RowMatrix provides a more optimized SVD decomposition. Therefore, principal components and variance can be derived from this implementation of the decomposition, by computing the (X - µ).computeSVD( k, false, 0). > Scalable PCA implementation for tall and fat matrices > - > > Key: SPARK-7856 > URL: https://issues.apache.org/jira/browse/SPARK-7856 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Tarek Elgamal > > Currently the PCA implementation has a limitation of fitting d^2 > covariance/grammian matrix entries in memory (d is the number of > columns/dimensions of the matrix). We often need only the largest k principal > components. To make pca really scalable, I suggest an implementation where > the memory usage is proportional to the principal components k rather than > the full dimensionality d. > I suggest adopting the solution described in this paper that is published in > SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). 
> The paper offers an implementation for Probabilistic PCA (PPCA) which has > less memory and time complexity and could potentially scale to tall and fat > matrices rather than tall and skinny matrices that are supported by the > current PCA implementation. > Probabilistic PCA could be potentially added to the set of algorithms > supported by MLlib and it does not necessarily replace the old PCA > implementation. > PPCA implementation is adopted in Matlab's Statistics and Machine Learning > Toolbox (http://www.mathworks.com/help/stats/ppca.html)
[jira] [Assigned] (SPARK-20661) SparkR tableNames() test fails
[ https://issues.apache.org/jira/browse/SPARK-20661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20661: Assignee: Apache Spark > SparkR tableNames() test fails > -- > > Key: SPARK-20661 > URL: https://issues.apache.org/jira/browse/SPARK-20661 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki >Assignee: Apache Spark > Labels: test > > Due to prior state created by other test cases, testing {{tableNames()}} is > failing in master. > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2846/console -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20661) SparkR tableNames() test fails
[ https://issues.apache.org/jira/browse/SPARK-20661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20661: Assignee: (was: Apache Spark) > SparkR tableNames() test fails > -- > > Key: SPARK-20661 > URL: https://issues.apache.org/jira/browse/SPARK-20661 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > Labels: test > > Due to prior state created by other test cases, testing {{tableNames()}} is > failing in master. > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2846/console -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20661) SparkR tableNames() test fails
[ https://issues.apache.org/jira/browse/SPARK-20661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001521#comment-16001521 ] Apache Spark commented on SPARK-20661: -- User 'falaki' has created a pull request for this issue: https://github.com/apache/spark/pull/17903 > SparkR tableNames() test fails > -- > > Key: SPARK-20661 > URL: https://issues.apache.org/jira/browse/SPARK-20661 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hossein Falaki > Labels: test > > Due to prior state created by other test cases, testing {{tableNames()}} is > failing in master. > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2846/console -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20661) SparkR tableNames() test fails
Hossein Falaki created SPARK-20661: -- Summary: SparkR tableNames() test fails Key: SPARK-20661 URL: https://issues.apache.org/jira/browse/SPARK-20661 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.0 Reporter: Hossein Falaki Due to prior state created by other test cases, testing {{tableNames()}} is failing in master. https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2846/console
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001474#comment-16001474 ] Shixiong Zhu commented on SPARK-18057: -- [~helena_e] I didn't mean for Spark. Even in Spark, the required code changes are in tests. I meant, as a Spark user, why can't you add the Kafka client as a dependency and update it? Is it because you have some test code similar to Spark's, or are you using the Kafka API directly in your code? > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html
[jira] [Updated] (SPARK-20660) Not able to merge Dataframes with different column orders
[ https://issues.apache.org/jira/browse/SPARK-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michel Lemay updated SPARK-20660: - Description: Union on two dataframes with different column orders is not supported and lead to hard to find issues. Here is an example showing the issue. {code} import org.apache.spark.sql.types._ import org.apache.spark.sql.Row var inputSchema = StructType(StructField("key", StringType, nullable=true) :: StructField("value", IntegerType, nullable=true) :: Nil) var a = spark.createDataFrame(sc.parallelize((1 to 10)).map(x => Row(x.toString, 555)), inputSchema) var b = a.select($"value" * 2 alias "value", $"key") // any transformation changing column order will show the problem. a.union(b).show // in order to make it work, we need to reorder columns val bCols = a.columns.map(aCol => b(aCol)) a.union(b.select(bCols:_*)).show {code} was: Union on two dataframes with different column orders is not supported and lead to hard to find issues. Here is an example showing the issue. {code} import org.apache.spark.sql.types._ import org.apache.spark.sql.Row var inputSchema = StructType(StructField("key", StringType, nullable=true) :: StructField("value", IntegerType, nullable=true) :: Nil) var a = spark.createDataFrame(sc.parallelize((1 to 10)).map(x => Row(x.toString, 555)), inputSchema) var b = a.select($"value", $"key") // any transformation changing column order will show the problem. a.union(b).show // in order to make it work, we need to reorder columns val bCols = a.columns.map(aCol => b(aCol)) a.union(b.select(bCols:_*)).show {code} > Not able to merge Dataframes with different column orders > - > > Key: SPARK-20660 > URL: https://issues.apache.org/jira/browse/SPARK-20660 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Michel Lemay >Priority: Minor > > Union on two dataframes with different column orders is not supported and > lead to hard to find issues. 
> Here is an example showing the issue. > {code} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > var inputSchema = StructType(StructField("key", StringType, nullable=true) :: > StructField("value", IntegerType, nullable=true) :: Nil) > var a = spark.createDataFrame(sc.parallelize((1 to 10)).map(x => > Row(x.toString, 555)), inputSchema) > var b = a.select($"value" * 2 alias "value", $"key") // any transformation > changing column order will show the problem. > a.union(b).show > // in order to make it work, we need to reorder columns > val bCols = a.columns.map(aCol => b(aCol)) > a.union(b.select(bCols:_*)).show > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
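The positional-union pitfall and the reorder-by-name workaround quoted above can be illustrated without Spark. The helpers below are hypothetical Python stand-ins for union and select on lists of tuples, not DataFrame API:

```python
# Positional union (what DataFrame.union does) vs. a name-based reorder.
# Column names and helper functions here are illustrative only.

def union_by_position(rows_a, rows_b):
    # Concatenate rows with no regard for column names, like union().
    return rows_a + rows_b

def reorder_to(cols_target, cols_src, rows):
    # Permute each row of `rows` from cols_src order into cols_target order,
    # like b.select(bCols:_*) in the report.
    idx = [cols_src.index(c) for c in cols_target]
    return [tuple(row[i] for i in idx) for row in rows]

cols_a = ["key", "value"]
a = [("1", 555), ("2", 555)]
cols_b = ["value", "key"]        # same data, columns swapped
b = [(1110, "1"), (1110, "2")]

# Positional union silently puts keys into the "value" column.
bad = union_by_position(a, b)
assert bad[2] == (1110, "1")

# Reordering b's columns to a's order first gives the intended result.
good = union_by_position(a, reorder_to(cols_a, cols_b, b))
assert good[2] == ("1", 1110)
```

The failure is silent whenever the swapped columns happen to have compatible types, which is what makes these issues hard to find.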
[jira] [Comment Edited] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000130#comment-16000130 ] Helena Edelson edited comment on SPARK-18057 at 5/8/17 8:23 PM: It's not that simple, the PR I have queued for this required some code changes in the upgrade. It's not just a dependency addition/exclusion. was (Author: helena_e): Did that a while ago, my only point is not modifying artifacts ideally, by adding and excluding in builds. > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20660) Not able to merge Dataframes with different column orders
[ https://issues.apache.org/jira/browse/SPARK-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michel Lemay updated SPARK-20660: - Description: Union on two dataframes with different column orders is not supported and lead to hard to find issues. Here is an example showing the issue. {code} import org.apache.spark.sql.types._ import org.apache.spark.sql.Row var inputSchema = StructType(StructField("key", StringType, nullable=true) :: StructField("value", IntegerType, nullable=true) :: Nil) var a = spark.createDataFrame(sc.parallelize((1 to 10)).map(x => Row(x.toString, 555)), inputSchema) var b = a.select($"value", $"key") // any transformation changing column order will show the problem. a.union(b).show // in order to make it work, we need to reorder columns val bCols = a.columns.map(aCol => b(aCol)) a.union(b.select(bCols:_*)).show {code} was: Union on two dataframes with different column orders is not supported and lead to hard to find issues. Here is an example showing the issue. {code} import org.apache.spark.sql.types._ import org.apache.spark.sql.Row var inputSchema = StructType(StructField("key", StringType, nullable=true) :: StructField("value", IntegerType, nullable=true) :: Nil) var a = spark.createDataFrame(sc.parallelize((1 to 10)).map(x => Row(x.toString, 555)), inputSchema) var b = a.select($"value", $"key") // any transformation changing column order will show the problem. a.union(c).show // in order to make it work, we need to reorder columns val bCols = a.columns.map(aCol => b(aCol)) a.union(b.select(bCols:_*)).show {code} > Not able to merge Dataframes with different column orders > - > > Key: SPARK-20660 > URL: https://issues.apache.org/jira/browse/SPARK-20660 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Michel Lemay >Priority: Minor > > Union on two dataframes with different column orders is not supported and > lead to hard to find issues. 
> Here is an example showing the issue. > {code} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > var inputSchema = StructType(StructField("key", StringType, nullable=true) :: > StructField("value", IntegerType, nullable=true) :: Nil) > var a = spark.createDataFrame(sc.parallelize((1 to 10)).map(x => > Row(x.toString, 555)), inputSchema) > var b = a.select($"value", $"key") // any transformation changing column > order will show the problem. > a.union(b).show > // in order to make it work, we need to reorder columns > val bCols = a.columns.map(aCol => b(aCol)) > a.union(b.select(bCols:_*)).show > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20660) Not able to merge Dataframes with different column orders
[ https://issues.apache.org/jira/browse/SPARK-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michel Lemay updated SPARK-20660: - Description: Union on two dataframes with different column orders is not supported and lead to hard to find issues. Here is an example showing the issue. {code} import org.apache.spark.sql.types._ import org.apache.spark.sql.Row var inputSchema = StructType(StructField("key", StringType, nullable=true) :: StructField("value", IntegerType, nullable=true) :: Nil) var a = spark.createDataFrame(sc.parallelize((1 to 10)).map(x => Row(x.toString, 555)), inputSchema) var b = a.select($"value", $"key") // any transformation changing column order will show the problem. a.union(c).show // in order to make it work, we need to reorder columns val bCols = a.columns.map(aCol => b(aCol)) a.union(b.select(bCols:_*)).show {code} was: Union on two dataframes with different column orders is not supported and lead to hard to find issues. Here is an example showing the issue. {code} import org.apache.spark.sql.types._ import org.apache.spark.sql.Row var inputSchema = StructType(StructField("key", StringType, nullable=true) :: StructField("value", IntegerType, nullable=true) :: Nil) var a = spark.createDataFrame(sc.parallelize((1 to 10 by 2)).map(x => Row(x.toString, 555)), inputSchema) var b = a.select($"value", $"key") // any transformation changing column order will show the problem. a.union(c).show // in order to make it work, we need to reorder columns val bCols = a.columns.map(aCol => b(aCol)) a.union(b.select(bCols:_*)).show {code} > Not able to merge Dataframes with different column orders > - > > Key: SPARK-20660 > URL: https://issues.apache.org/jira/browse/SPARK-20660 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Michel Lemay >Priority: Minor > > Union on two dataframes with different column orders is not supported and > lead to hard to find issues. 
> Here is an example showing the issue. > {code} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > var inputSchema = StructType(StructField("key", StringType, nullable=true) :: > StructField("value", IntegerType, nullable=true) :: Nil) > var a = spark.createDataFrame(sc.parallelize((1 to 10)).map(x => > Row(x.toString, 555)), inputSchema) > var b = a.select($"value", $"key") // any transformation changing column > order will show the problem. > a.union(c).show > // in order to make it work, we need to reorder columns > val bCols = a.columns.map(aCol => b(aCol)) > a.union(b.select(bCols:_*)).show > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20660) Not able to merge Dataframes with different column orders
Michel Lemay created SPARK-20660: Summary: Not able to merge Dataframes with different column orders Key: SPARK-20660 URL: https://issues.apache.org/jira/browse/SPARK-20660 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Michel Lemay Priority: Minor Union on two dataframes with different column orders is not supported and lead to hard to find issues. Here is an example showing the issue. {code} import org.apache.spark.sql.types._ import org.apache.spark.sql.Row var inputSchema = StructType(StructField("key", StringType, nullable=true) :: StructField("value", IntegerType, nullable=true) :: Nil) var a = spark.createDataFrame(sc.parallelize((1 to 10 by 2)).map(x => Row(x.toString, 555)), inputSchema) var b = a.select($"value", $"key") // any transformation changing column order will show the problem. a.union(c).show // in order to make it work, we need to reorder columns val bCols = a.columns.map(aCol => b(aCol)) a.union(b.select(bCols:_*)).show {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20659) Remove StorageStatus, or make it private.
Marcelo Vanzin created SPARK-20659: -- Summary: Remove StorageStatus, or make it private. Key: SPARK-20659 URL: https://issues.apache.org/jira/browse/SPARK-20659 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.0 Reporter: Marcelo Vanzin With the work being done in SPARK-18085, StorageStatus is not used anymore by the UI. It's still used in a couple of other places, though: - {{SparkContext.getExecutorStorageStatus}} - {{BlockManagerSource}} (a metrics source) Both could be changed to use the REST API types; the first one could be replaced with a new method in {{SparkStatusTracker}}, which I also think is a better place for it anyway.
[jira] [Commented] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
[ https://issues.apache.org/jira/browse/SPARK-20658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001440#comment-16001440 ] Marcelo Vanzin commented on SPARK-20658: The exact build of Spark you're using would help. You can also check the logs for something like this: {noformat} Ignoring spark.yarn.am.attemptFailuresValidityInterval because the version of YARN does not support it {noformat} > spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect > > > Key: SPARK-20658 > URL: https://issues.apache.org/jira/browse/SPARK-20658 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Paul Jones >Priority: Minor > > I'm running a job in YARN cluster mode using > `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both > spark-defaults.conf and in my spark-submit command. (This flag shows up in the > environment tab of spark history server, so it seems that it's specified > correctly). > However, I just had a job die with four AM failures (three of the four > failures were over an hour apart). So, I'm confused as to what could be going > on. I haven't figured out the cause of the individual failures, so is it > possible that we always count certain types of failures? E.g. jobs that are > killed due to memory issues always count?
[jira] [Commented] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
[ https://issues.apache.org/jira/browse/SPARK-20658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001439#comment-16001439 ] Paul Jones commented on SPARK-20658: I know this likely isn't enough information to debug this issue. I'm happy to provide additional information. > spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect > > > Key: SPARK-20658 > URL: https://issues.apache.org/jira/browse/SPARK-20658 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Paul Jones >Priority: Minor > > I'm running a job in YARN cluster mode using > `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both > spark-defaults.conf and in my spark-submit command. (This flag shows up in the > environment tab of spark history server, so it seems that it's specified > correctly). > However, I just had a job die with four AM failures (three of the four > failures were over an hour apart). So, I'm confused as to what could be going > on. I haven't figured out the cause of the individual failures, so is it > possible that we always count certain types of failures? E.g. jobs that are > killed due to memory issues always count?
[jira] [Created] (SPARK-20658) spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect
Paul Jones created SPARK-20658: -- Summary: spark.yarn.am.attemptFailuresValidityInterval doesn't seem to have an effect Key: SPARK-20658 URL: https://issues.apache.org/jira/browse/SPARK-20658 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.1.0 Reporter: Paul Jones Priority: Minor I'm running a job in YARN cluster mode using `spark.yarn.am.attemptFailuresValidityInterval=1h` specified in both spark-defaults.conf and in my spark-submit command. (This flag shows up in the environment tab of spark history server, so it seems that it's specified correctly). However, I just had a job die with four AM failures (three of the four failures were over an hour apart). So, I'm confused as to what could be going on. I haven't figured out the cause of the individual failures, so is it possible that we always count certain types of failures? E.g. jobs that are killed due to memory issues always count?
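The semantics the flag is meant to enable can be sketched as a sliding window: only AM failures within the last interval count toward the maximum, so failures more than an hour apart should never accumulate under a 1h interval. This is an illustrative Python model of that rule, not YARN's implementation:

```python
# Count only the failures that fall inside the trailing validity window.
# Timestamps are in seconds and purely illustrative.

def failures_in_window(failure_times, now, interval):
    return sum(1 for t in failure_times if now - t <= interval)

ONE_HOUR = 3600
# Failures at t=0, 4000, and 8500 are each more than an hour apart;
# the one at t=8600 comes only 100s after the previous.
failures = [0, 4000, 8500, 8600]

# With a 1h validity interval, only the last two failures count at t=8600.
assert failures_in_window(failures, 8600, ONE_HOUR) == 2
# With no validity interval, every failure ever seen counts.
assert failures_in_window(failures, 8600, float("inf")) == 4
```

If the reported job with failures over an hour apart still exhausted its attempts, that behaves like the "no interval" branch, consistent with the flag being ignored (e.g. unsupported by the YARN version, as the earlier comment suggests).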
[jira] [Commented] (SPARK-20641) Key-value store abstraction and implementation for storing application data
[ https://issues.apache.org/jira/browse/SPARK-20641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001432#comment-16001432 ] Apache Spark commented on SPARK-20641: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/17902 > Key-value store abstraction and implementation for storing application data > --- > > Key: SPARK-20641 > URL: https://issues.apache.org/jira/browse/SPARK-20641 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task tracks adding a key-value store abstraction and initial LevelDB > implementation to be used to store application data for building the UI and > REST API. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20641) Key-value store abstraction and implementation for storing application data
[ https://issues.apache.org/jira/browse/SPARK-20641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20641: Assignee: (was: Apache Spark) > Key-value store abstraction and implementation for storing application data > --- > > Key: SPARK-20641 > URL: https://issues.apache.org/jira/browse/SPARK-20641 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task tracks adding a key-value store abstraction and initial LevelDB > implementation to be used to store application data for building the UI and > REST API. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20641) Key-value store abstraction and implementation for storing application data
[ https://issues.apache.org/jira/browse/SPARK-20641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20641: Assignee: Apache Spark > Key-value store abstraction and implementation for storing application data > --- > > Key: SPARK-20641 > URL: https://issues.apache.org/jira/browse/SPARK-20641 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > See spec in parent issue (SPARK-18085) for more details. > This task tracks adding a key-value store abstraction and initial LevelDB > implementation to be used to store application data for building the UI and > REST API. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001429#comment-16001429 ] Ryan Blue commented on SPARK-12297: --- The Impala team has been working with the Parquet community recently to update the Parquet spec so that we can distinguish between timestamp with/without time zone. I think once that's committed, we should just move off of the INT96 timestamp and use the proper spec. > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue >Assignee: Imran Rashid > Fix For: 2.3.0 > > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. 
Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same 
times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, it's a > source of incompatibility between Hive, Impala, and Spark. > HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. Spark should add the same to ensure > consistency between file formats, and with Hive & Impala. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
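The round trip described above can be mimicked with the Python standard library: interpreting the same wall-clock value with "instant" semantics (as the int96/Parquet path effectively does here) shifts it when re-rendered in another zone, while "floating" semantics (the textfile/JSON path) leaves it alone. This is an analogy to the report's output, not Spark code; it requires Python 3.9+ for zoneinfo:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+, needs a system tz database

la = ZoneInfo("America/Los_Angeles")
ny = ZoneInfo("America/New_York")

wall = "2015-12-31 23:50:59"
naive = datetime.strptime(wall, "%Y-%m-%d %H:%M:%S")

# Instant semantics: fix the moment in LA, then re-render it in NY.
# The wall clock moves forward 3 hours (PST is UTC-8, EST is UTC-5),
# matching the first row of the la_parquet output above.
as_instant = naive.replace(tzinfo=la).astimezone(ny)
assert as_instant.strftime("%Y-%m-%d %H:%M:%S") == "2016-01-01 02:50:59"

# Floating semantics: the wall clock *is* the value; nothing shifts,
# matching the la_textfile and la_json output.
assert naive.strftime("%Y-%m-%d %H:%M:%S") == wall
```

Joining the two representations fails for the same reason the report's last join returns no rows: the instants and the floating values no longer denote the same wall-clock strings.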
[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001426#comment-16001426 ] Shixiong Zhu commented on SPARK-13747: -- [~revolucion09] The default dispatcher uses ForkJoinPool. See http://doc.akka.io/docs/akka/current/scala/dispatchers.html#Default_dispatcher > Concurrent execution in SQL doesn't work with Scala ForkJoinPool > > > Key: SPARK-13747 > URL: https://issues.apache.org/jira/browse/SPARK-13747 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Running the following code may fail > {code} > (1 to 100).par.foreach { _ => > println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()) > } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > This is because SparkContext.runJob can be suspended when using a > ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it > calls Await.ready (introduced by https://github.com/apache/spark/pull/9264). > So when SparkContext.runJob is suspended, ForkJoinPool will run another task > in the same thread, however, the local properties have been polluted.
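The thread-reuse hazard described above, where a pool schedules another task on a thread whose local properties were never cleared, can be reproduced in miniature with any thread pool. This Python sketch uses illustrative names, not Spark internals:

```python
# A task sets a thread-local "execution id" and (like the suspended runJob)
# never clears it; the next task the pool runs on that thread inherits the
# stale value and fails exactly like the quoted IllegalArgumentException.
import threading
from concurrent.futures import ThreadPoolExecutor

local = threading.local()

def with_new_execution_id(job):
    if getattr(local, "execution_id", None) is not None:
        raise RuntimeError("spark.sql.execution.id is already set")
    local.execution_id = job
    # A well-behaved task would clear the property in a finally block;
    # this one "suspends" and leaves the thread polluted.

# max_workers=1 forces the second task onto the same (polluted) thread.
with ThreadPoolExecutor(max_workers=1) as pool:
    pool.submit(with_new_execution_id, "job-1").result()
    err = pool.submit(with_new_execution_id, "job-2").exception()

assert isinstance(err, RuntimeError)
assert "already set" in str(err)
```

The fix directions discussed for Spark amount to either clearing such properties before yielding the thread, or not carrying them via thread locals at all.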