[jira] [Commented] (SPARK-22423) Scala test source files like TestHiveSingleton.scala should be in scala source root
[ https://issues.apache.org/jira/browse/SPARK-22423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237176#comment-16237176 ]

xubo245 commented on SPARK-22423:
---------------------------------

OK, I will fix it.

> Scala test source files like TestHiveSingleton.scala should be in scala source root
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-22423
>                 URL: https://issues.apache.org/jira/browse/SPARK-22423
>             Project: Spark
>          Issue Type: Test
>          Components: Tests
>    Affects Versions: 2.2.0
>           Reporter: xubo245
>           Priority: Minor
>
> The TestHiveSingleton.scala file should be in the scala source directory, not in the java directory.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22427) StackOverFlowError when using FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-22427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237174#comment-16237174 ]

yuhao yang commented on SPARK-22427:
------------------------------------

Could you please try increasing the stack size, e.g. with -Xss10m?

> StackOverFlowError when using FPGrowth
> --------------------------------------
>
>                 Key: SPARK-22427
>                 URL: https://issues.apache.org/jira/browse/SPARK-22427
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.2.0
>        Environment: CentOS Linux 3.10.0-327.el7.x86_64
>                     java 1.8.0.111
>                     spark 2.2.0
>           Reporter: lyt
>           Priority: Normal
>
> Code part:
> {code}
> val path = jobConfig.getString("hdfspath")
> val vectordata = sc.sparkContext.textFile(path)
> val finaldata = sc.createDataset(vectordata.map(obj => {
>   obj.split(" ")
> }).filter(arr => arr.length > 0)).toDF("items")
> val fpg = new FPGrowth()
> fpg.setMinSupport(minSupport).setItemsCol("items").setMinConfidence(minConfidence)
> val train = fpg.fit(finaldata)
> print(train.freqItemsets.count())
> print(train.associationRules.count())
> train.save("/tmp/FPGModel")
> {code}
> And encountered the following exception:
> {code}
> Driver stacktrace:
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>   at scala.Option.foreach(Option.scala:257)
>   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>   at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
>   at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2430)
>   at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2429)
>   at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836)
>   at org.apache.spark.sql.Dataset.count(Dataset.scala:2429)
>   at DataMining.FPGrowth$.runJob(FPGrowth.scala:116)
>   at DataMining.testFPG$.main(FPGrowth.scala:36)
>   at DataMining.testFPG.main(FPGrowth.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
>   at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
> {code}
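[Editor's sketch] The -Xss10m suggestion above can be passed to both the driver and executor JVMs through Spark's extraJavaOptions settings. The class name below is taken from the reporter's stack trace; the jar name is a placeholder, and in client mode the driver option may need to be set via --driver-java-options instead:

```shell
# Hypothetical spark-submit invocation raising the JVM thread stack size
# to 10 MB on both driver and executors (jar name assumed).
spark-submit \
  --class DataMining.testFPG \
  --conf "spark.driver.extraJavaOptions=-Xss10m" \
  --conf "spark.executor.extraJavaOptions=-Xss10m" \
  fpgrowth-assembly.jar
```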
[jira] [Updated] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-22211:
----------------------------
Target Version/s: 2.2.1, 2.3.0

> LimitPushDown optimization for FullOuterJoin generates wrong results
> --------------------------------------------------------------------
>
>                 Key: SPARK-22211
>                 URL: https://issues.apache.org/jira/browse/SPARK-22211
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>        Environment: on community.cloude.databrick.com
>                     Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11)
>           Reporter: Benyi Wang
>           Priority: Major
>
> LimitPushDown pushes LocalLimit to one side of a FullOuterJoin, but this may generate a wrong result. Assume we use limit(1) and the LocalLimit is pushed to the left side, selecting id=999; if the right side has 100K rows including 999, the result will be:
> - one row (999, 999)
> - the remaining rows (null, xxx)
> Once you call show(), the row (999, 999) has only a 1/10th chance of being selected by CollectLimit.
> The actual optimization might be:
> - push down the limit
> - but convert the join to a broadcast LeftOuterJoin or RightOuterJoin.
> Here is my notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html
> {code:java}
> import scala.util.Random._
> val dl = shuffle(1 to 10).toDF("id")
> val dr = shuffle(1 to 10).toDF("id")
> println("data frame dl:")
> dl.explain
> println("data frame dr:")
> dr.explain
> val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1)
> j.explain
> j.show(false)
> {code}
> {code}
> data frame dl:
> == Physical Plan ==
> LocalTableScan [id#10]
> data frame dr:
> == Physical Plan ==
> LocalTableScan [id#16]
> == Physical Plan ==
> CollectLimit 1
> +- SortMergeJoin [id#10], [id#16], FullOuter
>    :- *Sort [id#10 ASC NULLS FIRST], false, 0
>    :  +- Exchange hashpartitioning(id#10, 200)
>    :     +- *LocalLimit 1
>    :        +- LocalTableScan [id#10]
>    +- *Sort [id#16 ASC NULLS FIRST], false, 0
>       +- Exchange hashpartitioning(id#16, 200)
>          +- LocalTableScan [id#16]
> import scala.util.Random._
> dl: org.apache.spark.sql.DataFrame = [id: int]
> dr: org.apache.spark.sql.DataFrame = [id: int]
> j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int]
> +----+---+
> |id  |id |
> +----+---+
> |null|148|
> +----+---+
> {code}
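[Editor's sketch] The flaw described above can be reproduced without Spark. The plain-Python sketch below (hypothetical data, not the reporter's) simulates a full outer join: limiting one input before the join turns every unmatched row from the other side into a null-padded row, so a later CollectLimit(1) returns the matched row only by luck of ordering.

```python
# Spark-free simulation of why pushing LocalLimit below a full outer join
# is unsound.

def full_outer_join(left, right):
    """Full outer join of two int lists on equality, as (l, r) pairs."""
    ls, rs = set(left), set(right)
    out = [(x, x) for x in left if x in rs]           # matched rows
    out += [(x, None) for x in left if x not in rs]   # left-only rows
    out += [(None, y) for y in right if y not in ls]  # right-only rows
    return out

left = list(range(1, 11))
right = list(range(1, 11))

# Correct plan: join first, then limit. Every candidate row is a valid
# answer, and with identical inputs it is always a matched pair.
correct = full_outer_join(left, right)[:1]
assert correct[0][0] == correct[0][1]

# Pushed-down plan: LocalLimit(1) applied to the left input before joining.
pushed = full_outer_join(left[:1], right)
# Only one matched row survives, surrounded by null-padded right-only rows,
# so a final take(1) may return (None, y) instead of the match.
assert sum(1 for l, r in pushed if l == r) == 1
assert sum(1 for l, r in pushed if l is None) == len(right) - 1
```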
[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237133#comment-16237133 ]

Xiao Li commented on SPARK-22211:
---------------------------------

Will submit a PR based on my previous PR https://github.com/apache/spark/pull/10454

> LimitPushDown optimization for FullOuterJoin generates wrong results
> --------------------------------------------------------------------
>
>                 Key: SPARK-22211
>                 URL: https://issues.apache.org/jira/browse/SPARK-22211
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>        Environment: on community.cloude.databrick.com
>                     Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11)
>           Reporter: Benyi Wang
>           Priority: Major
>
> LimitPushDown pushes LocalLimit to one side of a FullOuterJoin, but this may generate a wrong result. Assume we use limit(1) and the LocalLimit is pushed to the left side, selecting id=999; if the right side has 100K rows including 999, the result will be:
> - one row (999, 999)
> - the remaining rows (null, xxx)
> Once you call show(), the row (999, 999) has only a 1/10th chance of being selected by CollectLimit.
> The actual optimization might be:
> - push down the limit
> - but convert the join to a broadcast LeftOuterJoin or RightOuterJoin.
> Here is my notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html
[jira] [Commented] (SPARK-22308) Support unit tests of spark code using ScalaTest using suites other than FunSuite
[ https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237121#comment-16237121 ]

Nathan Kronenfeld commented on SPARK-22308:
-------------------------------------------

OK, found the problem: it was the new tests; they weren't cleaning up after themselves. I'm still trying to get past the Hive issues that were keeping me from using Maven in the first place, but I should have this back to you in the next day or two.

> Support unit tests of spark code using ScalaTest using suites other than FunSuite
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-22308
>                 URL: https://issues.apache.org/jira/browse/SPARK-22308
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation, Spark Core, SQL, Tests
>    Affects Versions: 2.2.0
>           Reporter: Nathan Kronenfeld
>           Assignee: Nathan Kronenfeld
>           Priority: Minor
>             Labels: scalatest, test-suite, test_issue
>
> External codebases that contain Spark code can test it using SharedSparkContext no matter how they write their ScalaTest suites, whether based on FunSuite, FunSpec, FlatSpec, or WordSpec. SharedSQLContext, however, only supports FunSuite.
[jira] [Commented] (SPARK-22427) StackOverFlowError when using FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-22427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237112#comment-16237112 ]

Kazuaki Ishizaki commented on SPARK-22427:
------------------------------------------

Thank you for reporting this issue. Could you please attach the data file, or share the data size along with a sample of the data?

> StackOverFlowError when using FPGrowth
> --------------------------------------
>
>                 Key: SPARK-22427
>                 URL: https://issues.apache.org/jira/browse/SPARK-22427
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.2.0
>        Environment: CentOS Linux 3.10.0-327.el7.x86_64
>                     java 1.8.0.111
>                     spark 2.2.0
>           Reporter: lyt
>           Priority: Normal
[jira] [Assigned] (SPARK-22254) clean up the implementation of `growToSize` in CompactBuffer
[ https://issues.apache.org/jira/browse/SPARK-22254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22254:
------------------------------------
Assignee: Apache Spark

> clean up the implementation of `growToSize` in CompactBuffer
> ------------------------------------------------------------
>
>                 Key: SPARK-22254
>                 URL: https://issues.apache.org/jira/browse/SPARK-22254
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.1
>           Reporter: Feng Liu
>           Assignee: Apache Spark
>           Priority: Major
>
> Two issues:
> 1. arrayMax should be `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH`.
> 2. I believe some `-2` terms were introduced because `Integer.MAX_VALUE` was used previously. We should make the calculation of newArrayLen concise.
[jira] [Assigned] (SPARK-22254) clean up the implementation of `growToSize` in CompactBuffer
[ https://issues.apache.org/jira/browse/SPARK-22254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22254:
------------------------------------
Assignee: (was: Apache Spark)

> clean up the implementation of `growToSize` in CompactBuffer
> ------------------------------------------------------------
>
>                 Key: SPARK-22254
>                 URL: https://issues.apache.org/jira/browse/SPARK-22254
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.1
>           Reporter: Feng Liu
>           Priority: Major
>
> Two issues:
> 1. arrayMax should be `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH`.
> 2. I believe some `-2` terms were introduced because `Integer.MAX_VALUE` was used previously. We should make the calculation of newArrayLen concise.
[jira] [Commented] (SPARK-22254) clean up the implementation of `growToSize` in CompactBuffer
[ https://issues.apache.org/jira/browse/SPARK-22254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237035#comment-16237035 ]

Apache Spark commented on SPARK-22254:
--------------------------------------

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19650

> clean up the implementation of `growToSize` in CompactBuffer
> ------------------------------------------------------------
>
>                 Key: SPARK-22254
>                 URL: https://issues.apache.org/jira/browse/SPARK-22254
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.1
>           Reporter: Feng Liu
>           Priority: Major
>
> Two issues:
> 1. arrayMax should be `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH`.
> 2. I believe some `-2` terms were introduced because `Integer.MAX_VALUE` was used previously. We should make the calculation of newArrayLen concise.
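[Editor's sketch] The growth policy the ticket describes can be outlined in plain Python. This is an illustrative model, not Spark's actual implementation: capacity doubles until it covers the requested size and is clamped at the maximum safe JVM array length (Spark's `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH`, assumed here to be Integer.MAX_VALUE - 15).

```python
# Assumed value of ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH
# (Integer.MAX_VALUE - 15); not taken from this thread.
MAX_ROUNDED_ARRAY_LENGTH = 2**31 - 1 - 15

def grow_to_size(cur_capacity: int, needed: int) -> int:
    """Return the new capacity: doubled until >= needed, clamped at the
    maximum safe array length; raise if the request itself exceeds it."""
    if needed > MAX_ROUNDED_ARRAY_LENGTH:
        raise OverflowError("requested size exceeds max array length")
    new_len = max(cur_capacity, 1)
    while new_len < needed:
        new_len *= 2
    return min(new_len, MAX_ROUNDED_ARRAY_LENGTH)

# Usage: growing a 64-slot buffer to hold 100 elements doubles to 128;
# near the cap, doubling is clamped instead of overflowing.
assert grow_to_size(64, 100) == 128
assert grow_to_size(2**30, 2**30 + 1) == MAX_ROUNDED_ARRAY_LENGTH
```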
[jira] [Commented] (SPARK-21791) ORC should support column names with dot
[ https://issues.apache.org/jira/browse/SPARK-21791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237095#comment-16237095 ]

Apache Spark commented on SPARK-21791:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19651

> ORC should support column names with dot
> ----------------------------------------
>
>                 Key: SPARK-21791
>                 URL: https://issues.apache.org/jira/browse/SPARK-21791
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.1.0, 2.2.0
>           Reporter: Dongjoon Hyun
>           Priority: Major
>
> *PARQUET*
> {code}
> scala> Seq(Some(1), None).toDF("col.dots").write.parquet("/tmp/parquet_dot")
> scala> spark.read.parquet("/tmp/parquet_dot").show
> +--------+
> |col.dots|
> +--------+
> |       1|
> |    null|
> +--------+
> {code}
> *ORC*
> {code}
> scala> Seq(Some(1), None).toDF("col.dots").write.orc("/tmp/orc_dot")
> scala> spark.read.orc("/tmp/orc_dot").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '.' expecting ':'(line 1, pos 10)
> == SQL ==
> struct<col.dots:int>
> ----------^^^
> {code}
[jira] [Commented] (SPARK-20682) Add new ORCFileFormat based on Apache ORC
[ https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237093#comment-16237093 ]

Apache Spark commented on SPARK-20682:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19651

> Add new ORCFileFormat based on Apache ORC
> -----------------------------------------
>
>                 Key: SPARK-20682
>                 URL: https://issues.apache.org/jira/browse/SPARK-20682
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1, 2.2.0
>           Reporter: Dongjoon Hyun
>           Priority: Major
>
> Since SPARK-2883, Apache Spark has supported Apache ORC inside the `sql/hive` module with a Hive dependency. This issue aims to add a new and faster ORC data source inside `sql/core` and to replace the old ORC data source eventually. In this issue, the latest Apache ORC 1.4.0 (released yesterday) is used.
> There are four key benefits:
> - Speed: use both Spark `ColumnarBatch` and ORC `RowBatch` together, which is faster than the current implementation in Spark.
> - Stability: Apache ORC 1.4.0 has many fixes, and we can depend on the ORC community more.
> - Usability: users can use the `ORC` data source without the hive module, i.e. without `-Phive`.
> - Maintainability: reduce the Hive dependency, so the old legacy code can be removed later.
[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe
[ https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237094#comment-16237094 ]

Apache Spark commented on SPARK-15474:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19651

> ORC data source fails to write and read back empty dataframe
> -------------------------------------------------------------
>
>                 Key: SPARK-15474
>                 URL: https://issues.apache.org/jira/browse/SPARK-15474
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.1.1, 2.2.0
>           Reporter: Hyukjin Kwon
>           Priority: Major
>
> Currently, the ORC data source fails to write and read back empty data.
> The code below:
> {code}
> val emptyDf = spark.range(10).limit(0)
> emptyDf.write
>   .format("orc")
>   .save(path.getCanonicalPath)
> val copyEmptyDf = spark.read
>   .format("orc")
>   .load(path.getCanonicalPath)
> copyEmptyDf.show()
> {code}
> throws the exception below:
> {code}
> Unable to infer schema for ORC at /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. It must be specified manually;
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. It must be specified manually;
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140)
>   at org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892)
>   at org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884)
>   at org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
> {code}
> Note that this is a different case from the data below:
> {code}
> val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
> {code}
> In this case, no writer is initialised or created (no calls of {{WriterContainer.writeRows()}}).
> For Parquet and JSON it works, but ORC does not.
[jira] [Updated] (SPARK-20682) Add new ORCFileFormat based on Apache ORC
[ https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-20682:
----------------------------------
Summary: Add new ORCFileFormat based on Apache ORC (was: Support a new faster ORC data source based on Apache ORC)

> Add new ORCFileFormat based on Apache ORC
> -----------------------------------------
>
>                 Key: SPARK-20682
>                 URL: https://issues.apache.org/jira/browse/SPARK-20682
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1, 2.2.0
>           Reporter: Dongjoon Hyun
>           Priority: Major
[jira] [Updated] (SPARK-22433) Linear regression R^2 train/test terminology related
[ https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Teng Peng updated SPARK-22433:
------------------------------
Description:
Traditional statistics is traditional statistics: its goals, framework, and terminology are not the same as ML's. However, in the linear-regression-related components this distinction is not clear, which shows up as:

1. regressionMetric + regressionEvaluator:
* R2 shouldn't be there.
* A better name would be "regressionPredictionMetric".

2. LinearRessionSuite:
* Shouldn't test R2 and residuals on test data.
* There is no train set or test set in this setting.

3. Terminology: there is no "linear regression with L1 regularization". Linear regression is linear; once a penalty term is added, it is no longer plain linear regression. Just call it "LASSO" or "ElasticNet".

There are more, and I am working on correcting them. They are not breaking anything, but it does not feel good to see the basic distinction blurred.

was:
Traditional statistics is traditional statistics: its goals, framework, and terminology are not the same as ML's. However, in the linear-regression-related components this distinction is not clear, which shows up as:

1. regressionMetric + regressionEvaluator:
* R2 shouldn't be there.
* A better name would be "regressionPredictionMetric".

2. LinearregRessionSuite:
* Shouldn't test R2 and residuals on test data.
* There is no train set or test set in this setting.

3. Terminology: there is no "linear regression with L1 regularization". Linear regression is linear; once a penalty term is added, it is no longer plain linear regression. Just call it "LASSO" or "ElasticNet".

There are more, and I am working on correcting them. They are not breaking anything, but it does not feel good to see the basic distinction blurred.

> Linear regression R^2 train/test terminology related
> -----------------------------------------------------
>
>                 Key: SPARK-22433
>                 URL: https://issues.apache.org/jira/browse/SPARK-22433
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0
>           Reporter: Teng Peng
>           Priority: Minor
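[Editor's sketch] The ticket's point that R^2 should not be evaluated on test data can be shown numerically. In the plain-Python sketch below (synthetic numbers, not from Spark), R^2 is defined relative to the variance of the data it is computed on, so on held-out data it can even be negative, which is one reason it reads as a training-fit diagnostic rather than a generic prediction metric.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# A constant model fitted to the training targets' mean:
train_y = [1.0, 2.0, 3.0]
model = sum(train_y) / len(train_y)  # predicts 2.0 everywhere

# On the training data, R^2 is exactly 0 by construction.
assert r2(train_y, [model] * 3) == 0.0

# On shifted held-out targets, the same model gets a negative R^2.
test_y = [10.0, 10.5, 11.0]
assert r2(test_y, [model] * 3) < 0
```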
[jira] [Updated] (SPARK-22433) Linear regression R^2 train/test terminology related
[ https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Teng Peng updated SPARK-22433:
------------------------------
Description:
Traditional statistics is traditional statistics: its goals, framework, and terminology are not the same as ML's. However, in the linear-regression-related components this distinction is not clear, which shows up as:

1. regressionMetric + regressionEvaluator:
* R2 shouldn't be there.
* A better name would be "regressionPredictionMetric".

2. LinearRegressionSuite:
* Shouldn't test R2 and residuals on test data.
* There is no train set or test set in this setting.

3. Terminology: there is no "linear regression with L1 regularization". Linear regression is linear; once a penalty term is added, it is no longer plain linear regression. Just call it "LASSO" or "ElasticNet".

There are more, and I am working on correcting them. They are not breaking anything, but it does not feel good to see the basic distinction blurred.

was:
Traditional statistics is traditional statistics: its goals, framework, and terminology are not the same as ML's. However, in the linear-regression-related components this distinction is not clear, which shows up as:

1. regressionMetric + regressionEvaluator:
* R2 shouldn't be there.
* A better name would be "regressionPredictionMetric".

2. LinearRessionSuite:
* Shouldn't test R2 and residuals on test data.
* There is no train set or test set in this setting.

3. Terminology: there is no "linear regression with L1 regularization". Linear regression is linear; once a penalty term is added, it is no longer plain linear regression. Just call it "LASSO" or "ElasticNet".

There are more, and I am working on correcting them. They are not breaking anything, but it does not feel good to see the basic distinction blurred.

> Linear regression R^2 train/test terminology related
> -----------------------------------------------------
>
>                 Key: SPARK-22433
>                 URL: https://issues.apache.org/jira/browse/SPARK-22433
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0
>           Reporter: Teng Peng
>           Priority: Minor
[jira] [Updated] (SPARK-22433) Linear regression R^2 train/test terminology related
[ https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Teng Peng updated SPARK-22433: -- Description: Traditional statistics is traditional statistics. Their goal, framework, and terminologies are not the same as ML. However, in linear regression related components, this distinction is not clear, which is reflected: 1. regressionMetric + regressionEvaluator : * R2 shouldn't be there. * A better name "regressionPredictionMetric". 2. LinearregRessionSuite: * Shouldn't test R2 and residuals on test data. * There is no train set and test set in this setting. 3. Terminology: there is no "linear regression with L1 regularization". Linear regression is linear. Adding a penalty term, then it is no longer linear. Just call it "LASSO", "ElasticNet". There are more. I am working on correcting them. They are not breaking anything, but it does not make one feel good to see the basic distinction is blurred. was: Traditional statistics is traditional statistics. Their goal, framework, and terminologies are not the same as ML. However, in linear regression related components, this distinction is not clear, which is reflected: 1. regressionMetric + regressionEvaluator : * R2 shouldn't be there. * A better name "regressionPredictionMetric". 2. LinearregRessionSuite: * Shouldn't test R2 and residuals on test data. * There is no train set and test set in this setting. 3. Terminology: there is no "linear regression with L1 regularization". Linear regression is linear. Adding a penalty term, then it is no longer linear. Just call it "LASSO", "ElasticNet". There are more. I am working on correcting them. They are not breaking anything, but it does not make one feel good to see the basic distinction is blurred. 
> Linear regression R^2 train/test terminology related > - > > Key: SPARK-22433 > URL: https://issues.apache.org/jira/browse/SPARK-22433 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Teng Peng >Priority: Minor > > Traditional statistics is traditional statistics. Their goal, framework, and > terminologies are not the same as ML. However, in linear regression related > components, this distinction is not clear, which is reflected: > 1. regressionMetric + regressionEvaluator : > * R2 shouldn't be there. > * A better name "regressionPredictionMetric". > 2. LinearRegressionSuite: > * Shouldn't test R2 and residuals on test data. > * There is no train set and test set in this setting. > 3. Terminology: there is no "linear regression with L1 regularization". > Linear regression is linear. Adding a penalty term, then it is no longer > linear. Just call it "LASSO", "ElasticNet". > There are more. I am working on correcting them. > They are not breaking anything, but it does not make one feel good to see the > basic distinction is blurred. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22433) Linear regression R^2 train/test terminology related
Teng Peng created SPARK-22433: - Summary: Linear regression R^2 train/test terminology related Key: SPARK-22433 URL: https://issues.apache.org/jira/browse/SPARK-22433 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0 Reporter: Teng Peng Priority: Minor Traditional statistics is traditional statistics. Their goal, framework, and terminologies are not the same as ML. However, in linear regression related components, this distinction is not clear, which is reflected: 1. regressionMetric + regressionEvaluator : * R2 shouldn't be there. * A better name "regressionPredictionMetric". 2. LinearRegressionSuite: * Shouldn't test R2 and residuals on test data. * There is no train set and test set in this setting. 3. Terminology: there is no "linear regression with L1 regularization". Linear regression is linear. Adding a penalty term, then it is no longer linear. Just call it "LASSO", "ElasticNet". There are more. I am working on correcting them. They are not breaking anything, but it does not make one feel good to see the basic distinction is blurred. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
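The statistical point above can be made concrete with a small, self-contained sketch (plain Scala with hypothetical data, not Spark code): R^2 compares residual error against variation around the mean of the observed responses, which is why it reads as an in-sample fit statistic rather than a held-out prediction metric.

```scala
// Minimal sketch (hypothetical data): R^2 = 1 - SS_res / SS_tot,
// where SS_tot is taken around the mean of the observed y values.
object R2Sketch {
  def r2(y: Seq[Double], yHat: Seq[Double]): Double = {
    val mean  = y.sum / y.length
    val ssTot = y.map(v => (v - mean) * (v - mean)).sum
    val ssRes = y.zip(yHat).map { case (v, p) => (v - p) * (v - p) }.sum
    1.0 - ssRes / ssTot
  }
}
```

For a perfect in-sample fit `r2` returns 1.0, and for a model that only predicts the training mean it returns 0.0; on a genuinely held-out set the statistic can even go negative, which is part of why the issue argues it does not belong among prediction metrics.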
[jira] [Commented] (SPARK-22405) Enrich the event information and add new event of ExternalCatalogEvent
[ https://issues.apache.org/jira/browse/SPARK-22405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236950#comment-16236950 ] Apache Spark commented on SPARK-22405: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/19649 > Enrich the event information and add new event of ExternalCatalogEvent > -- > > Key: SPARK-22405 > URL: https://issues.apache.org/jira/browse/SPARK-22405 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Priority: Minor > > We're building a data lineage tool in which we need to monitor the metadata > changes in {{ExternalCatalog}}; the current {{ExternalCatalog}} already provides > several useful events like "CreateDatabaseEvent" for a custom SparkListener to > use. But the information provided by such events is not rich enough; for > example, {{CreateTablePreEvent}} only provides the "database" name and "table" > name, not all the table metadata, which makes it hard for users to get all the > useful table-related information. > So here we propose to add a new {{ExternalCatalogEvent}} and enrich the > current existing events for all the catalog-related updates. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22405) Enrich the event information and add new event of ExternalCatalogEvent
[ https://issues.apache.org/jira/browse/SPARK-22405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22405: Assignee: (was: Apache Spark) > Enrich the event information and add new event of ExternalCatalogEvent > -- > > Key: SPARK-22405 > URL: https://issues.apache.org/jira/browse/SPARK-22405 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Priority: Minor > > We're building a data lineage tool in which we need to monitor the metadata > changes in {{ExternalCatalog}}; the current {{ExternalCatalog}} already provides > several useful events like "CreateDatabaseEvent" for a custom SparkListener to > use. But the information provided by such events is not rich enough; for > example, {{CreateTablePreEvent}} only provides the "database" name and "table" > name, not all the table metadata, which makes it hard for users to get all the > useful table-related information. > So here we propose to add a new {{ExternalCatalogEvent}} and enrich the > current existing events for all the catalog-related updates. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22405) Enrich the event information and add new event of ExternalCatalogEvent
[ https://issues.apache.org/jira/browse/SPARK-22405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22405: Assignee: Apache Spark > Enrich the event information and add new event of ExternalCatalogEvent > -- > > Key: SPARK-22405 > URL: https://issues.apache.org/jira/browse/SPARK-22405 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Assignee: Apache Spark >Priority: Minor > > We're building a data lineage tool in which we need to monitor the metadata > changes in {{ExternalCatalog}}; the current {{ExternalCatalog}} already provides > several useful events like "CreateDatabaseEvent" for a custom SparkListener to > use. But the information provided by such events is not rich enough; for > example, {{CreateTablePreEvent}} only provides the "database" name and "table" > name, not all the table metadata, which makes it hard for users to get all the > useful table-related information. > So here we propose to add a new {{ExternalCatalogEvent}} and enrich the > current existing events for all the catalog-related updates. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22426) Spark AM launching containers on node where External spark shuffle service failed to initialize
[ https://issues.apache.org/jira/browse/SPARK-22426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236939#comment-16236939 ] Saisai Shao commented on SPARK-22426: - This kind of scenario was handled in SPARK-13669 via the blacklist mechanism. > Spark AM launching containers on node where External spark shuffle service > failed to initialize > --- > > Key: SPARK-22426 > URL: https://issues.apache.org/jira/browse/SPARK-22426 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph >Priority: Major > > When the Spark External Shuffle Service on a NodeManager fails, the remote > executors will fail while fetching the data from the executors launched on > this Node. The Spark ApplicationMaster should not launch containers on this Node, > or should not use the external shuffle service. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14516) Clustering evaluator
[ https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236915#comment-16236915 ] Apache Spark commented on SPARK-14516: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/19648 > Clustering evaluator > > > Key: SPARK-14516 > URL: https://issues.apache.org/jira/browse/SPARK-14516 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: zhengruifeng >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.0 > > > MLlib does not have any general-purpose clustering metrics with a ground > truth. > In > [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics), > there are several kinds of metrics for this. > It may be meaningful to add some clustering metrics into MLlib. > This should be added as a {{ClusteringEvaluator}} class extending > {{Evaluator}} in spark.ml. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
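As an illustration of the kind of metric such an evaluator could expose, here is a minimal, self-contained sketch of the mean silhouette coefficient in plain Scala (made-up 2-D points and cluster ids; Spark's eventual {{ClusteringEvaluator}} API and its distributed implementation differ):

```scala
// Sketch of the mean silhouette coefficient for 2-D points.
// For each point: a = mean distance to its own cluster's other members,
// b = lowest mean distance to any other cluster; s = (b - a) / max(a, b).
// Assumes at least two clusters.
object SilhouetteSketch {
  def dist(a: (Double, Double), b: (Double, Double)): Double =
    math.sqrt(math.pow(a._1 - b._1, 2) + math.pow(a._2 - b._2, 2))

  // points: (coordinates, assigned cluster id)
  def silhouette(points: Seq[((Double, Double), Int)]): Double = {
    val byCluster: Map[Int, Seq[(Double, Double)]] =
      points.groupBy(_._2).map { case (c, ps) => c -> ps.map(_._1) }
    val scores = points.map { case (p, c) =>
      val own = byCluster(c).filterNot(_ == p)
      val a = if (own.isEmpty) 0.0 else own.map(dist(p, _)).sum / own.size
      val b = byCluster.collect {
        case (c2, ps) if c2 != c => ps.map(dist(p, _)).sum / ps.size
      }.min
      (b - a) / math.max(a, b)
    }
    scores.sum / scores.size
  }
}
```

Scores near 1 indicate tight, well-separated clusters; scores near 0 or below indicate overlapping assignments, which makes the metric usable without ground-truth labels.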
[jira] [Assigned] (SPARK-21087) CrossValidator, TrainValidationSplit should collect all models when fitting: Scala API
[ https://issues.apache.org/jira/browse/SPARK-21087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-21087: - Assignee: Weichen Xu > CrossValidator, TrainValidationSplit should collect all models when fitting: > Scala API > -- > > Key: SPARK-21087 > URL: https://issues.apache.org/jira/browse/SPARK-21087 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Major > > Add a parameter controlling whether to collect the full model list during > CrossValidator/TrainValidationSplit training (default is off, to avoid the > change causing OOM). > Add a method in CrossValidatorModel/TrainValidationSplitModel allowing the > user to get the model list. > CrossValidatorModelWriter adds an "option" allowing the user to control whether > to persist the model list to disk. > Note: when persisting the model list, use indices as the sub-model path. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21087) CrossValidator, TrainValidationSplit should collect all models when fitting: Scala API
[ https://issues.apache.org/jira/browse/SPARK-21087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21087: -- Shepherd: Joseph K. Bradley > CrossValidator, TrainValidationSplit should collect all models when fitting: > Scala API > -- > > Key: SPARK-21087 > URL: https://issues.apache.org/jira/browse/SPARK-21087 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Major > > Add a parameter controlling whether to collect the full model list during > CrossValidator/TrainValidationSplit training (default is off, to avoid the > change causing OOM). > Add a method in CrossValidatorModel/TrainValidationSplitModel allowing the > user to get the model list. > CrossValidatorModelWriter adds an "option" allowing the user to control whether > to persist the model list to disk. > Note: when persisting the model list, use indices as the sub-model path. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
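A minimal sketch of the proposed API shape (the class and the "training" logic below are stand-ins, not Spark's actual implementation): sub-models are retained only when the caller opts in, so the default memory footprint is unchanged.

```scala
// Hypothetical stand-in for a trained sub-model: the parameter it was
// trained with and its validation score.
case class Model(param: Double, score: Double)

// Sketch of an opt-in collectSubModels flag on a cross-validator.
class CrossValidatorSketch(val collectSubModels: Boolean = false) {
  var subModels: Option[Seq[Model]] = None // populated only when opted in

  def fit(paramGrid: Seq[Double]): Model = {
    // Stand-in "training": score each candidate parameter.
    val trained = paramGrid.map(p => Model(p, score = -math.abs(p - 0.5)))
    if (collectSubModels) subModels = Some(trained) // may be memory-heavy
    trained.maxBy(_.score)                          // best model, as today
  }
}
```

Keeping the flag off by default matches the OOM concern in the description: collecting one model per grid point multiplies driver memory use by the grid size.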
[jira] [Assigned] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22211: Assignee: (was: Apache Spark) > LimitPushDown optimization for FullOuterJoin generates wrong results > > > Key: SPARK-22211 > URL: https://issues.apache.org/jira/browse/SPARK-22211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: on community.cloude.databrick.com > Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11) >Reporter: Benyi Wang >Priority: Major > > LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may > generate a wrong result: > Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 > is selected, but at right side we have 100K rows including 999, the result > will be > - one row is (999, 999) > - the rest rows are (null, xxx) > Once you call show(), the row (999,999) has only 1/10th chance to be > selected by CollectLimit. > The actual optimization might be, > - push down limit > - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin. 
> Here is my notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html > {code:java} > import scala.util.Random._ > val dl = shuffle(1 to 10).toDF("id") > val dr = shuffle(1 to 10).toDF("id") > println("data frame dl:") > dl.explain > println("data frame dr:") > dr.explain > val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1) > j.explain > j.show(false) > {code} > {code} > data frame dl: > == Physical Plan == > LocalTableScan [id#10] > data frame dr: > == Physical Plan == > LocalTableScan [id#16] > == Physical Plan == > CollectLimit 1 > +- SortMergeJoin [id#10], [id#16], FullOuter >:- *Sort [id#10 ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#10, 200) >: +- *LocalLimit 1 >:+- LocalTableScan [id#10] >+- *Sort [id#16 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16, 200) > +- LocalTableScan [id#16] > import scala.util.Random._ > dl: org.apache.spark.sql.DataFrame = [id: int] > dr: org.apache.spark.sql.DataFrame = [id: int] > j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int] > ++---+ > |id |id | > ++---+ > |null|148| > ++---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
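The bug can be reproduced without Spark at all. The plain-collections sketch below (a hypothetical helper, not Catalyst code) shows why pushing a limit below a full outer join is unsound: limiting the left side first strands the remaining right-side rows as null-padded output, which is exactly the `|null|148|` row seen above.

```scala
// Pure-collections model of the bug: join-then-limit vs limit-then-join.
object LimitPushdownSketch {
  // Toy full outer join on equal values (assumes distinct elements per side).
  def fullOuterJoin(l: Seq[Int], r: Seq[Int]): Seq[(Option[Int], Option[Int])] = {
    val matched = l.filter(r.contains).map(i => (Option(i), Option(i)))
    val lOnly   = l.filterNot(r.contains).map(i => (Option(i), Option.empty[Int]))
    val rOnly   = r.filterNot(l.contains).map(i => (Option.empty[Int], Option(i)))
    matched ++ lOnly ++ rOnly
  }

  val left  = Seq(1, 2, 3, 4, 5)
  val right = Seq(1, 2, 3, 4, 5)

  // Correct plan: join first, then limit — every id pairs up, so any
  // single row we keep is a matched (x, x) pair.
  val correct = fullOuterJoin(left, right).take(1)

  // Buggy "optimization": limit the left side first, then join — the
  // other right-side ids survive only as (null, x) rows.
  val buggy = fullOuterJoin(left.take(1), right)
}
```

With the limit pushed down, four of the five output rows are null-padded, so a `CollectLimit 1` on top is far more likely to return a `(null, x)` row than the matched pair.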
[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236861#comment-16236861 ] Apache Spark commented on SPARK-22211: -- User 'henryr' has created a pull request for this issue: https://github.com/apache/spark/pull/19647 > LimitPushDown optimization for FullOuterJoin generates wrong results > > > Key: SPARK-22211 > URL: https://issues.apache.org/jira/browse/SPARK-22211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: on community.cloude.databrick.com > Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11) >Reporter: Benyi Wang >Priority: Major > > LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may > generate a wrong result: > Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 > is selected, but at right side we have 100K rows including 999, the result > will be > - one row is (999, 999) > - the rest rows are (null, xxx) > Once you call show(), the row (999,999) has only 1/10th chance to be > selected by CollectLimit. > The actual optimization might be, > - push down limit > - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin. 
> Here is my notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html > {code:java} > import scala.util.Random._ > val dl = shuffle(1 to 10).toDF("id") > val dr = shuffle(1 to 10).toDF("id") > println("data frame dl:") > dl.explain > println("data frame dr:") > dr.explain > val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1) > j.explain > j.show(false) > {code} > {code} > data frame dl: > == Physical Plan == > LocalTableScan [id#10] > data frame dr: > == Physical Plan == > LocalTableScan [id#16] > == Physical Plan == > CollectLimit 1 > +- SortMergeJoin [id#10], [id#16], FullOuter >:- *Sort [id#10 ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#10, 200) >: +- *LocalLimit 1 >:+- LocalTableScan [id#10] >+- *Sort [id#16 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16, 200) > +- LocalTableScan [id#16] > import scala.util.Random._ > dl: org.apache.spark.sql.DataFrame = [id: int] > dr: org.apache.spark.sql.DataFrame = [id: int] > j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int] > ++---+ > |id |id | > ++---+ > |null|148| > ++---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22211: Assignee: Apache Spark > LimitPushDown optimization for FullOuterJoin generates wrong results > > > Key: SPARK-22211 > URL: https://issues.apache.org/jira/browse/SPARK-22211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: on community.cloude.databrick.com > Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11) >Reporter: Benyi Wang >Assignee: Apache Spark >Priority: Major > > LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may > generate a wrong result: > Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 > is selected, but at right side we have 100K rows including 999, the result > will be > - one row is (999, 999) > - the rest rows are (null, xxx) > Once you call show(), the row (999,999) has only 1/10th chance to be > selected by CollectLimit. > The actual optimization might be, > - push down limit > - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin. 
> Here is my notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html > {code:java} > import scala.util.Random._ > val dl = shuffle(1 to 10).toDF("id") > val dr = shuffle(1 to 10).toDF("id") > println("data frame dl:") > dl.explain > println("data frame dr:") > dr.explain > val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1) > j.explain > j.show(false) > {code} > {code} > data frame dl: > == Physical Plan == > LocalTableScan [id#10] > data frame dr: > == Physical Plan == > LocalTableScan [id#16] > == Physical Plan == > CollectLimit 1 > +- SortMergeJoin [id#10], [id#16], FullOuter >:- *Sort [id#10 ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#10, 200) >: +- *LocalLimit 1 >:+- LocalTableScan [id#10] >+- *Sort [id#16 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16, 200) > +- LocalTableScan [id#16] > import scala.util.Random._ > dl: org.apache.spark.sql.DataFrame = [id: int] > dr: org.apache.spark.sql.DataFrame = [id: int] > j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int] > ++---+ > |id |id | > ++---+ > |null|148| > ++---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22429) Streaming checkpointing code does not retry after failure due to NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-22429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-22429: - Component/s: (was: Structured Streaming) DStreams > Streaming checkpointing code does not retry after failure due to > NullPointerException > - > > Key: SPARK-22429 > URL: https://issues.apache.org/jira/browse/SPARK-22429 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.3, 2.2.0 >Reporter: Tristan Stevens > > CheckpointWriteHandler has a built in retry mechanism. However > SPARK-14930/SPARK-13693 put in a fix to de-allocate the fs object, yet > initialises it in the wrong place for the while loop, and therefore on > attempt 2 it fails with a NPE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
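A self-contained sketch of the fix described above (all names are made up; this is not Spark's actual CheckpointWriteHandler): because the filesystem handle is released after a failed attempt, it must be (re)initialized inside the retry loop, not once before it — otherwise attempt 2 dereferences null.

```scala
// Sketch of the retry pattern. FakeFs stands in for the Hadoop FileSystem
// handle; pendingFailures is a test knob for how many writes fail first.
object RetrySketch {
  var pendingFailures = 0

  class FakeFs {
    def write(): Unit =
      if (pendingFailures > 0) { pendingFailures -= 1; throw new RuntimeException("transient") }
  }

  // Returns the attempt number that succeeded, or -1 if all attempts failed.
  def checkpoint(maxAttempts: Int): Int = {
    var attempts = 0
    var fs: FakeFs = null
    while (attempts < maxAttempts) {
      attempts += 1
      try {
        if (fs == null) fs = new FakeFs() // the fix: (re)initialize inside the loop
        fs.write()
        return attempts
      } catch {
        case _: Exception => fs = null    // handle released after a failed attempt
      }
    }
    -1
  }
}
```

Moving the `new FakeFs()` line above the `while` reproduces the reported shape of the bug: the first failure nulls the handle and the second pass crashes with an NPE instead of retrying.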
[jira] [Commented] (SPARK-22147) BlockId.hashCode allocates a StringBuilder/String on each call
[ https://issues.apache.org/jira/browse/SPARK-22147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236789#comment-16236789 ] Apache Spark commented on SPARK-22147: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/19646 > BlockId.hashCode allocates a StringBuilder/String on each call > -- > > Key: SPARK-22147 > URL: https://issues.apache.org/jira/browse/SPARK-22147 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 2.2.0 >Reporter: Sergei Lebedev >Assignee: Sergei Lebedev >Priority: Minor > Fix For: 2.3.0 > > > The base class {{BlockId}} > [defines|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockId.scala#L44] > {{hashCode}} and {{equals}} for all its subclasses in terms of {{name}}. > This makes the definitions of different ID types [very > concise|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockId.scala#L52]. > The downside, however, is redundant allocations. While I don't think this > could be the major issue, it is still a bit disappointing to increase GC > pressure on the driver for nothing. For our machine learning workloads, we've > seen as much as 10% of all allocations on the driver coming from > {{BlockId.hashCode}} calls done for > [BlockManagerMasterEndpoint.blockLocations|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L54]. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
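One way to avoid the repeated allocation, sketched in plain Scala (this mirrors the problem shape but uses made-up class names and is not necessarily Spark's eventual fix): cache the hash with a `lazy val` so the name string is built at most once per instance instead of on every `hashCode` call.

```scala
// The problem shape: the base class derives hashCode from a formatted
// name, so every hashCode call allocates a StringBuilder and a String.
abstract class BlockIdSketch {
  def name: String
  override def hashCode: Int = name.hashCode // allocates `name` each call
}

// One remedy: a subclass overrides the def with a lazy val, paying the
// string allocation at most once per instance.
case class RDDBlockIdSketch(rddId: Int, splitIndex: Int) extends BlockIdSketch {
  def name: String = "rdd_" + rddId + "_" + splitIndex
  override lazy val hashCode: Int = name.hashCode // cached after first use
}
```

This matters on the driver because structures like `blockLocations` hash the same ids repeatedly; a per-instance cache removes those allocations without changing hash semantics.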
[jira] [Updated] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
[ https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-22306: Fix Version/s: 2.3.0 > INFER_AND_SAVE overwrites important metadata in Parquet Metastore table > --- > > Key: SPARK-22306 > URL: https://issues.apache.org/jira/browse/SPARK-22306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Hive 2.3.0 (PostgreSQL metastore, stored as Parquet) > Spark 2.2.0 >Reporter: David Malinge >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.2.1, 2.3.0 > > > I noticed some critical changes on my hive tables and realized that they were > caused by a simple select on SparkSQL. Looking at the logs, I found out that > this select was actually performing an update on the database "Saving > case-sensitive schema for table". > I then found out that Spark 2.2.0 introduces a new default value for > spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE > The issue is that this update changes critical metadata of the table, in > particular: > - changes the owner to the current user > - removes bucketing metadata (BUCKETING_COLS, SDS) > - removes sorting metadata (SORT_COLS) > Switching the property to: NEVER_INFER prevents the issue. > Also, note that the damage can be fixed manually in Hive with e.g.: > {code:sql} > alter table [table_name] > clustered by ([col1], [col2]) > sorted by ([colA], [colB]) > into [n] buckets > {code} > *REPRODUCE (branch-2.2)* > In Spark 2.1.x (branch-2.1), NEVER_INFER is used. Spark 2.3 (master) branch > is good due to SPARK-17729. This is a regression on Spark 2.2 only. By > default, Parquet Hive table is affected and only Hive may suffer from this. > {code} > hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) > INTO 10 BUCKETS STORED AS PARQUET; > hive> INSERT INTO t VALUES('a','b'); > hive> DESC FORMATTED t; > ... 
> Num Buckets: 10 > Bucket Columns: [a, b] > Sort Columns: [Order(col:a, order:1), Order(col:b, order:1)] > scala> sql("SELECT * FROM t").show(false) > hive> DESC FORMATTED t; > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22432) Allow long creation site to be logged for RDDs
Michael Mior created SPARK-22432: Summary: Allow long creation site to be logged for RDDs Key: SPARK-22432 URL: https://issues.apache.org/jira/browse/SPARK-22432 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.0 Reporter: Michael Mior Would be interested in adding an option to store the long version of the RDD call site in the {{RDDInfo}} structure as opposed to the short one. This would allow the long version to appear in the event logs and the Spark UI and would be useful for debugging. I'm happy to submit a patch for this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22431) Creating Permanent view with illegal type
Herman van Hovell created SPARK-22431: - Summary: Creating Permanent view with illegal type Key: SPARK-22431 URL: https://issues.apache.org/jira/browse/SPARK-22431 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Herman van Hovell Priority: Major It is possible in Spark SQL to create a permanent view that uses an nested field with an illegal name. For example if we create the following view: {noformat} create view x as select struct('a' as `$q`, 1 as b) q {noformat} A simple select fails with the following exception: {noformat} select * from x; org.apache.spark.SparkException: Cannot recognize hive type string: struct<$q:string,b:int> at org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378) ... {noformat} Dropping the view isn't possible either: {noformat} drop view x; org.apache.spark.SparkException: Cannot recognize hive type string: struct<$q:string,b:int> at org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378) ... {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
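A sketch of the kind of up-front validation that could surface this earlier (the identifier rule below is an assumption for illustration, not Hive's exact grammar): reject nested field names that Hive's type-string parser cannot round-trip at view-creation time, instead of failing later on SELECT or DROP.

```scala
// Hypothetical check: a field name like "$q" survives view creation but
// later breaks fromHiveColumn's parse of "struct<$q:string,b:int>".
// The regex here is an illustrative assumption, not Hive's real grammar.
object FieldNameCheckSketch {
  def isParseSafe(fieldName: String): Boolean =
    fieldName.matches("[a-zA-Z_][a-zA-Z0-9_]*")

  // Validate every field of a struct type before persisting the view.
  def validateStruct(fields: Seq[String]): Either[String, Unit] =
    fields.find(f => !isParseSafe(f)) match {
      case Some(bad) => Left(s"illegal nested field name: $bad")
      case None      => Right(())
    }
}
```

Failing fast at CREATE VIEW would also avoid the worst part of the report: a view that cannot even be dropped once created.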
[jira] [Created] (SPARK-22430) Unknown tag warnings when building R docs with Roxygen 6.0.1
Joel Croteau created SPARK-22430: Summary: Unknown tag warnings when building R docs with Roxygen 6.0.1 Key: SPARK-22430 URL: https://issues.apache.org/jira/browse/SPARK-22430 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 2.3.0 Environment: Roxygen 6.0.1 Reporter: Joel Croteau When building R docs using create-rd.sh with Roxygen 6.0.1, a large number of unknown tag warnings are generated: {noformat} Warning: @export [schema.R#33]: unknown tag Warning: @export [schema.R#53]: unknown tag Warning: @export [schema.R#63]: unknown tag Warning: @export [schema.R#80]: unknown tag Warning: @export [schema.R#123]: unknown tag Warning: @export [schema.R#141]: unknown tag Warning: @export [schema.R#216]: unknown tag Warning: @export [generics.R#388]: unknown tag Warning: @export [generics.R#403]: unknown tag Warning: @export [generics.R#407]: unknown tag Warning: @export [generics.R#414]: unknown tag Warning: @export [generics.R#418]: unknown tag Warning: @export [generics.R#422]: unknown tag Warning: @export [generics.R#428]: unknown tag Warning: @export [generics.R#432]: unknown tag Warning: @export [generics.R#438]: unknown tag Warning: @export [generics.R#442]: unknown tag Warning: @export [generics.R#446]: unknown tag Warning: @export [generics.R#450]: unknown tag Warning: @export [generics.R#454]: unknown tag Warning: @export [generics.R#459]: unknown tag Warning: @export [generics.R#467]: unknown tag Warning: @export [generics.R#475]: unknown tag Warning: @export [generics.R#479]: unknown tag Warning: @export [generics.R#483]: unknown tag Warning: @export [generics.R#487]: unknown tag Warning: @export [generics.R#498]: unknown tag Warning: @export [generics.R#502]: unknown tag Warning: @export [generics.R#506]: unknown tag Warning: @export [generics.R#512]: unknown tag Warning: @export [generics.R#518]: unknown tag Warning: @export [generics.R#526]: unknown tag Warning: @export [generics.R#530]: unknown tag Warning: @export [generics.R#534]: 
unknown tag Warning: @export [generics.R#538]: unknown tag Warning: @export [generics.R#542]: unknown tag Warning: @export [generics.R#549]: unknown tag Warning: @export [generics.R#556]: unknown tag Warning: @export [generics.R#560]: unknown tag Warning: @export [generics.R#567]: unknown tag Warning: @export [generics.R#571]: unknown tag Warning: @export [generics.R#575]: unknown tag Warning: @export [generics.R#579]: unknown tag Warning: @export [generics.R#583]: unknown tag Warning: @export [generics.R#587]: unknown tag Warning: @export [generics.R#591]: unknown tag Warning: @export [generics.R#595]: unknown tag Warning: @export [generics.R#599]: unknown tag Warning: @export [generics.R#603]: unknown tag Warning: @export [generics.R#607]: unknown tag Warning: @export [generics.R#611]: unknown tag Warning: @export [generics.R#615]: unknown tag Warning: @export [generics.R#619]: unknown tag Warning: @export [generics.R#623]: unknown tag Warning: @export [generics.R#627]: unknown tag Warning: @export [generics.R#631]: unknown tag Warning: @export [generics.R#635]: unknown tag Warning: @export [generics.R#639]: unknown tag Warning: @export [generics.R#643]: unknown tag Warning: @export [generics.R#647]: unknown tag Warning: @export [generics.R#654]: unknown tag Warning: @export [generics.R#658]: unknown tag Warning: @export [generics.R#663]: unknown tag Warning: @export [generics.R#667]: unknown tag Warning: @export [generics.R#672]: unknown tag Warning: @export [generics.R#676]: unknown tag Warning: @export [generics.R#680]: unknown tag Warning: @export [generics.R#684]: unknown tag Warning: @export [generics.R#690]: unknown tag Warning: @export [generics.R#696]: unknown tag Warning: @export [generics.R#702]: unknown tag Warning: @export [generics.R#706]: unknown tag Warning: @export [generics.R#710]: unknown tag Warning: @export [generics.R#716]: unknown tag Warning: @export [generics.R#720]: unknown tag Warning: @export [generics.R#726]: unknown tag Warning: 
@export [generics.R#730]: unknown tag Warning: @export [generics.R#734]: unknown tag Warning: @export [generics.R#738]: unknown tag Warning: @export [generics.R#742]: unknown tag Warning: @export [generics.R#750]: unknown tag Warning: @export [generics.R#754]: unknown tag Warning: @export [generics.R#758]: unknown tag Warning: @export [generics.R#766]: unknown tag Warning: @export [generics.R#770]: unknown tag Warning: @export [generics.R#774]: unknown tag Warning: @export [generics.R#778]: unknown tag Warning: @export [generics.R#782]: unknown tag Warning: @export [generics.R#786]: unknown tag Warning: @export [generics.R#790]: unknown tag Warning: @export [generics.R#794]: unknown tag Warning: @export [generics.R#799]: unknown tag Warning: @export [generics.R#803]: unknown tag Warning: @export [generics.R#807]: unknown tag Warning: @export [g
[jira] [Commented] (SPARK-22429) Streaming checkpointing code does not retry after failure due to NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-22429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236492#comment-16236492 ] Tristan Stevens commented on SPARK-22429: - [~srowen] I've raised a PR against branch-2.2. master would not compile for me (before I made changes), but the patch should apply cleanly on there too. > Streaming checkpointing code does not retry after failure due to > NullPointerException > - > > Key: SPARK-22429 > URL: https://issues.apache.org/jira/browse/SPARK-22429 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 1.6.3, 2.2.0 >Reporter: Tristan Stevens > > CheckpointWriteHandler has a built in retry mechanism. However > SPARK-14930/SPARK-13693 put in a fix to de-allocate the fs object, yet > initialises it in the wrong place for the while loop, and therefore on > attempt 2 it fails with a NPE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22429) Streaming checkpointing code does not retry after failure due to NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-22429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22429: Assignee: Apache Spark > Streaming checkpointing code does not retry after failure due to > NullPointerException > - > > Key: SPARK-22429 > URL: https://issues.apache.org/jira/browse/SPARK-22429 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 1.6.3, 2.2.0 >Reporter: Tristan Stevens >Assignee: Apache Spark > > CheckpointWriteHandler has a built in retry mechanism. However > SPARK-14930/SPARK-13693 put in a fix to de-allocate the fs object, yet > initialises it in the wrong place for the while loop, and therefore on > attempt 2 it fails with a NPE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22429) Streaming checkpointing code does not retry after failure due to NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-22429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22429: Assignee: (was: Apache Spark) > Streaming checkpointing code does not retry after failure due to > NullPointerException > - > > Key: SPARK-22429 > URL: https://issues.apache.org/jira/browse/SPARK-22429 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 1.6.3, 2.2.0 >Reporter: Tristan Stevens > > CheckpointWriteHandler has a built in retry mechanism. However > SPARK-14930/SPARK-13693 put in a fix to de-allocate the fs object, yet > initialises it in the wrong place for the while loop, and therefore on > attempt 2 it fails with a NPE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22429) Streaming checkpointing code does not retry after failure due to NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-22429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236489#comment-16236489 ] Apache Spark commented on SPARK-22429: -- User 'tmgstevens' has created a pull request for this issue: https://github.com/apache/spark/pull/19645 > Streaming checkpointing code does not retry after failure due to > NullPointerException > - > > Key: SPARK-22429 > URL: https://issues.apache.org/jira/browse/SPARK-22429 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 1.6.3, 2.2.0 >Reporter: Tristan Stevens > > CheckpointWriteHandler has a built in retry mechanism. However > SPARK-14930/SPARK-13693 put in a fix to de-allocate the fs object, yet > initialises it in the wrong place for the while loop, and therefore on > attempt 2 it fails with a NPE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22401) Missing 2.1.2 tag in git
[ https://issues.apache.org/jira/browse/SPARK-22401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk reassigned SPARK-22401: --- Assignee: holdenk > Missing 2.1.2 tag in git > > > Key: SPARK-22401 > URL: https://issues.apache.org/jira/browse/SPARK-22401 > Project: Spark > Issue Type: Bug > Components: Build, Deploy >Affects Versions: 2.1.2 >Reporter: Brian Barker >Assignee: holdenk >Priority: Minor > Fix For: 2.1.2 > > > We only saw a 2.1.2-rc4 tag in git, no official release. The releases web > page shows 2.1.2 was released on October 9. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22401) Missing 2.1.2 tag in git
[ https://issues.apache.org/jira/browse/SPARK-22401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-22401. - Resolution: Fixed > Missing 2.1.2 tag in git > > > Key: SPARK-22401 > URL: https://issues.apache.org/jira/browse/SPARK-22401 > Project: Spark > Issue Type: Bug > Components: Build, Deploy >Affects Versions: 2.1.2 >Reporter: Brian Barker >Assignee: holdenk >Priority: Minor > Fix For: 2.1.2 > > > We only saw a 2.1.2-rc4 tag in git, no official release. The releases web > page shows 2.1.2 was released on October 9. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22401) Missing 2.1.2 tag in git
[ https://issues.apache.org/jira/browse/SPARK-22401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-22401: - Fix Version/s: 2.1.2 > Missing 2.1.2 tag in git > > > Key: SPARK-22401 > URL: https://issues.apache.org/jira/browse/SPARK-22401 > Project: Spark > Issue Type: Bug > Components: Build, Deploy >Affects Versions: 2.1.2 >Reporter: Brian Barker >Priority: Minor > Fix For: 2.1.2 > > > We only saw a 2.1.2-rc4 tag in git, no official release. The releases web > page shows 2.1.2 was released on October 9. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22401) Missing 2.1.2 tag in git
[ https://issues.apache.org/jira/browse/SPARK-22401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236407#comment-16236407 ] Holden Karau commented on SPARK-22401: -- Pushed, looking at the scripts they are all for tagging the RCs. > Missing 2.1.2 tag in git > > > Key: SPARK-22401 > URL: https://issues.apache.org/jira/browse/SPARK-22401 > Project: Spark > Issue Type: Bug > Components: Build, Deploy >Affects Versions: 2.1.2 >Reporter: Brian Barker >Priority: Minor > Fix For: 2.1.2 > > > We only saw a 2.1.2-rc4 tag in git, no official release. The releases web > page shows 2.1.2 was released in October 9. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20807) Add compression/decompression of data to ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-20807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-20807. -- Resolution: Won't Fix > Add compression/decompression of data to ColumnVector > - > > Key: SPARK-20807 > URL: https://issues.apache.org/jira/browse/SPARK-20807 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > While the current {{CachedBatch}} can compress data by using one of multiple > compression schemes, {{ColumnVector}} cannot compress data. It is mandatory > for the table cache. > This JIRA adds compression/decompression to {{ColumnVector}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21505) A dynamic join operator to improve the join reliability
[ https://issues.apache.org/jira/browse/SPARK-21505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236365#comment-16236365 ] Zhan Zhang commented on SPARK-21505: Any comments on this feature? Do you think the design is OK? If so, we are going to submit a PR. > A dynamic join operator to improve the join reliability > --- > > Key: SPARK-21505 > URL: https://issues.apache.org/jira/browse/SPARK-21505 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0, 2.3.0, 3.0.0 >Reporter: Lin >Priority: Major > Labels: features > > As we know, hash join is more efficient than sort-merge join. But today hash > join is not so widely used because it may fail with an OutOfMemory (OOM) error > due to limited memory resources, data skew, statistics mis-estimation and so > on. For example, if we apply shuffle hash join on an unevenly distributed > dataset, some partitions might be so large that we cannot build a Hash table > for that particular partition, causing an OOM error. When OOM happens, current > Spark throws an Exception, resulting in job failure. On the > other hand, if sort-merge join is used, there will be shuffle, sorting and > extra spill, degrading the join. Considering the efficiency > of hash join, we propose a fallback mechanism to dynamically use hash > join or sort-merge join at runtime, at the task level, to provide a more reliable > join operation. > This new dynamic join operator internally implements the logic of HashJoin, > Iterator Reconstruct, Sort, and MergeJoin. We show the process of this > dynamic join method as follows: > HashJoin: We start by building a Hash table on one side of the join partitions. > If the Hash table is built successfully, this is the same as the current > ShuffledHashJoin operator. > Sort: If we fail to build the Hash table due to the large partition size, we do > SortMergeJoin only on this partition.
But we need to rebuild the partition first. When OOM > happens, a Hash table corresponding to a partial part of this partition has > already been built successfully (e.g. the first 4000 rows of the RDD), and the iterator of > this partition is now pointing to the 4001st row of the partition. We reuse this > hash table to reconstruct the iterator for the first 4000 rows and > concatenate it with the remaining rows of this partition so that we can rebuild the > partition completely. On this re-built partition, we apply sorting based on > key values. > MergeJoin: After getting two sorted Iterators, we perform a regular merge join > on them and emit the records to downstream operators. > Iterator Reconstruct: The BytesToBytesMap has to be spilled to disk to release > memory for other operators, such as Sort, Join, etc. In addition, it has > to be converted to an Iterator, so that it can be concatenated with the remaining > items in the original iterator that was used to build the hash table. > Metadata Population: Necessary metadata, such as sorting keys, join type, > etc., has to be populated, so that it can be used by the potential Sort and > MergeJoin operators. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
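The fallback flow proposed above (try the hash path; on a memory failure, sort both sides and merge-join instead) can be pictured with a small sketch. This is illustrative plain Python with hypothetical names, not Spark's actual operators; a simple size budget stands in for the OOM condition:

```python
# Illustrative sketch of the dynamic join: try to build a hash table for the
# partition; if that fails (MemoryError stands in for OOM), fall back to
# sorting both inputs and merge-joining them. Inputs are (key, value) pairs.

def hash_join(build, probe, budget):
    """Shuffled-hash-join path: build a hash table on one side."""
    if len(build) > budget:
        raise MemoryError("partition too large to hash")  # stand-in for OOM
    table = {}
    for k, v in build:
        table.setdefault(k, []).append(v)
    return [(k, v, w) for k, w in probe for v in table.get(k, [])]

def sort_merge_join(left, right):
    """Fallback path: sort both inputs by key, then merge."""
    left, right = sorted(left), sorted(right)
    out, i = [], 0
    for k, v in left:
        while i < len(right) and right[i][0] < k:
            i += 1                      # advance to the first candidate key
        j = i
        while j < len(right) and right[j][0] == k:
            out.append((k, v, right[j][1]))
            j += 1
    return out

def dynamic_join(left, right, budget):
    """Prefer the hash path; fall back to sort-merge when it cannot fit."""
    try:
        return hash_join(left, right, budget)
    except MemoryError:
        return sort_merge_join(left, right)
```

Both paths produce the same rows (possibly in a different order), which is what makes a per-task runtime fallback safe.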
[jira] [Assigned] (SPARK-22243) streaming job failed to restart from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-22243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-22243: Assignee: StephenZou > streaming job failed to restart from checkpoint > --- > > Key: SPARK-22243 > URL: https://issues.apache.org/jira/browse/SPARK-22243 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0, 2.2.0 >Reporter: StephenZou >Assignee: StephenZou >Priority: Major > Fix For: 2.3.0 > > Attachments: CheckpointTest.scala > > > My spark-defaults.conf has an item related to the issue; I upload all jars in > Spark's jars folder to the HDFS path: > spark.yarn.jars hdfs:///spark/cache/spark2.2/* > The streaming job failed to restart from checkpoint, and ApplicationMaster throws > "Error: Could not find or load main class > org.apache.spark.deploy.yarn.ExecutorLauncher". The problem is always > reproducible. > I examined the SparkConf object recovered from the checkpoint, and found > spark.yarn.jars set to empty, which means none of the jars exist on the AM side. The > solution is that spark.yarn.jars should be reloaded from the properties file when > recovering from a checkpoint. > Attached is a demo to reproduce the issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22243) streaming job failed to restart from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-22243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-22243. -- Resolution: Fixed Fix Version/s: 2.3.0 > streaming job failed to restart from checkpoint > --- > > Key: SPARK-22243 > URL: https://issues.apache.org/jira/browse/SPARK-22243 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0, 2.2.0 >Reporter: StephenZou >Priority: Major > Fix For: 2.3.0 > > Attachments: CheckpointTest.scala > > > My spark-defaults.conf has an item related to the issue; I upload all jars in > Spark's jars folder to the HDFS path: > spark.yarn.jars hdfs:///spark/cache/spark2.2/* > The streaming job failed to restart from checkpoint, and ApplicationMaster throws > "Error: Could not find or load main class > org.apache.spark.deploy.yarn.ExecutorLauncher". The problem is always > reproducible. > I examined the SparkConf object recovered from the checkpoint, and found > spark.yarn.jars set to empty, which means none of the jars exist on the AM side. The > solution is that spark.yarn.jars should be reloaded from the properties file when > recovering from a checkpoint. > Attached is a demo to reproduce the issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
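The fix described in this report amounts to re-applying deploy-time settings over the configuration recovered from the checkpoint. A minimal sketch in plain Python with hypothetical names (not Spark's actual recovery code):

```python
# Sketch: when a conf is recovered from a checkpoint, re-apply deploy-time
# properties (such as spark.yarn.jars) from spark-defaults.conf so they are
# not lost. Confs are modeled as plain dicts here.

def recover_conf(checkpointed, defaults, reload_keys=("spark.yarn.jars",)):
    """Return the checkpointed conf with selected deploy-time keys restored."""
    conf = dict(checkpointed)
    for key in reload_keys:
        if not conf.get(key) and key in defaults:
            conf[key] = defaults[key]  # restore entries lost in the checkpoint
    return conf
```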
[jira] [Commented] (SPARK-22429) Streaming checkpointing code does not retry after failure due to NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-22429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236258#comment-16236258 ] Sean Owen commented on SPARK-22429: --- Sounds straightforward -- feel free to open a pull request. > Streaming checkpointing code does not retry after failure due to > NullPointerException > - > > Key: SPARK-22429 > URL: https://issues.apache.org/jira/browse/SPARK-22429 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 1.6.3, 2.2.0 >Reporter: Tristan Stevens > > CheckpointWriteHandler has a built in retry mechanism. However > SPARK-14930/SPARK-13693 put in a fix to de-allocate the fs object, yet > initialises it in the wrong place for the while loop, and therefore on > attempt 2 it fails with a NPE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22416) Move OrcOptions from `sql/hive` to `sql/core`
[ https://issues.apache.org/jira/browse/SPARK-22416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22416. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19636 [https://github.com/apache/spark/pull/19636] > Move OrcOptions from `sql/hive` to `sql/core` > - > > Key: SPARK-22416 > URL: https://issues.apache.org/jira/browse/SPARK-22416 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > Fix For: 2.3.0 > > > According to the > [discussion|https://github.com/apache/spark/pull/19571#issuecomment-339472976] > on SPARK-15474, we will add new OrcFileFormat in `sql/core` module. > For that, `OrcOptions` should be visible like `private[sql]` in `sql/core` > module, too. Previously, it was `private[orc]` in `sql/hive`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22416) Move OrcOptions from `sql/hive` to `sql/core`
[ https://issues.apache.org/jira/browse/SPARK-22416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22416: --- Assignee: Dongjoon Hyun > Move OrcOptions from `sql/hive` to `sql/core` > - > > Key: SPARK-22416 > URL: https://issues.apache.org/jira/browse/SPARK-22416 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.3.0 > > > According to the > [discussion|https://github.com/apache/spark/pull/19571#issuecomment-339472976] > on SPARK-15474, we will add new OrcFileFormat in `sql/core` module. > For that, `OrcOptions` should be visible like `private[sql]` in `sql/core` > module, too. Previously, it was `private[orc]` in `sql/hive`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22254) clean up the implementation of `growToSize` in CompactBuffer
[ https://issues.apache.org/jira/browse/SPARK-22254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236105#comment-16236105 ] Kazuaki Ishizaki commented on SPARK-22254: -- I started working on this, and will submit a PR within a few days. > clean up the implementation of `growToSize` in CompactBuffer > > > Key: SPARK-22254 > URL: https://issues.apache.org/jira/browse/SPARK-22254 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Feng Liu >Priority: Major > > Two issues: > 1. the arrayMax should be `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH` > 2. I believe some `-2` were introduced because `Integer.MAX_VALUE` was used > previously. We should make the calculation of newArrayLen concise. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
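The cleanup being proposed can be pictured with a small sketch (illustrative Python, not the Scala CompactBuffer code): once the cap is a pre-rounded maximum array length, the growth calculation reduces to "double, but at least `needed`, clamped to the cap", with no ad-hoc `-2` adjustments. The cap value below is a stand-in, not Spark's exact constant:

```python
# Sketch of a concise growToSize capacity calculation. ARRAY_MAX is an
# illustrative stand-in for a pre-rounded maximum JVM array length.

ARRAY_MAX = 2**31 - 16  # hypothetical rounded cap, slightly under Integer.MAX_VALUE

def new_array_len(capacity, needed, array_max=ARRAY_MAX):
    """Return a new capacity >= needed: doubled, but clamped to array_max."""
    if needed > array_max:
        raise ValueError("requested size exceeds the maximum array length")
    # Doubling amortizes growth cost; max() handles jumps past one doubling.
    return min(max(capacity * 2, needed), array_max)
```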
[jira] [Commented] (SPARK-22344) Prevent R CMD check from using /tmp
[ https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236103#comment-16236103 ] Felix Cheung commented on SPARK-22344: -- Yes to both. If SPARK_HOME is set before calling install.spark then we are not installing it. Boy it's getting complicated. > Prevent R CMD check from using /tmp > --- > > Key: SPARK-22344 > URL: https://issues.apache.org/jira/browse/SPARK-22344 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0 >Reporter: Shivaram Venkataraman >Priority: Major > > When R CMD check is run on the SparkR package it leaves behind files in /tmp > which is a violation of CRAN policy. We should instead write to Rtmpdir. > Notes from CRAN are below > {code} > Checking this leaves behind dirs >hive/$USER >$USER > and files named like >b4f6459b-0624-4100-8358-7aa7afbda757_resources > in /tmp, in violation of the CRAN Policy. > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22429) Streaming checkpointing code does not retry after failure due to NullPointerException
Tristan Stevens created SPARK-22429: --- Summary: Streaming checkpointing code does not retry after failure due to NullPointerException Key: SPARK-22429 URL: https://issues.apache.org/jira/browse/SPARK-22429 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.2.0, 1.6.3 Reporter: Tristan Stevens CheckpointWriteHandler has a built in retry mechanism. However SPARK-14930/SPARK-13693 put in a fix to de-allocate the fs object, yet initialises it in the wrong place for the while loop, and therefore on attempt 2 it fails with a NPE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
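The failure mode described in this report can be illustrated with a minimal, language-agnostic sketch. This is plain Python with hypothetical names, not Spark's actual CheckpointWriteHandler: the filesystem handle is acquired once, before the retry loop, and set to None after a failed attempt, so the second attempt dereferences a null handle instead of retrying with a fresh one.

```python
# Sketch of the retry bug: a handle released on failure is never re-acquired,
# so the retry dereferences None (TypeError here plays the role of the NPE).

attempts_log = {"writes": 0}

def flaky_write(fs):
    """Simulated checkpoint write: fails transiently on the first call."""
    if fs is None:
        raise TypeError("fs is None")          # analogue of the reported NPE
    attempts_log["writes"] += 1
    if attempts_log["writes"] == 1:
        raise IOError("transient HDFS failure")
    return "checkpoint written"

def checkpoint_buggy(max_attempts=3):
    fs = "fs-handle"                           # initialised outside the loop
    for _ in range(max_attempts):
        try:
            return flaky_write(fs)
        except IOError:
            fs = None                          # de-allocated, never re-acquired

def checkpoint_fixed(max_attempts=3):
    for _ in range(max_attempts):
        fs = "fs-handle"                       # re-acquired on every attempt
        try:
            return flaky_write(fs)
        except IOError:
            fs = None
    raise RuntimeError("all attempts failed")
```

In the buggy variant the second attempt crashes instead of retrying; moving the handle acquisition inside the loop restores the intended retry behaviour.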
[jira] [Resolved] (SPARK-22329) Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default
[ https://issues.apache.org/jira/browse/SPARK-22329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-22329. --- Resolution: Won't Fix > Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default > -- > > Key: SPARK-22329 > URL: https://issues.apache.org/jira/browse/SPARK-22329 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Priority: Critical > > In Spark 2.2.0, `spark.sql.hive.caseSensitiveInferenceMode` has a critical > issue. > - SPARK-19611 uses `INFER_AND_SAVE` at 2.2.0 since Spark 2.1.0 breaks some > Hive tables backed by case-sensitive data files. > bq. This situation will occur for any Hive table that wasn't created by Spark > or that was created prior to Spark 2.1.0. If a user attempts to run a query > over such a table containing a case-sensitive field name in the query > projection or in the query filter, the query will return 0 results in every > case. > - However, SPARK-22306 reports this also corrupts the Hive Metastore schema by > removing bucketing information (BUCKETING_COLS, SORT_COLS) and changing the owner. > - Since Spark 2.3.0 supports Bucketing, BUCKETING_COLS and SORT_COLS look > okay at least. However, we need to figure out the issue of changing owners. > Also, we cannot backport the bucketing patch into `branch-2.2`. We need more > tests before releasing 2.3.0. > The Hive Metastore is a shared resource and Spark should not corrupt it by > default. This issue proposes to recover that option back to `NEVER_INFER` like > Spark 2.2.0 by default. Users can take the risk of enabling `INFER_AND_SAVE` by > themselves. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
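Since this was resolved as Won't Fix, users who share a Hive Metastore can pin the safe behaviour themselves. A spark-defaults.conf fragment (the key and value names are taken from the report above):

```
# spark-defaults.conf -- opt out of schema inference explicitly, so that a
# plain SELECT cannot rewrite Hive Metastore metadata as described above
spark.sql.hive.caseSensitiveInferenceMode  NEVER_INFER
```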
[jira] [Resolved] (SPARK-22369) PySpark: Document methods of spark.catalog interface
[ https://issues.apache.org/jira/browse/SPARK-22369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-22369. - Resolution: Fixed Assignee: Hyukjin Kwon Fix Version/s: 2.3.0 > PySpark: Document methods of spark.catalog interface > > > Key: SPARK-22369 > URL: https://issues.apache.org/jira/browse/SPARK-22369 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.3.0 > > > The following methods from the {{spark.catalog}} interface are not documented. > {code:java} > $ pyspark > >>> dir(spark.catalog) > ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', > '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', > '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', > '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', > '__str__', '__subclasshook__', '__weakref__', '_jcatalog', '_jsparkSession', > '_reset', '_sparkSession', 'cacheTable', 'clearCache', 'createExternalTable', > 'createTable', 'currentDatabase', 'dropGlobalTempView', 'dropTempView', > 'isCached', 'listColumns', 'listDatabases', 'listFunctions', 'listTables', > 'recoverPartitions', 'refreshByPath', 'refreshTable', 'registerFunction', > 'setCurrentDatabase', 'uncacheTable'] > {code} > As a user I would like to have these methods documented on > http://spark.apache.org/docs/latest/api/python/pyspark.sql.html . Old methods > of the SQLContext (e.g. {{pyspark.sql.SQLContext.cacheTable()}} vs. > {{pyspark.sql.SparkSession.catalog.cacheTable()}} or > {{pyspark.sql.HiveContext.refreshTable()}} vs. > {{pyspark.sql.SparkSession.catalog.refreshTable()}} ) should point to the new > method. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22419) Hive and Hive Thriftserver jars missing from "without hadoop" build
[ https://issues.apache.org/jira/browse/SPARK-22419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235783#comment-16235783 ] Sean Owen commented on SPARK-22419: --- Yes, it's useful for future reference. Spark should work fine with 2.6 and later. Honestly, the existence of a 2.6/2.7 build is vestigial at this point. You should not need your own build, in the main. Making your own build might help version conflicts, but really you're looking at a log4j config issue in that case. > Hive and Hive Thriftserver jars missing from "without hadoop" build > --- > > Key: SPARK-22419 > URL: https://issues.apache.org/jira/browse/SPARK-22419 > Project: Spark > Issue Type: Question > Components: Build >Affects Versions: 2.1.1 >Reporter: Adam Kramer >Priority: Minor > > The "without hadoop" binary distribution does not have hive-related libraries > in the jars directory. This may be due to Hive being tied to major releases > of Hadoop. My project requires using Hadoop 2.8, so "without hadoop" version > seemed the best option. Should I use the make-distribution.sh instead? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22419) Hive and Hive Thriftserver jars missing from "without hadoop" build
[ https://issues.apache.org/jira/browse/SPARK-22419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235774#comment-16235774 ] Adam Kramer edited comment on SPARK-22419 at 11/2/17 2:01 PM: -- I'll assume it's on purpose for my stated reasons above. Apologies for not posting to the mailing list, but I have a feeling this could act as a good web reference from search; I rarely get results from the mailing list while troubleshooting in Google. Also, the documentation for using Spark with upgraded versions of Hadoop (e.g. 2.8) is definitely lacking or at best confusing (i.e. a binary version including a version of Hadoop libs can still be configured to use another version of Hadoop by following instructions from the "without hadoop" wiki page). I suspect those instructions are old, but when using SPARK_DIST_CLASSPATH to override the Hadoop libraries you run into things like log4j.properties files being hijacked by the Hadoop version, which changes your application logging altogether. My guess is that it's something that likely worked well a while ago or in a very specific situation, and thus requires a lot of trial and error. was (Author: adamjk): I'll assume it's on purpose for my stated reasons above. Apologies for not posting to the mailing list, but I have a feeling this could act as a good web reference from search; I rarely get results from the mailing list while troubleshooting in Google. Also, the documentation for using Spark with upgraded versions of Hadoop (e.g. 2.8) is definitely lacking or at best confusing (i.e. a binary version including a version of Hadoop libs can still be configured to use another version of Hadoop by following instructions from the "without hadoop" wiki page). I suspect those instructions are old, but when using SPARK_DIST_CLASSPATH to override the Hadoop libraries you run into things like log4j.properties files being hijacked by the Hadoop version, which changes your application logging altogether.
My guess is that it's something that likely worked well a while ago or in a very specific situation, and requires a lot of investigation. > Hive and Hive Thriftserver jars missing from "without hadoop" build > --- > > Key: SPARK-22419 > URL: https://issues.apache.org/jira/browse/SPARK-22419 > Project: Spark > Issue Type: Question > Components: Build >Affects Versions: 2.1.1 >Reporter: Adam Kramer >Priority: Minor > > The "without hadoop" binary distribution does not have hive-related libraries > in the jars directory. This may be due to Hive being tied to major releases > of Hadoop. My project requires using Hadoop 2.8, so "without hadoop" version > seemed the best option. Should I use the make-distribution.sh instead? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22419) Hive and Hive Thriftserver jars missing from "without hadoop" build
[ https://issues.apache.org/jira/browse/SPARK-22419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235774#comment-16235774 ] Adam Kramer commented on SPARK-22419: - I'll assume it's on purpose for my stated reasons above. Apologies for not posting to the mailing list, but I have a feeling this could act as a good web reference from search; I rarely get results from the mailing list while troubleshooting in Google. Also, the documentation for using Spark with upgraded versions of Hadoop (e.g. 2.8) is definitely lacking or at best confusing (i.e. a binary version including a version of Hadoop libs can still be configured to use another version of Hadoop by following instructions from the "without hadoop" wiki page). I suspect those instructions are old, but when using SPARK_DIST_CLASSPATH to override the Hadoop libraries you run into things like log4j.properties files being hijacked by the Hadoop version, which changes your application logging altogether. My guess is that it's something that likely worked well a while ago or in a very specific situation, and requires a lot of investigation. > Hive and Hive Thriftserver jars missing from "without hadoop" build > --- > > Key: SPARK-22419 > URL: https://issues.apache.org/jira/browse/SPARK-22419 > Project: Spark > Issue Type: Question > Components: Build >Affects Versions: 2.1.1 >Reporter: Adam Kramer >Priority: Minor > > The "without hadoop" binary distribution does not have hive-related libraries > in the jars directory. This may be due to Hive being tied to major releases > of Hadoop. My project requires using Hadoop 2.8, so "without hadoop" version > seemed the best option. Should I use the make-distribution.sh instead? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
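For reference, the "without hadoop" build is wired to a user-provided Hadoop via SPARK_DIST_CLASSPATH. The commonly documented spark-env.sh fragment is below; note that everything `hadoop classpath` returns, including any config directories carrying a log4j.properties, ends up on Spark's classpath, which is how the logging hijack described in the comment above can occur:

```
# conf/spark-env.sh -- point a "without hadoop" build at an existing
# Hadoop installation (this puts Hadoop's jars, plus any config files
# on its classpath, onto Spark's classpath)
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
```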
[jira] [Commented] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
[ https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235725#comment-16235725 ] Apache Spark commented on SPARK-22306: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/19644 > INFER_AND_SAVE overwrites important metadata in Parquet Metastore table > --- > > Key: SPARK-22306 > URL: https://issues.apache.org/jira/browse/SPARK-22306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet) > Spark 2.2.0 >Reporter: David Malinge >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.2.1 > > > I noticed some critical changes on my hive tables and realized that they were > caused by a simple select on SparkSQL. Looking at the logs, I found out that > this select was actually performing an update on the database "Saving > case-sensitive schema for table". > I then found out that Spark 2.2.0 introduces a new default value for > spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE > The issue is that this update changes critical metadata of the table, in > particular: > - changes the owner to the current user > - removes bucketing metadata (BUCKETING_COLS, SDS) > - removes sorting metadata (SORT_COLS) > Switching the property to: NEVER_INFER prevents the issue. > Also, note that the damage can be fixed manually in Hive with e.g.: > {code:sql} > alter table [table_name] > clustered by ([col1], [col2]) > sorted by ([colA], [colB]) > into [n] buckets > {code} > *REPRODUCE (branch-2.2)* > In Spark 2.1.x (branch-2.1), NEVER_INFER is used. Spark 2.3 (master) branch > is good due to SPARK-17729. This is a regression on Spark 2.2 only. By > default, Parquet Hive table is affected and only Hive may suffer from this. 
> {code} > hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) > INTO 10 BUCKETS STORED AS PARQUET; > hive> INSERT INTO t VALUES('a','b'); > hive> DESC FORMATTED t; > ... > Num Buckets: 10 > Bucket Columns: [a, b] > Sort Columns: [Order(col:a, order:1), Order(col:b, order:1)] > scala> sql("SELECT * FROM t").show(false) > hive> DESC FORMATTED t; > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
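Until the 2.2.1 fix is available, the issue description above says switching the inference mode prevents the destructive write-back. A sketch of that setting (NEVER_INFER matches the 2.1.x behavior; values other than the property name and the modes named in this issue are not taken from this thread):

```properties
# spark-defaults.conf -- stop Spark 2.2.0 from saving inferred
# case-sensitive schemas back to the Hive metastore, which is the
# update that drops the bucketing and sorting metadata
spark.sql.hive.caseSensitiveInferenceMode  NEVER_INFER
```

INFER_ONLY is the third mode of this property; it also avoids the metastore write, at the cost of re-inferring the schema in each session.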
[jira] [Resolved] (SPARK-22145) Issues with driver re-starting on mesos (supervise)
[ https://issues.apache.org/jira/browse/SPARK-22145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22145. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19374 [https://github.com/apache/spark/pull/19374] > Issues with driver re-starting on mesos (supervise) > --- > > Key: SPARK-22145 > URL: https://issues.apache.org/jira/browse/SPARK-22145 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Major > Fix For: 2.3.0 > > > There are two issues with driver re-starting on mesos using the supervise > flag: > - We need to add spark.mesos.driver.frameworkId to the reloaded properties > for checkpointing, otherwise the new frameworkId propagated by the dispatcher > will be overwritten by the checkpointed data. > - Unique driver task ids are not used by the dispatcher: > https://issues.apache.org/jira/browse/MESOS-4737 > https://issues.apache.org/jira/browse/MESOS-3070 > This issue is the same in principle as in the case with standalone mode where > the master needs to re-launch drivers with a new appId (driverId) to deal > with network partitions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22145) Issues with driver re-starting on mesos (supervise)
[ https://issues.apache.org/jira/browse/SPARK-22145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-22145: - Assignee: Stavros Kontopoulos > Issues with driver re-starting on mesos (supervise) > --- > > Key: SPARK-22145 > URL: https://issues.apache.org/jira/browse/SPARK-22145 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Assignee: Stavros Kontopoulos >Priority: Major > Fix For: 2.3.0 > > > There are two issues with driver re-starting on mesos using the supervise > flag: > - We need to add spark.mesos.driver.frameworkId to the reloaded properties > for checkpointing, otherwise the new frameworkId propagated by the dispatcher > will be overwritten by the checkpointed data. > - Unique driver task ids are not used by the dispatcher: > https://issues.apache.org/jira/browse/MESOS-4737 > https://issues.apache.org/jira/browse/MESOS-3070 > This issue is the same in principle as in the case with standalone mode where > the master needs to re-launch drivers with a new appId (driverId) to deal > with network partitions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21725) spark thriftserver insert overwrite table partition select
[ https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido resolved SPARK-21725. - Resolution: Not A Bug > spark thriftserver insert overwrite table partition select > --- > > Key: SPARK-21725 > URL: https://issues.apache.org/jira/browse/SPARK-21725 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 > Environment: centos 6.7 spark 2.1 jdk8 >Reporter: xinzhang >Priority: Major > Labels: spark-sql > > use thriftserver create table with partitions. > session 1: > SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) > partitioned by (pt string) stored as parquet; > --ok > !exit > session 2: > SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) > partitioned by (pt string) stored as parquet; > --ok > !exit > session 3: > --connect the thriftserver > SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 > partition(pt='1') select count(1) count from tmp_11; > --ok > !exit > session 4(do it again): > --connect the thriftserver > SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 > partition(pt='1') select count(1) count from tmp_11; > --error > !exit > - > 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing > query, currentState RUNNING, > java.lang.reflect.InvocationTargetException > .. > .. 
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move > source > hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053 > 512282-2/-ext-1/part-0 to destination > hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0 > at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644) > at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711) > at > org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403) > at > org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324) > ... 45 more > Caused by: java.io.IOException: Filesystem closed > > - > The doc describing Parquet tables is here: > http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files > Hive metastore Parquet table conversion > When reading from and writing to Hive metastore Parquet tables, Spark SQL > will try to use its own Parquet support instead of Hive SerDe for better > performance. This behavior is controlled by the > spark.sql.hive.convertMetastoreParquet configuration, and is turned on by > default. > I am confused: the problem appears with partitioned tables but everything is > fine with non-partitioned tables. Does this mean Spark does not use its own > Parquet support? Could someone suggest how I can avoid the issue? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
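Since the quoted documentation says the built-in Parquet path is controlled by spark.sql.hive.convertMetastoreParquet, one thing a reader could try is disabling the conversion so Hive's SerDe handles the partitioned writes. Whether this avoids the "Filesystem closed" error above is not confirmed anywhere in this thread; it is only a sketch of the setting the docs mention.

```properties
# spark-defaults.conf -- make Spark SQL use the Hive SerDe instead of its
# built-in Parquet support for Hive metastore Parquet tables (default: true)
spark.sql.hive.convertMetastoreParquet  false
```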
[jira] [Commented] (SPARK-22398) Partition directories with leading 0s cause wrong results
[ https://issues.apache.org/jira/browse/SPARK-22398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235710#comment-16235710 ] Marco Gaido commented on SPARK-22398: - [~viirya] I see your point. Thanks for your answer. > Partition directories with leading 0s cause wrong results > - > > Key: SPARK-22398 > URL: https://issues.apache.org/jira/browse/SPARK-22398 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bogdan Raducanu >Priority: Major > > Repro case: > {code} > spark.range(8).selectExpr("'0' || cast(id as string) as id", "id as > b").write.mode("overwrite").partitionBy("id").parquet("/tmp/bug1") > spark.read.parquet("/tmp/bug1").where("id in ('01')").show > +---+---+ > | b| id| > +---+---+ > +---+---+ > spark.read.parquet("/tmp/bug1").where("id = '01'").show > +---+---+ > | b| id| > +---+---+ > | 1| 1| > +---+---+ > {code} > I think somewhere there is some special handling of this case for equals but > not the same for IN. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22398) Partition directories with leading 0s cause wrong results
[ https://issues.apache.org/jira/browse/SPARK-22398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235708#comment-16235708 ] Marco Gaido commented on SPARK-22398: - [~hyukjin.kwon] I think there are two points here: 1) partitions with leading 0s are interpreted as integers (I think this is wrong behavior, but it can be avoided by disabling type inference) 2) IN type coercion with literals behaves differently from type coercion in other parts. Given the title of the JIRA, I thought the best option was to track 1 here and open a new JIRA with a relevant title for 2. > Partition directories with leading 0s cause wrong results > - > > Key: SPARK-22398 > URL: https://issues.apache.org/jira/browse/SPARK-22398 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bogdan Raducanu >Priority: Major > > Repro case: > {code} > spark.range(8).selectExpr("'0' || cast(id as string) as id", "id as > b").write.mode("overwrite").partitionBy("id").parquet("/tmp/bug1") > spark.read.parquet("/tmp/bug1").where("id in ('01')").show > +---+---+ > | b| id| > +---+---+ > +---+---+ > spark.read.parquet("/tmp/bug1").where("id = '01'").show > +---+---+ > | b| id| > +---+---+ > | 1| 1| > +---+---+ > {code} > I think somewhere there is some special handling of this case for equals but > not the same for IN. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
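The type inference mentioned in point 1 is partition-column type inference, which can be switched off so a directory value like "01" stays a string. A sketch of the relevant setting (this addresses only the inference point, not the IN type-coercion point):

```properties
# spark-defaults.conf -- keep partition directory values such as "01" as
# strings instead of inferring numeric partition column types (default: true)
spark.sql.sources.partitionColumnTypeInference.enabled  false
```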
[jira] [Resolved] (SPARK-22408) RelationalGroupedDataset's distinct pivot value calculation launches unnecessary stages
[ https://issues.apache.org/jira/browse/SPARK-22408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-22408. - Resolution: Fixed Assignee: Patrick Woody Fix Version/s: 2.3.0 > RelationalGroupedDataset's distinct pivot value calculation launches > unnecessary stages > --- > > Key: SPARK-22408 > URL: https://issues.apache.org/jira/browse/SPARK-22408 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Patrick Woody >Assignee: Patrick Woody > Fix For: 2.3.0 > > > When calculating the distinct values for a pivot in RelationalGroupedDataset > (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala#L322), > we sort before doing a take(maxValues + 1). > We should be able to improve this by adding a global limit before the sort, > which should reduce the work of the sort, and by simply doing a collect to > avoid launching multiple stages as part of the take. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11421) Add the ability to add a jar to the current class loader
[ https://issues.apache.org/jira/browse/SPARK-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235649#comment-16235649 ] Apache Spark commented on SPARK-11421: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/19643 > Add the ability to add a jar to the current class loader > > > Key: SPARK-11421 > URL: https://issues.apache.org/jira/browse/SPARK-11421 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: holdenk >Priority: Minor > > addJar adds jars for future operations, but could also add to the current > class loader; this would be really useful most likely in Python & R, where > some included Python code may wish to add some jars. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
[ https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22306. - Resolution: Fixed Fix Version/s: 2.2.1 Issue resolved by pull request 19622 [https://github.com/apache/spark/pull/19622] > INFER_AND_SAVE overwrites important metadata in Parquet Metastore table > --- > > Key: SPARK-22306 > URL: https://issues.apache.org/jira/browse/SPARK-22306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet) > Spark 2.2.0 >Reporter: David Malinge >Priority: Critical > Fix For: 2.2.1 > > > I noticed some critical changes on my hive tables and realized that they were > caused by a simple select on SparkSQL. Looking at the logs, I found out that > this select was actually performing an update on the database "Saving > case-sensitive schema for table". > I then found out that Spark 2.2.0 introduces a new default value for > spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE > The issue is that this update changes critical metadata of the table, in > particular: > - changes the owner to the current user > - removes bucketing metadata (BUCKETING_COLS, SDS) > - removes sorting metadata (SORT_COLS) > Switching the property to: NEVER_INFER prevents the issue. > Also, note that the damage can be fix manually in Hive with e.g.: > {code:sql} > alter table [table_name] > clustered by ([col1], [col2]) > sorted by ([colA], [colB]) > into [n] buckets > {code} > *REPRODUCE (branch-2.2)* > In Spark 2.1.x (branch-2.1), NEVER_INFER is used. Spark 2.3 (master) branch > is good due to SPARK-17729. This is a regression on Spark 2.2 only. By > default, Parquet Hive table is affected and only Hive may suffer from this. 
> {code} > hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) > INTO 10 BUCKETS STORED AS PARQUET; > hive> INSERT INTO t VALUES('a','b'); > hive> DESC FORMATTED t; > ... > Num Buckets: 10 > Bucket Columns: [a, b] > Sort Columns: [Order(col:a, order:1), Order(col:b, order:1)] > scala> sql("SELECT * FROM t").show(false) > hive> DESC FORMATTED t; > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
[ https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22306: --- Assignee: Wenchen Fan > INFER_AND_SAVE overwrites important metadata in Parquet Metastore table > --- > > Key: SPARK-22306 > URL: https://issues.apache.org/jira/browse/SPARK-22306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet) > Spark 2.2.0 >Reporter: David Malinge >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.2.1 > > > I noticed some critical changes on my hive tables and realized that they were > caused by a simple select on SparkSQL. Looking at the logs, I found out that > this select was actually performing an update on the database "Saving > case-sensitive schema for table". > I then found out that Spark 2.2.0 introduces a new default value for > spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE > The issue is that this update changes critical metadata of the table, in > particular: > - changes the owner to the current user > - removes bucketing metadata (BUCKETING_COLS, SDS) > - removes sorting metadata (SORT_COLS) > Switching the property to: NEVER_INFER prevents the issue. > Also, note that the damage can be fix manually in Hive with e.g.: > {code:sql} > alter table [table_name] > clustered by ([col1], [col2]) > sorted by ([colA], [colB]) > into [n] buckets > {code} > *REPRODUCE (branch-2.2)* > In Spark 2.1.x (branch-2.1), NEVER_INFER is used. Spark 2.3 (master) branch > is good due to SPARK-17729. This is a regression on Spark 2.2 only. By > default, Parquet Hive table is affected and only Hive may suffer from this. > {code} > hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) > INTO 10 BUCKETS STORED AS PARQUET; > hive> INSERT INTO t VALUES('a','b'); > hive> DESC FORMATTED t; > ... 
> Num Buckets: 10 > Bucket Columns: [a, b] > Sort Columns: [Order(col:a, order:1), Order(col:b, order:1)] > scala> sql("SELECT * FROM t").show(false) > hive> DESC FORMATTED t; > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22428) Document spark properties for configuring the ContextCleaner
[ https://issues.apache.org/jira/browse/SPARK-22428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235559#comment-16235559 ] Sean Owen commented on SPARK-22428: --- It's probably OK to do so, but not all properties are meant to be guaranteed and documented as an API. > Document spark properties for configuring the ContextCleaner > > > Key: SPARK-22428 > URL: https://issues.apache.org/jira/browse/SPARK-22428 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Priority: Minor > > The spark properties for configuring the ContextCleaner as described on > https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-service-contextcleaner.html > are not documented in the official documentation at > https://spark.apache.org/docs/latest/configuration.html#available-properties > . > As a user I would like to have the following spark properties documented in > the official documentation: > {code:java} > spark.cleaner.periodicGC.interval > spark.cleaner.referenceTracking > spark.cleaner.referenceTracking.blocking > spark.cleaner.referenceTracking.blocking.shuffle > spark.cleaner.referenceTracking.cleanCheckpoints > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22428) Document spark properties for configuring the ContextCleaner
Andreas Maier created SPARK-22428: - Summary: Document spark properties for configuring the ContextCleaner Key: SPARK-22428 URL: https://issues.apache.org/jira/browse/SPARK-22428 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.2.0 Reporter: Andreas Maier Priority: Minor The spark properties for configuring the ContextCleaner as described on https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-service-contextcleaner.html are not documented in the official documentation at https://spark.apache.org/docs/latest/configuration.html#available-properties . As a user I would like to have the following spark properties documented in the official documentation: {code:java} spark.cleaner.periodicGC.interval spark.cleaner.referenceTracking spark.cleaner.referenceTracking.blocking spark.cleaner.referenceTracking.blocking.shuffle spark.cleaner.referenceTracking.cleanCheckpoints {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
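For reference, the requested properties with their defaults as reported by the linked gitbook page; the values below are taken from that third-party source rather than the official documentation, so treat them as unverified until documented upstream:

```properties
# spark-defaults.conf -- ContextCleaner settings with reported defaults
spark.cleaner.periodicGC.interval                   30min
spark.cleaner.referenceTracking                     true
spark.cleaner.referenceTracking.blocking            true
spark.cleaner.referenceTracking.blocking.shuffle    false
spark.cleaner.referenceTracking.cleanCheckpoints    false
```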
[jira] [Commented] (SPARK-22410) Excessive spill for Pyspark UDF when a row has shrunk
[ https://issues.apache.org/jira/browse/SPARK-22410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235522#comment-16235522 ] Apache Spark commented on SPARK-22410: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/19642 > Excessive spill for Pyspark UDF when a row has shrunk > - > > Key: SPARK-22410 > URL: https://issues.apache.org/jira/browse/SPARK-22410 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Reproduced on up-to-date master >Reporter: Clément Stenac >Priority: Minor > > Hi, > The following code processes 900KB of data and outputs around 2MB of data. > However, to process it, Spark needs to spill roughly 12 GB of data. > {code:python} > from pyspark.sql import SparkSession > from pyspark.sql.functions import * > from pyspark.sql.types import * > import json > ss = SparkSession.builder.getOrCreate() > # Create a few lines of data (5 lines). > # Each line is made of a string, and an array of 1 strings > # Total size of data is around 900 KB > lines_of_file = [ "this is a line" for x in xrange(1) ] > file_obj = [ "this_is_a_foldername/this_is_a_filename", lines_of_file ] > data = [ file_obj for x in xrange(5) ] > # Make a two-columns dataframe out of it > small_df = ss.sparkContext.parallelize(data).map(lambda x : (x[0], > x[1])).toDF(["file", "lines"]) > # We then explode the array, so we now have 5 rows in the dataframe, with > 2 columns, the 2nd > # column now has only "this is a line" as content > exploded = small_df.select("file", explode("lines")) > print("Exploded") > print(exploded.explain()) > # Now, just process it with a trivial Pyspark UDF that touches the first > column > # (the one which was not an array) > def split_key(s): > return s.split("/")[1] > split_key_udf = udf(split_key, StringType()) > with_filename = exploded.withColumn("filename", split_key_udf("file")) > # As expected, explain plan is very 
simple (BatchEval -> Explode -> Project > -> ScanExisting) > print(with_filename.explain()) > # Getting the head will spill around 12 GB of data > print(with_filename.head()) > {code} > The spill happens in the HybridRowQueue that is used to merge the part that > went through the Python worker and the part that didn't. > The problem comes from the fact that when it is added to the HybridRowQueue, > the UnsafeRow has a totalSizeInBytes of ~24 (seen by adding debug message > in HybridRowQueue), whereas, since it's after the explode, the actual size of > the row should be in the ~60 bytes range. > My understanding is that the row has retained the size it consumed *prior* to > the explode (at that time, the size of each of the 5 rows was indeed ~24 > bytes. > A workaround is to do exploded.cache() before calling the UDF. The fact of > going through the InMemoryColumnarTableScan "resets" the wrongful size of the > UnsafeRow. > Thanks! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22410) Excessive spill for Pyspark UDF when a row has shrunk
[ https://issues.apache.org/jira/browse/SPARK-22410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22410: Assignee: Apache Spark > Excessive spill for Pyspark UDF when a row has shrunk > - > > Key: SPARK-22410 > URL: https://issues.apache.org/jira/browse/SPARK-22410 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Reproduced on up-to-date master >Reporter: Clément Stenac >Assignee: Apache Spark >Priority: Minor > > Hi, > The following code processes 900KB of data and outputs around 2MB of data. > However, to process it, Spark needs to spill roughly 12 GB of data. > {code:python} > from pyspark.sql import SparkSession > from pyspark.sql.functions import * > from pyspark.sql.types import * > import json > ss = SparkSession.builder.getOrCreate() > # Create a few lines of data (5 lines). > # Each line is made of a string, and an array of 1 strings > # Total size of data is around 900 KB > lines_of_file = [ "this is a line" for x in xrange(1) ] > file_obj = [ "this_is_a_foldername/this_is_a_filename", lines_of_file ] > data = [ file_obj for x in xrange(5) ] > # Make a two-columns dataframe out of it > small_df = ss.sparkContext.parallelize(data).map(lambda x : (x[0], > x[1])).toDF(["file", "lines"]) > # We then explode the array, so we now have 5 rows in the dataframe, with > 2 columns, the 2nd > # column now has only "this is a line" as content > exploded = small_df.select("file", explode("lines")) > print("Exploded") > print(exploded.explain()) > # Now, just process it with a trivial Pyspark UDF that touches the first > column > # (the one which was not an array) > def split_key(s): > return s.split("/")[1] > split_key_udf = udf(split_key, StringType()) > with_filename = exploded.withColumn("filename", split_key_udf("file")) > # As expected, explain plan is very simple (BatchEval -> Explode -> Project > -> ScanExisting) > print(with_filename.explain()) > # Getting the 
head will spill around 12 GB of data > print(with_filename.head()) > {code} > The spill happens in the HybridRowQueue that is used to merge the part that > went through the Python worker and the part that didn't. > The problem comes from the fact that when it is added to the HybridRowQueue, > the UnsafeRow has a totalSizeInBytes of ~24 (seen by adding debug message > in HybridRowQueue), whereas, since it's after the explode, the actual size of > the row should be in the ~60 bytes range. > My understanding is that the row has retained the size it consumed *prior* to > the explode (at that time, the size of each of the 5 rows was indeed ~24 > bytes. > A workaround is to do exploded.cache() before calling the UDF. The fact of > going through the InMemoryColumnarTableScan "resets" the wrongful size of the > UnsafeRow. > Thanks! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22410) Excessive spill for Pyspark UDF when a row has shrunk
[ https://issues.apache.org/jira/browse/SPARK-22410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22410: Assignee: (was: Apache Spark) > Excessive spill for Pyspark UDF when a row has shrunk > - > > Key: SPARK-22410 > URL: https://issues.apache.org/jira/browse/SPARK-22410 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Reproduced on up-to-date master >Reporter: Clément Stenac >Priority: Minor > > Hi, > The following code processes 900KB of data and outputs around 2MB of data. > However, to process it, Spark needs to spill roughly 12 GB of data. > {code:python} > from pyspark.sql import SparkSession > from pyspark.sql.functions import * > from pyspark.sql.types import * > import json > ss = SparkSession.builder.getOrCreate() > # Create a few lines of data (5 lines). > # Each line is made of a string, and an array of 1 strings > # Total size of data is around 900 KB > lines_of_file = [ "this is a line" for x in xrange(1) ] > file_obj = [ "this_is_a_foldername/this_is_a_filename", lines_of_file ] > data = [ file_obj for x in xrange(5) ] > # Make a two-columns dataframe out of it > small_df = ss.sparkContext.parallelize(data).map(lambda x : (x[0], > x[1])).toDF(["file", "lines"]) > # We then explode the array, so we now have 5 rows in the dataframe, with > 2 columns, the 2nd > # column now has only "this is a line" as content > exploded = small_df.select("file", explode("lines")) > print("Exploded") > print(exploded.explain()) > # Now, just process it with a trivial Pyspark UDF that touches the first > column > # (the one which was not an array) > def split_key(s): > return s.split("/")[1] > split_key_udf = udf(split_key, StringType()) > with_filename = exploded.withColumn("filename", split_key_udf("file")) > # As expected, explain plan is very simple (BatchEval -> Explode -> Project > -> ScanExisting) > print(with_filename.explain()) > # Getting the head will spill 
around 12 GB of data > print(with_filename.head()) > {code} > The spill happens in the HybridRowQueue that is used to merge the part that > went through the Python worker and the part that didn't. > The problem comes from the fact that when it is added to the HybridRowQueue, > the UnsafeRow has a totalSizeInBytes of ~24 (seen by adding debug message > in HybridRowQueue), whereas, since it's after the explode, the actual size of > the row should be in the ~60 bytes range. > My understanding is that the row has retained the size it consumed *prior* to > the explode (at that time, the size of each of the 5 rows was indeed ~24 > bytes. > A workaround is to do exploded.cache() before calling the UDF. The fact of > going through the InMemoryColumnarTableScan "resets" the wrongful size of the > UnsafeRow. > Thanks! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22426) Spark AM launching containers on node where External spark shuffle service failed to initialize
[ https://issues.apache.org/jira/browse/SPARK-22426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235500#comment-16235500 ] Prabhu Joseph commented on SPARK-22426: --- The node and the NodeManager process are fine; the external Spark shuffle service failed to initialize on that NodeManager for some reason, e.g. SPARK-17433 or SPARK-15519 > Spark AM launching containers on node where External spark shuffle service > failed to initialize > --- > > Key: SPARK-22426 > URL: https://issues.apache.org/jira/browse/SPARK-22426 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph >Priority: Major > > When Spark External Shuffle Service on a NodeManager fails, the remote > executors will fail while fetching the data from the executors launched on > this Node. Spark ApplicationMaster should not launch containers on this Node > or not use external shuffle service. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22426) Spark AM launching containers on node where External spark shuffle service failed to initialize
[ https://issues.apache.org/jira/browse/SPARK-22426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235461#comment-16235461 ] Sean Owen commented on SPARK-22426: --- If the node has failed, YARN already can't or won't launch anything on that NodeManager. Are you saying something slightly different? > Spark AM launching containers on node where External spark shuffle service > failed to initialize > --- > > Key: SPARK-22426 > URL: https://issues.apache.org/jira/browse/SPARK-22426 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph >Priority: Major > > When Spark External Shuffle Service on a NodeManager fails, the remote > executors will fail while fetching the data from the executors launched on > this Node. Spark ApplicationMaster should not launch containers on this Node > or not use external shuffle service. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22427) StackOverFlowError when using FPGrowth
lyt created SPARK-22427: --- Summary: StackOverFlowError when using FPGrowth Key: SPARK-22427 URL: https://issues.apache.org/jira/browse/SPARK-22427 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.2.0 Environment: Centos Linux 3.10.0-327.el7.x86_64 java 1.8.0.111 spark 2.2.0 Reporter: lyt Priority: Normal code part: val path = jobConfig.getString("hdfspath") val vectordata = sc.sparkContext.textFile(path) val finaldata = sc.createDataset(vectordata.map(obj => { obj.split(" ") }).filter(arr => arr.length > 0)).toDF("items") val fpg = new FPGrowth() fpg.setMinSupport(minSupport).setItemsCol("items").setMinConfidence(minConfidence) val train = fpg.fit(finaldata) print(train.freqItemsets.count()) print(train.associationRules.count()) train.save("/tmp/FPGModel") And encountered following exception: Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.collect(RDD.scala:935) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278) at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2430) at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2429) at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836) at org.apache.spark.sql.Dataset.count(Dataset.scala:2429) at DataMining.FPGrowth$.runJob(FPGrowth.scala:116) at DataMining.testFPG$.main(FPGrowth.scala:36) at DataMining.testFPG.main(FPGrowth.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at 
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.StackOverflowError at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:616) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:36) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33) at com.esotericsoftware.kr
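The `Caused by: java.lang.StackOverflowError` inside Kryo's recursive `writeClassAndObject` suggests the default JVM thread stack is too small for the serialized structure. A hedged sketch of the commonly suggested mitigation (raising the stack size, as proposed in the thread; this is an assumption, not a confirmed fix for this report):

```scala
// Raise the JVM thread stack size for driver and executors. The driver option
// must be passed at submit time (the driver JVM is already running once
// application code executes), e.g.:
//
//   spark-submit --conf spark.driver.extraJavaOptions=-Xss10m \
//                --conf spark.executor.extraJavaOptions=-Xss10m ...
//
// The executor option can also be set programmatically before the context starts:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-Xss10m") // 10 MB stacks; size is illustrative

val spark = SparkSession.builder().config(conf).getOrCreate()
```

If the error persists, the recursion depth may scale with the data, and a larger `-Xss` value only defers the failure.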
[jira] [Created] (SPARK-22426) Spark AM launching containers on node where External spark shuffle service failed to initialize
Prabhu Joseph created SPARK-22426: - Summary: Spark AM launching containers on node where External spark shuffle service failed to initialize Key: SPARK-22426 URL: https://issues.apache.org/jira/browse/SPARK-22426 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.6.3 Reporter: Prabhu Joseph Priority: Major When Spark External Shuffle Service on a NodeManager fails, the remote executors will fail while fetching the data from the executors launched on this Node. Spark ApplicationMaster should not launch containers on this Node or not use external shuffle service. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22426) Spark AM launching containers on node where External spark shuffle service failed to initialize
[ https://issues.apache.org/jira/browse/SPARK-22426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated SPARK-22426: -- Component/s: YARN > Spark AM launching containers on node where External spark shuffle service > failed to initialize > --- > > Key: SPARK-22426 > URL: https://issues.apache.org/jira/browse/SPARK-22426 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph >Priority: Major > > When Spark External Shuffle Service on a NodeManager fails, the remote > executors will fail while fetching the data from the executors launched on > this Node. Spark ApplicationMaster should not launch containers on this Node > or not use external shuffle service. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21911) Parallel Model Evaluation for ML Tuning: PySpark
[ https://issues.apache.org/jira/browse/SPARK-21911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235444#comment-16235444 ] Apache Spark commented on SPARK-21911: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/19641 > Parallel Model Evaluation for ML Tuning: PySpark > > > Key: SPARK-21911 > URL: https://issues.apache.org/jira/browse/SPARK-21911 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 2.3.0 > > > Add parallelism support for ML tuning in pyspark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22102) Reusing CliSessionState didn't set correct METASTOREWAREHOUSE
[ https://issues.apache.org/jira/browse/SPARK-22102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-22102. - Resolution: Cannot Reproduce Master branch cannot reproduce > Reusing CliSessionState didn't set correct METASTOREWAREHOUSE > - > > Key: SPARK-22102 > URL: https://issues.apache.org/jira/browse/SPARK-22102 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Yuming Wang >Priority: Major > > It shows the warehouse dir is > {{file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse}}, > but actually the warehouse dir is {{/user/hive/warehouse}} when create table. > {noformat} > [root@wangyuming01 spark-2.3.0-SNAPSHOT-bin-2.6.5]# bin/spark-sql > 17/09/22 21:32:40 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > log4j:WARN No appenders could be found for logger > (org.apache.hadoop.conf.Configuration). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more > info. 
> Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 17/09/22 21:32:45 INFO SparkContext: Running Spark version 2.3.0-SNAPSHOT > 17/09/22 21:32:45 INFO SparkContext: Submitted application: > SparkSQL::192.168.77.55 > 17/09/22 21:32:45 INFO SecurityManager: Changing view acls to: root > 17/09/22 21:32:45 INFO SecurityManager: Changing modify acls to: root > 17/09/22 21:32:45 INFO SecurityManager: Changing view acls groups to: > 17/09/22 21:32:45 INFO SecurityManager: Changing modify acls groups to: > 17/09/22 21:32:45 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(root); groups > with view permissions: Set(); users with modify permissions: Set(root); > groups with modify permissions: Set() > 17/09/22 21:32:45 INFO Utils: Successfully started service 'sparkDriver' on > port 43676. > 17/09/22 21:32:45 INFO SparkEnv: Registering MapOutputTracker > 17/09/22 21:32:45 INFO SparkEnv: Registering BlockManagerMaster > 17/09/22 21:32:45 INFO BlockManagerMasterEndpoint: Using > org.apache.spark.storage.DefaultTopologyMapper for getting topology > information > 17/09/22 21:32:45 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint > up > 17/09/22 21:32:45 INFO DiskBlockManager: Created local directory at > /tmp/blockmgr-f536509f-4e3e-4e08-ae7b-8d9499f8e4a4 > 17/09/22 21:32:45 INFO MemoryStore: MemoryStore started with capacity 366.3 MB > 17/09/22 21:32:45 INFO SparkEnv: Registering OutputCommitCoordinator > 17/09/22 21:32:45 WARN Utils: Service 'SparkUI' could not bind on port 4040. > Attempting port 4041. > 17/09/22 21:32:45 INFO Utils: Successfully started service 'SparkUI' on port > 4041. 
> 17/09/22 21:32:45 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at > http://wangyuming01:4041 > 17/09/22 21:32:45 INFO Executor: Starting executor ID driver on host localhost > 17/09/22 21:32:45 INFO Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44426. > 17/09/22 21:32:45 INFO NettyBlockTransferService: Server created on > wangyuming01:44426 > 17/09/22 21:32:45 INFO BlockManager: Using > org.apache.spark.storage.RandomBlockReplicationPolicy for block replication > policy > 17/09/22 21:32:45 INFO BlockManagerMaster: Registering BlockManager > BlockManagerId(driver, wangyuming01, 44426, None) > 17/09/22 21:32:45 INFO BlockManagerMasterEndpoint: Registering block manager > wangyuming01:44426 with 366.3 MB RAM, BlockManagerId(driver, wangyuming01, > 44426, None) > 17/09/22 21:32:45 INFO BlockManagerMaster: Registered BlockManager > BlockManagerId(driver, wangyuming01, 44426, None) > 17/09/22 21:32:45 INFO BlockManager: Initialized BlockManager: > BlockManagerId(driver, wangyuming01, 44426, None) > 17/09/22 21:32:45 INFO SharedState: Setting hive.metastore.warehouse.dir > ('null') to the value of spark.sql.warehouse.dir > ('file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse'). > 17/09/22 21:32:45 INFO SharedState: Warehouse path is > 'file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse'. > 17/09/22 21:32:46 INFO HiveUtils: Initializing HiveMetastoreConnection > version 1.2.1 using Spark classes. > 17/09/22 21:32:46 INFO HiveClientImpl: Warehouse location for Hive client > (version 1.2.2) is > file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse > 17/09/22 21:32:46 INFO metastore: Mestastore configuration > hive.metastore.warehouse.dir changed from /user/hive/warehouse to > file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse > 17/09/22 21:
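As the log above shows, `SharedState` overrides `hive.metastore.warehouse.dir` with the default local `spark-warehouse` path when `spark.sql.warehouse.dir` is unset. A possible mitigation (an assumption, not verified against this report) is to set the warehouse directory explicitly before the session is created:

```scala
// Sketch: pin spark.sql.warehouse.dir so SharedState does not fall back to the
// local spark-warehouse default. The path below mirrors the Hive default from
// the report and is illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("warehouse-dir-example")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()
```

The same setting can be passed to the CLI, e.g. `bin/spark-sql --conf spark.sql.warehouse.dir=/user/hive/warehouse`.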
[jira] [Reopened] (SPARK-22102) Reusing CliSessionState didn't set correct METASTOREWAREHOUSE
[ https://issues.apache.org/jira/browse/SPARK-22102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reopened SPARK-22102: - > Reusing CliSessionState didn't set correct METASTOREWAREHOUSE > - > > Key: SPARK-22102 > URL: https://issues.apache.org/jira/browse/SPARK-22102 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Yuming Wang >Priority: Major > > It shows the warehouse dir is > {{file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse}}, > but actually the warehouse dir is {{/user/hive/warehouse}} when create table. > {noformat} > [root@wangyuming01 spark-2.3.0-SNAPSHOT-bin-2.6.5]# bin/spark-sql > 17/09/22 21:32:40 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > log4j:WARN No appenders could be found for logger > (org.apache.hadoop.conf.Configuration). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more > info. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 17/09/22 21:32:45 INFO SparkContext: Running Spark version 2.3.0-SNAPSHOT > 17/09/22 21:32:45 INFO SparkContext: Submitted application: > SparkSQL::192.168.77.55 > 17/09/22 21:32:45 INFO SecurityManager: Changing view acls to: root > 17/09/22 21:32:45 INFO SecurityManager: Changing modify acls to: root > 17/09/22 21:32:45 INFO SecurityManager: Changing view acls groups to: > 17/09/22 21:32:45 INFO SecurityManager: Changing modify acls groups to: > 17/09/22 21:32:45 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(root); groups > with view permissions: Set(); users with modify permissions: Set(root); > groups with modify permissions: Set() > 17/09/22 21:32:45 INFO Utils: Successfully started service 'sparkDriver' on > port 43676. 
> 17/09/22 21:32:45 INFO SparkEnv: Registering MapOutputTracker > 17/09/22 21:32:45 INFO SparkEnv: Registering BlockManagerMaster > 17/09/22 21:32:45 INFO BlockManagerMasterEndpoint: Using > org.apache.spark.storage.DefaultTopologyMapper for getting topology > information > 17/09/22 21:32:45 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint > up > 17/09/22 21:32:45 INFO DiskBlockManager: Created local directory at > /tmp/blockmgr-f536509f-4e3e-4e08-ae7b-8d9499f8e4a4 > 17/09/22 21:32:45 INFO MemoryStore: MemoryStore started with capacity 366.3 MB > 17/09/22 21:32:45 INFO SparkEnv: Registering OutputCommitCoordinator > 17/09/22 21:32:45 WARN Utils: Service 'SparkUI' could not bind on port 4040. > Attempting port 4041. > 17/09/22 21:32:45 INFO Utils: Successfully started service 'SparkUI' on port > 4041. > 17/09/22 21:32:45 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at > http://wangyuming01:4041 > 17/09/22 21:32:45 INFO Executor: Starting executor ID driver on host localhost > 17/09/22 21:32:45 INFO Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44426. 
> 17/09/22 21:32:45 INFO NettyBlockTransferService: Server created on > wangyuming01:44426 > 17/09/22 21:32:45 INFO BlockManager: Using > org.apache.spark.storage.RandomBlockReplicationPolicy for block replication > policy > 17/09/22 21:32:45 INFO BlockManagerMaster: Registering BlockManager > BlockManagerId(driver, wangyuming01, 44426, None) > 17/09/22 21:32:45 INFO BlockManagerMasterEndpoint: Registering block manager > wangyuming01:44426 with 366.3 MB RAM, BlockManagerId(driver, wangyuming01, > 44426, None) > 17/09/22 21:32:45 INFO BlockManagerMaster: Registered BlockManager > BlockManagerId(driver, wangyuming01, 44426, None) > 17/09/22 21:32:45 INFO BlockManager: Initialized BlockManager: > BlockManagerId(driver, wangyuming01, 44426, None) > 17/09/22 21:32:45 INFO SharedState: Setting hive.metastore.warehouse.dir > ('null') to the value of spark.sql.warehouse.dir > ('file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse'). > 17/09/22 21:32:45 INFO SharedState: Warehouse path is > 'file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse'. > 17/09/22 21:32:46 INFO HiveUtils: Initializing HiveMetastoreConnection > version 1.2.1 using Spark classes. > 17/09/22 21:32:46 INFO HiveClientImpl: Warehouse location for Hive client > (version 1.2.2) is > file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse > 17/09/22 21:32:46 INFO metastore: Mestastore configuration > hive.metastore.warehouse.dir changed from /user/hive/warehouse to > file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse > 17/09/22 21:32:46 INFO HiveMetaStore: 0: Shutting down the object store... >
[jira] [Resolved] (SPARK-22102) Reusing CliSessionState didn't set correct METASTOREWAREHOUSE
[ https://issues.apache.org/jira/browse/SPARK-22102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-22102. - Resolution: Fixed > Reusing CliSessionState didn't set correct METASTOREWAREHOUSE > - > > Key: SPARK-22102 > URL: https://issues.apache.org/jira/browse/SPARK-22102 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Yuming Wang >Priority: Major > > It shows the warehouse dir is > {{file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse}}, > but actually the warehouse dir is {{/user/hive/warehouse}} when create table. > {noformat} > [root@wangyuming01 spark-2.3.0-SNAPSHOT-bin-2.6.5]# bin/spark-sql > 17/09/22 21:32:40 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > log4j:WARN No appenders could be found for logger > (org.apache.hadoop.conf.Configuration). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more > info. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 17/09/22 21:32:45 INFO SparkContext: Running Spark version 2.3.0-SNAPSHOT > 17/09/22 21:32:45 INFO SparkContext: Submitted application: > SparkSQL::192.168.77.55 > 17/09/22 21:32:45 INFO SecurityManager: Changing view acls to: root > 17/09/22 21:32:45 INFO SecurityManager: Changing modify acls to: root > 17/09/22 21:32:45 INFO SecurityManager: Changing view acls groups to: > 17/09/22 21:32:45 INFO SecurityManager: Changing modify acls groups to: > 17/09/22 21:32:45 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(root); groups > with view permissions: Set(); users with modify permissions: Set(root); > groups with modify permissions: Set() > 17/09/22 21:32:45 INFO Utils: Successfully started service 'sparkDriver' on > port 43676. 
> 17/09/22 21:32:45 INFO SparkEnv: Registering MapOutputTracker > 17/09/22 21:32:45 INFO SparkEnv: Registering BlockManagerMaster > 17/09/22 21:32:45 INFO BlockManagerMasterEndpoint: Using > org.apache.spark.storage.DefaultTopologyMapper for getting topology > information > 17/09/22 21:32:45 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint > up > 17/09/22 21:32:45 INFO DiskBlockManager: Created local directory at > /tmp/blockmgr-f536509f-4e3e-4e08-ae7b-8d9499f8e4a4 > 17/09/22 21:32:45 INFO MemoryStore: MemoryStore started with capacity 366.3 MB > 17/09/22 21:32:45 INFO SparkEnv: Registering OutputCommitCoordinator > 17/09/22 21:32:45 WARN Utils: Service 'SparkUI' could not bind on port 4040. > Attempting port 4041. > 17/09/22 21:32:45 INFO Utils: Successfully started service 'SparkUI' on port > 4041. > 17/09/22 21:32:45 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at > http://wangyuming01:4041 > 17/09/22 21:32:45 INFO Executor: Starting executor ID driver on host localhost > 17/09/22 21:32:45 INFO Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44426. 
> 17/09/22 21:32:45 INFO NettyBlockTransferService: Server created on > wangyuming01:44426 > 17/09/22 21:32:45 INFO BlockManager: Using > org.apache.spark.storage.RandomBlockReplicationPolicy for block replication > policy > 17/09/22 21:32:45 INFO BlockManagerMaster: Registering BlockManager > BlockManagerId(driver, wangyuming01, 44426, None) > 17/09/22 21:32:45 INFO BlockManagerMasterEndpoint: Registering block manager > wangyuming01:44426 with 366.3 MB RAM, BlockManagerId(driver, wangyuming01, > 44426, None) > 17/09/22 21:32:45 INFO BlockManagerMaster: Registered BlockManager > BlockManagerId(driver, wangyuming01, 44426, None) > 17/09/22 21:32:45 INFO BlockManager: Initialized BlockManager: > BlockManagerId(driver, wangyuming01, 44426, None) > 17/09/22 21:32:45 INFO SharedState: Setting hive.metastore.warehouse.dir > ('null') to the value of spark.sql.warehouse.dir > ('file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse'). > 17/09/22 21:32:45 INFO SharedState: Warehouse path is > 'file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse'. > 17/09/22 21:32:46 INFO HiveUtils: Initializing HiveMetastoreConnection > version 1.2.1 using Spark classes. > 17/09/22 21:32:46 INFO HiveClientImpl: Warehouse location for Hive client > (version 1.2.2) is > file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse > 17/09/22 21:32:46 INFO metastore: Mestastore configuration > hive.metastore.warehouse.dir changed from /user/hive/warehouse to > file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse > 17/09/22 21:32:46 INFO HiveMetaStore: 0: Shutting down
[jira] [Commented] (SPARK-16986) "Started" time, "Completed" time and "Last Updated" time in history server UI are not user local time
[ https://issues.apache.org/jira/browse/SPARK-16986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235406#comment-16235406 ] Apache Spark commented on SPARK-16986: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/19640 > "Started" time, "Completed" time and "Last Updated" time in history server UI > are not user local time > - > > Key: SPARK-16986 > URL: https://issues.apache.org/jira/browse/SPARK-16986 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Weiqing Yang >Priority: Minor > > Currently, "Started" time, "Completed" time and "Last Updated" time in > history server UI are GMT. They should be the user local time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22425) add output files information to EventLogger
Long Tian created SPARK-22425: - Summary: add output files information to EventLogger Key: SPARK-22425 URL: https://issues.apache.org/jira/browse/SPARK-22425 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.0 Reporter: Long Tian Priority: Normal We can get all the input files from *EventLogger* when *spark.eventLog.enabled* is *true*, but there is no output file information. Is it possible to add output file information to *EventLogger*? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
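For context, a minimal sketch of the configuration the reporter refers to; the log directory path is illustrative, and per the reporter the resulting event log carries input-file information but no output-file equivalent:

```scala
// Enable event logging so EventLoggingListener records job/stage/task events
// to a persistent directory (readable later by the history server).
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///spark-history") // illustrative path

val spark = SparkSession.builder().config(conf).getOrCreate()
```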
[jira] [Comment Edited] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235384#comment-16235384 ] chengning edited comment on SPARK-22424 at 11/2/17 8:41 AM: sorry, my picture not display, I post it again. [^1.jpg], as shows in this picture, the batch "2017/11/01 16:40:55" not finished was (Author: chengning): sorry, my picture not display, I post it again. [^1.jpg] > Task not finished for a long time in monitor UI, but I found it finished in > logs > > > Key: SPARK-22424 > URL: https://issues.apache.org/jira/browse/SPARK-22424 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: chengning >Priority: Blocking > Attachments: 1.jpg, 1.png, C33oL.jpg > > > Task not finished for a long time in monitor UI, but I found it finished in > logs > Thanks a lot. > !https://i.stack.imgur.com/C33oL.jpg! > !C33oL.jpg|thumbnail! > executor log: > 17/09/29 17:32:28 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 213492 > 17/09/29 17:32:28 INFO executor.Executor: Running task 52.0 in stage 2468.0 > (TID 213492) > 17/09/29 17:32:28 INFO storage.ShuffleBlockFetcherIterator: Getting 30 > non-empty blocks out of 30 blocks > 17/09/29 17:32:28 INFO storage.ShuffleBlockFetcherIterator: Started 29 remote > fetches in 1 ms > 17:32:28.447: tcPartition=7 ms > 17/09/29 17:32:28 INFO executor.Executor: Finished task 52.0 in stage 2468.0 > (TID 213492). 
2755 bytes result sent to driver > driver log:: > 17/09/29 17:32:28 INFO scheduler.TaskSetManager: Starting task 52.0 in stage > 2468.0 (TID 213492, HMGQXD2, executor 1, partition 52, PROCESS_LOCAL, 6386 > bytes) > 17/09/29 17:32:28 INFO scheduler.TaskSetManager: Finished task 52.0 in stage > 2468.0 (TID 213492) in 24 ms on HMGQXD2 (executor 1) (53/200) > 17/09/29 17:32:28 INFO cluster.YarnScheduler: Removed TaskSet 2468.0, whose > tasks have all completed, from pool > 17/09/29 17:32:28 INFO scheduler.DAGScheduler: ResultStage 2468 > (foreachPartition at Counter2.java:152) finished in 0.255 s > 17/09/29 17:32:28 INFO scheduler.DAGScheduler: Job 1647 finished: > foreachPartition at Counter2.java:152, took 0.415256 s -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235384#comment-16235384 ] chengning edited comment on SPARK-22424 at 11/2/17 8:31 AM: sorry, my picture not display, I post it again. [^1.jpg] was (Author: chengning): !1.jpg|thumbnail! sorry, my picture not display, I post it again. > Task not finished for a long time in monitor UI, but I found it finished in > logs > > > Key: SPARK-22424 > URL: https://issues.apache.org/jira/browse/SPARK-22424 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: chengning >Priority: Blocking > Attachments: 1.jpg, 1.png, C33oL.jpg > > > Task not finished for a long time in monitor UI, but I found it finished in > logs > Thanks a lot. > !https://i.stack.imgur.com/C33oL.jpg! > !C33oL.jpg|thumbnail! > executor log: > 17/09/29 17:32:28 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 213492 > 17/09/29 17:32:28 INFO executor.Executor: Running task 52.0 in stage 2468.0 > (TID 213492) > 17/09/29 17:32:28 INFO storage.ShuffleBlockFetcherIterator: Getting 30 > non-empty blocks out of 30 blocks > 17/09/29 17:32:28 INFO storage.ShuffleBlockFetcherIterator: Started 29 remote > fetches in 1 ms > 17:32:28.447: tcPartition=7 ms > 17/09/29 17:32:28 INFO executor.Executor: Finished task 52.0 in stage 2468.0 > (TID 213492). 
2755 bytes result sent to driver > driver log:: > 17/09/29 17:32:28 INFO scheduler.TaskSetManager: Starting task 52.0 in stage > 2468.0 (TID 213492, HMGQXD2, executor 1, partition 52, PROCESS_LOCAL, 6386 > bytes) > 17/09/29 17:32:28 INFO scheduler.TaskSetManager: Finished task 52.0 in stage > 2468.0 (TID 213492) in 24 ms on HMGQXD2 (executor 1) (53/200) > 17/09/29 17:32:28 INFO cluster.YarnScheduler: Removed TaskSet 2468.0, whose > tasks have all completed, from pool > 17/09/29 17:32:28 INFO scheduler.DAGScheduler: ResultStage 2468 > (foreachPartition at Counter2.java:152) finished in 0.255 s > 17/09/29 17:32:28 INFO scheduler.DAGScheduler: Job 1647 finished: > foreachPartition at Counter2.java:152, took 0.415256 s -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235384#comment-16235384 ] chengning commented on SPARK-22424: --- !1.jpg|thumbnail! sorry, my picture not display, I post it again. > Task not finished for a long time in monitor UI, but I found it finished in > logs > > > Key: SPARK-22424 > URL: https://issues.apache.org/jira/browse/SPARK-22424 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: chengning >Priority: Blocking > Attachments: 1.jpg, 1.png, C33oL.jpg > > > Task not finished for a long time in monitor UI, but I found it finished in > logs > Thanks a lot. > !https://i.stack.imgur.com/C33oL.jpg! > !C33oL.jpg|thumbnail! > executor log: > 17/09/29 17:32:28 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 213492 > 17/09/29 17:32:28 INFO executor.Executor: Running task 52.0 in stage 2468.0 > (TID 213492) > 17/09/29 17:32:28 INFO storage.ShuffleBlockFetcherIterator: Getting 30 > non-empty blocks out of 30 blocks > 17/09/29 17:32:28 INFO storage.ShuffleBlockFetcherIterator: Started 29 remote > fetches in 1 ms > 17:32:28.447: tcPartition=7 ms > 17/09/29 17:32:28 INFO executor.Executor: Finished task 52.0 in stage 2468.0 > (TID 213492). 
2755 bytes result sent to driver > driver log:: > 17/09/29 17:32:28 INFO scheduler.TaskSetManager: Starting task 52.0 in stage > 2468.0 (TID 213492, HMGQXD2, executor 1, partition 52, PROCESS_LOCAL, 6386 > bytes) > 17/09/29 17:32:28 INFO scheduler.TaskSetManager: Finished task 52.0 in stage > 2468.0 (TID 213492) in 24 ms on HMGQXD2 (executor 1) (53/200) > 17/09/29 17:32:28 INFO cluster.YarnScheduler: Removed TaskSet 2468.0, whose > tasks have all completed, from pool > 17/09/29 17:32:28 INFO scheduler.DAGScheduler: ResultStage 2468 > (foreachPartition at Counter2.java:152) finished in 0.255 s > 17/09/29 17:32:28 INFO scheduler.DAGScheduler: Job 1647 finished: > foreachPartition at Counter2.java:152, took 0.415256 s -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chengning updated SPARK-22424: -- Attachment: 1.jpg
[jira] [Comment Edited] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235379#comment-16235379 ] chengning edited comment on SPARK-22424 at 11/2/17 8:25 AM: I have another picture that shows it clearly: !1.png|thumbnail!
executor:
17/11/01 16:40:55 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 640603
17/11/01 16:40:55 INFO executor.Executor: Running task 3.0 in stage 8218.0 (TID 640603)
17/11/01 16:40:55 INFO spark.MapOutputTrackerWorker: Updating epoch to 2319 and clearing cache
17/11/01 16:40:55 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 8218
17/11/01 16:40:55 INFO memory.MemoryStore: Block broadcast_8218_piece0 stored as bytes in memory (estimated size 15.2 KB, free 2.2 GB)
17/11/01 16:40:55 INFO broadcast.TorrentBroadcast: Reading broadcast variable 8218 took 6 ms
17/11/01 16:40:55 INFO memory.MemoryStore: Block broadcast_8218 stored as values in memory (estimated size 31.5 KB, free 2.2 GB)
17/11/01 16:40:55 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 2318, fetching them
17/11/01 16:40:55 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@10.110.155.57:33084)
17/11/01 16:40:55 INFO spark.MapOutputTrackerWorker: Got the output locations
17/11/01 16:40:55 INFO storage.ShuffleBlockFetcherIterator: Getting 28 non-empty blocks out of 30 blocks
17/11/01 16:40:55 INFO storage.ShuffleBlockFetcherIterator: Started 27 remote fetches in 3 ms
17/11/01 16:40:55 INFO codegen.CodeGenerator: Code generated in 21.652093 ms
17/11/01 16:40:55 INFO executor.Executor: Finished task 3.0 in stage 8218.0 (TID 640603). 3554 bytes result sent to driver
17/11/01 16:40:55 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 8218.0 (TID 640603, Letv6CU621YYPS, executor 12, partition 3, PROCESS_LOCAL, 6324 bytes)
17/11/01 16:40:55 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 8218.0 (TID 640603) in 167 ms on Letv6CU621YYPS (executor 12) (16/200)
17/11/01 16:40:55 ERROR scheduler.LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
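The LiveListenerBus error above means the driver is dropping SparkListener events, which is exactly how a task can show as unfinished in the monitor UI even though the executor and driver logs show it completed: the UI is fed by those dropped events. In Spark 2.1/2.2 the listener bus queue capacity is controlled by `spark.scheduler.listenerbus.eventqueue.size` (default 10000). A sketch of raising it at submit time; the application class, jar name, and chosen value are illustrative, not from this issue:

```shell
# Hypothetical submit command; only the --conf line is the point here.
# A larger queue costs driver memory; if events are still dropped, a slow
# SparkListener is the real bottleneck.
spark-submit \
  --class com.example.Counter2 \
  --conf spark.scheduler.listenerbus.eventqueue.size=100000 \
  counter-app.jar
```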
[jira] [Commented] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235381#comment-16235381 ] Sean Owen commented on SPARK-22424: --- I'm not following. You're circling different tasks. But again the one you mention in the logs shows as completed in both places.
[jira] [Commented] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235379#comment-16235379 ] chengning commented on SPARK-22424: --- I have another picture that shows it clearly: !1.png|thumbnail!
executor:
17/11/01 16:40:55 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 640603
17/11/01 16:40:55 INFO executor.Executor: Running task 3.0 in stage 8218.0 (TID 640603)
17/11/01 16:40:55 INFO spark.MapOutputTrackerWorker: Updating epoch to 2319 and clearing cache
17/11/01 16:40:55 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 8218
17/11/01 16:40:55 INFO memory.MemoryStore: Block broadcast_8218_piece0 stored as bytes in memory (estimated size 15.2 KB, free 2.2 GB)
17/11/01 16:40:55 INFO broadcast.TorrentBroadcast: Reading broadcast variable 8218 took 6 ms
17/11/01 16:40:55 INFO memory.MemoryStore: Block broadcast_8218 stored as values in memory (estimated size 31.5 KB, free 2.2 GB)
17/11/01 16:40:55 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 2318, fetching them
17/11/01 16:40:55 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@10.110.155.57:33084)
17/11/01 16:40:55 INFO spark.MapOutputTrackerWorker: Got the output locations
17/11/01 16:40:55 INFO storage.ShuffleBlockFetcherIterator: Getting 28 non-empty blocks out of 30 blocks
17/11/01 16:40:55 INFO storage.ShuffleBlockFetcherIterator: Started 27 remote fetches in 3 ms
17/11/01 16:40:55 INFO codegen.CodeGenerator: Code generated in 21.652093 ms
17/11/01 16:40:55 INFO executor.Executor: Finished task 3.0 in stage 8218.0 (TID 640603). 3554 bytes result sent to driver
17/11/01 16:40:55 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 8218.0 (TID 640603, Letv6CU621YYPS, executor 12, partition 3, PROCESS_LOCAL, 6324 bytes)
17/11/01 16:40:55 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 8218.0 (TID 640603) in 167 ms on Letv6CU621YYPS (executor 12) (16/200)
17/11/01 16:40:55 ERROR scheduler.LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
[jira] [Updated] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chengning updated SPARK-22424: -- Attachment: 1.png
[jira] [Commented] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235365#comment-16235365 ] chengning commented on SPARK-22424: --- Oh, I see that the state really is SUCCESS, but the Event Timeline shows it as not executed; I guess that is why the batch "2017/09/29 17:32:28" did not finish.
[jira] [Updated] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chengning updated SPARK-22424: -- Description:
Task not finished for a long time in monitor UI, but I found it finished in logs
Thanks a lot.
!https://i.stack.imgur.com/C33oL.jpg!
!C33oL.jpg|thumbnail!
executor log:
17/09/29 17:32:28 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 213492
17/09/29 17:32:28 INFO executor.Executor: Running task 52.0 in stage 2468.0 (TID 213492)
17/09/29 17:32:28 INFO storage.ShuffleBlockFetcherIterator: Getting 30 non-empty blocks out of 30 blocks
17/09/29 17:32:28 INFO storage.ShuffleBlockFetcherIterator: Started 29 remote fetches in 1 ms
17:32:28.447: tcPartition=7 ms
17/09/29 17:32:28 INFO executor.Executor: Finished task 52.0 in stage 2468.0 (TID 213492). 2755 bytes result sent to driver
driver log:
17/09/29 17:32:28 INFO scheduler.TaskSetManager: Starting task 52.0 in stage 2468.0 (TID 213492, HMGQXD2, executor 1, partition 52, PROCESS_LOCAL, 6386 bytes)
17/09/29 17:32:28 INFO scheduler.TaskSetManager: Finished task 52.0 in stage 2468.0 (TID 213492) in 24 ms on HMGQXD2 (executor 1) (53/200)
17/09/29 17:32:28 INFO cluster.YarnScheduler: Removed TaskSet 2468.0, whose tasks have all completed, from pool
17/09/29 17:32:28 INFO scheduler.DAGScheduler: ResultStage 2468 (foreachPartition at Counter2.java:152) finished in 0.255 s
17/09/29 17:32:28 INFO scheduler.DAGScheduler: Job 1647 finished: foreachPartition at Counter2.java:152, took 0.415256 s
was: (identical, minus the !C33oL.jpg|thumbnail! line)
[jira] [Updated] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chengning updated SPARK-22424: -- Attachment: C33oL.jpg
[jira] [Commented] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235344#comment-16235344 ] Sean Owen commented on SPARK-22424: --- This shows task 52 finished in both logs and UI.
[jira] [Created] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
chengning created SPARK-22424:
Summary: Task not finished for a long time in monitor UI, but I found it finished in logs
Key: SPARK-22424
URL: https://issues.apache.org/jira/browse/SPARK-22424
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.1.0
Reporter: chengning
Priority: Blocking
[jira] [Updated] (SPARK-22423) Scala test source files like TestHiveSingleton.scala should be in scala source root
[ https://issues.apache.org/jira/browse/SPARK-22423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22423: -- Summary: Scala test source files like TestHiveSingleton.scala should be in scala source root (was: The TestHiveSingleton.scala file should be in scala directory) There are several files in the wrong tree. Could you try fixing all of these?
./mllib/src/test/java/org/apache/spark/ml/util/IdentifiableSuite.scala
./streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala
./streaming/src/test/java/org/apache/spark/streaming/api/java/JavaStreamingListenerWrapperSuite.scala
./sql/hive/src/test/java/org/apache/spark/sql/hive/test/TestHiveSingleton.scala
> Scala test source files like TestHiveSingleton.scala should be in scala source root
> ----
> Key: SPARK-22423
> URL: https://issues.apache.org/jira/browse/SPARK-22423
> Project: Spark
> Issue Type: Test
> Components: Tests
> Affects Versions: 2.2.0
> Reporter: xubo245
> Priority: Minor
>
> The TestHiveSingleton.scala file should be in the scala directory, not in the java directory.
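A list like the one above can be produced mechanically; here is a sketch, run from the repository root, of finding Scala sources that live under a `java/` test-source tree, plus a commented-out `git mv` (the destination path is the parallel `scala/` tree, assumed from the standard Maven layout):

```shell
# Find Scala sources sitting under a java/ test-source root.
find . -path '*/src/test/java/*' -name '*.scala'

# Moving one of them into the parallel scala/ tree, preserving git history:
# git mv sql/hive/src/test/java/org/apache/spark/sql/hive/test/TestHiveSingleton.scala \
#        sql/hive/src/test/scala/org/apache/spark/sql/hive/test/TestHiveSingleton.scala
```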
[jira] [Assigned] (SPARK-22423) The TestHiveSingleton.scala file should be in scala directory
[ https://issues.apache.org/jira/browse/SPARK-22423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22423: Assignee: Apache Spark
[jira] [Commented] (SPARK-22423) The TestHiveSingleton.scala file should be in scala directory
[ https://issues.apache.org/jira/browse/SPARK-22423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235336#comment-16235336 ] Apache Spark commented on SPARK-22423: -- User 'xubo245' has created a pull request for this issue: https://github.com/apache/spark/pull/19639
[jira] [Assigned] (SPARK-22423) The TestHiveSingleton.scala file should be in scala directory
[ https://issues.apache.org/jira/browse/SPARK-22423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22423: Assignee: (was: Apache Spark)
[jira] [Resolved] (SPARK-22419) Hive and Hive Thriftserver jars missing from "without hadoop" build
[ https://issues.apache.org/jira/browse/SPARK-22419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22419. --- Resolution: Not A Problem Fix Version/s: (was: 2.1.1) This is on purpose anyway, and questions should go to the mailing list.
> Hive and Hive Thriftserver jars missing from "without hadoop" build
> ----
> Key: SPARK-22419
> URL: https://issues.apache.org/jira/browse/SPARK-22419
> Project: Spark
> Issue Type: Question
> Components: Build
> Affects Versions: 2.1.1
> Reporter: Adam Kramer
> Priority: Minor
>
> The "without hadoop" binary distribution does not have Hive-related libraries in the jars directory. This may be due to Hive being tied to major releases of Hadoop. My project requires Hadoop 2.8, so the "without hadoop" version seemed the best option. Should I use make-distribution.sh instead?
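For the make-distribution.sh route the reporter asks about, a build can omit the bundled Hadoop jars while still including the Hive and Thrift server modules by enabling the corresponding Maven profiles. A sketch, run from a Spark source checkout; the profile names come from the Spark build, while the distribution name is illustrative:

```shell
# Build a Spark distribution without bundled Hadoop jars but with Hive support.
./dev/make-distribution.sh --name hadoop-provided-hive --tgz \
  -Phive -Phive-thriftserver -Phadoop-provided
```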