[jira] [Commented] (SPARK-22423) Scala test source files like TestHiveSingleton.scala should be in scala source root
[ https://issues.apache.org/jira/browse/SPARK-22423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237176#comment-16237176 ]

xubo245 commented on SPARK-22423:
---------------------------------

OK, I will fix it.

> Scala test source files like TestHiveSingleton.scala should be in scala source root
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-22423
>                 URL: https://issues.apache.org/jira/browse/SPARK-22423
>             Project: Spark
>          Issue Type: Test
>          Components: Tests
>    Affects Versions: 2.2.0
>           Reporter: xubo245
>           Priority: Minor
>
> The TestHiveSingleton.scala file should be in the scala source directory, not in the java directory.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22427) StackOverFlowError when using FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-22427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237174#comment-16237174 ]

yuhao yang commented on SPARK-22427:
------------------------------------

Could you please try increasing the stack size, e.g. with -Xss10m?

> StackOverFlowError when using FPGrowth
> --------------------------------------
>
>                 Key: SPARK-22427
>                 URL: https://issues.apache.org/jira/browse/SPARK-22427
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.2.0
>        Environment: CentOS Linux 3.10.0-327.el7.x86_64
>                     java 1.8.0.111
>                     spark 2.2.0
>           Reporter: lyt
>           Priority: Normal
>
> Code part:
> {code}
> val path = jobConfig.getString("hdfspath")
> val vectordata = sc.sparkContext.textFile(path)
> val finaldata = sc.createDataset(vectordata.map(obj => {
>   obj.split(" ")
> }).filter(arr => arr.length > 0)).toDF("items")
> val fpg = new FPGrowth()
> fpg.setMinSupport(minSupport).setItemsCol("items").setMinConfidence(minConfidence)
> val train = fpg.fit(finaldata)
> print(train.freqItemsets.count())
> print(train.associationRules.count())
> train.save("/tmp/FPGModel")
> {code}
> And encountered the following exception:
> {code}
> Driver stacktrace:
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>   at scala.Option.foreach(Option.scala:257)
>   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>   at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
>   at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2430)
>   at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2429)
>   at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836)
>   at org.apache.spark.sql.Dataset.count(Dataset.scala:2429)
>   at DataMining.FPGrowth$.runJob(FPGrowth.scala:116)
>   at DataMining.testFPG$.main(FPGrowth.scala:36)
>   at DataMining.testFPG.main(FPGrowth.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
>   at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
> {code}
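[Editor's sketch] The -Xss10m suggestion above can be passed to both the driver and executor JVMs through Spark's extraJavaOptions settings. The class name below is taken from the reporter's stack trace; the jar name is a placeholder, and in client mode the driver option may need to be set via --driver-java-options instead:

```shell
# Hypothetical spark-submit invocation raising the JVM thread stack size
# to 10 MB on both driver and executors (jar name assumed).
spark-submit \
  --class DataMining.testFPG \
  --conf "spark.driver.extraJavaOptions=-Xss10m" \
  --conf "spark.executor.extraJavaOptions=-Xss10m" \
  fpgrowth-assembly.jar
```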
[jira] [Updated] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-22211:
----------------------------
Target Version/s: 2.2.1, 2.3.0

> LimitPushDown optimization for FullOuterJoin generates wrong results
> --------------------------------------------------------------------
>
>                 Key: SPARK-22211
>                 URL: https://issues.apache.org/jira/browse/SPARK-22211
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>        Environment: on community.cloude.databrick.com
>                     Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11)
>           Reporter: Benyi Wang
>           Priority: Major
>
> LimitPushDown pushes LocalLimit to one side of a FullOuterJoin, but this may generate a wrong result. Assume we use limit(1) and the LocalLimit is pushed to the left side, selecting id=999; if the right side has 100K rows including 999, the result will be:
> - one row (999, 999)
> - the remaining rows (null, xxx)
> Once you call show(), the row (999, 999) has only a 1/10th chance of being selected by CollectLimit.
> The actual optimization might be:
> - push down the limit
> - but convert the join to a broadcast LeftOuterJoin or RightOuterJoin.
> Here is my notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html
> {code:java}
> import scala.util.Random._
> val dl = shuffle(1 to 10).toDF("id")
> val dr = shuffle(1 to 10).toDF("id")
> println("data frame dl:")
> dl.explain
> println("data frame dr:")
> dr.explain
> val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1)
> j.explain
> j.show(false)
> {code}
> {code}
> data frame dl:
> == Physical Plan ==
> LocalTableScan [id#10]
> data frame dr:
> == Physical Plan ==
> LocalTableScan [id#16]
> == Physical Plan ==
> CollectLimit 1
> +- SortMergeJoin [id#10], [id#16], FullOuter
>    :- *Sort [id#10 ASC NULLS FIRST], false, 0
>    :  +- Exchange hashpartitioning(id#10, 200)
>    :     +- *LocalLimit 1
>    :        +- LocalTableScan [id#10]
>    +- *Sort [id#16 ASC NULLS FIRST], false, 0
>       +- Exchange hashpartitioning(id#16, 200)
>          +- LocalTableScan [id#16]
> import scala.util.Random._
> dl: org.apache.spark.sql.DataFrame = [id: int]
> dr: org.apache.spark.sql.DataFrame = [id: int]
> j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int]
> +----+---+
> |id  |id |
> +----+---+
> |null|148|
> +----+---+
> {code}
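[Editor's sketch] The flaw described above can be reproduced without Spark. The plain-Python sketch below (hypothetical data, not the reporter's) simulates a full outer join: limiting one input before the join turns every unmatched row from the other side into a null-padded row, so a later CollectLimit(1) returns the matched row only by luck of ordering.

```python
# Spark-free simulation of why pushing LocalLimit below a full outer join
# is unsound.

def full_outer_join(left, right):
    """Full outer join of two int lists on equality, as (l, r) pairs."""
    ls, rs = set(left), set(right)
    out = [(x, x) for x in left if x in rs]           # matched rows
    out += [(x, None) for x in left if x not in rs]   # left-only rows
    out += [(None, y) for y in right if y not in ls]  # right-only rows
    return out

left = list(range(1, 11))
right = list(range(1, 11))

# Correct plan: join first, then limit. Every candidate row is a valid
# answer, and with identical inputs it is always a matched pair.
correct = full_outer_join(left, right)[:1]
assert correct[0][0] == correct[0][1]

# Pushed-down plan: LocalLimit(1) applied to the left input before joining.
pushed = full_outer_join(left[:1], right)
# Only one matched row survives, surrounded by null-padded right-only rows,
# so a final take(1) may return (None, y) instead of the match.
assert sum(1 for l, r in pushed if l == r) == 1
assert sum(1 for l, r in pushed if l is None) == len(right) - 1
```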
[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237133#comment-16237133 ]

Xiao Li commented on SPARK-22211:
---------------------------------

Will submit a PR based on my previous PR https://github.com/apache/spark/pull/10454

> LimitPushDown optimization for FullOuterJoin generates wrong results
> --------------------------------------------------------------------
>
>                 Key: SPARK-22211
>                 URL: https://issues.apache.org/jira/browse/SPARK-22211
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>        Environment: on community.cloude.databrick.com
>                     Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11)
>           Reporter: Benyi Wang
>           Priority: Major
>
> LimitPushDown pushes LocalLimit to one side of a FullOuterJoin, but this may generate a wrong result. Assume we use limit(1) and the LocalLimit is pushed to the left side, selecting id=999; if the right side has 100K rows including 999, the result will be:
> - one row (999, 999)
> - the remaining rows (null, xxx)
> Once you call show(), the row (999, 999) has only a 1/10th chance of being selected by CollectLimit.
> The actual optimization might be:
> - push down the limit
> - but convert the join to a broadcast LeftOuterJoin or RightOuterJoin.
> Here is my notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html
[jira] [Commented] (SPARK-22308) Support unit tests of spark code using ScalaTest using suites other than FunSuite
[ https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237121#comment-16237121 ]

Nathan Kronenfeld commented on SPARK-22308:
-------------------------------------------

OK, found the problem: it was the new tests; they weren't cleaning up after themselves. I'm still trying to get past the Hive issues that were keeping me from using Maven in the first place, but I should have this back to you in the next day or two.

> Support unit tests of spark code using ScalaTest using suites other than FunSuite
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-22308
>                 URL: https://issues.apache.org/jira/browse/SPARK-22308
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation, Spark Core, SQL, Tests
>    Affects Versions: 2.2.0
>           Reporter: Nathan Kronenfeld
>           Assignee: Nathan Kronenfeld
>           Priority: Minor
>             Labels: scalatest, test-suite, test_issue
>
> External codebases that contain Spark code can test it using SharedSparkContext no matter how they write their ScalaTest suites, whether based on FunSuite, FunSpec, FlatSpec, or WordSpec. SharedSQLContext, however, only supports FunSuite.
[jira] [Commented] (SPARK-22427) StackOverFlowError when using FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-22427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237112#comment-16237112 ]

Kazuaki Ishizaki commented on SPARK-22427:
------------------------------------------

Thank you for reporting this issue. Could you please attach the data file, or share the data size along with a sample of the data?

> StackOverFlowError when using FPGrowth
> --------------------------------------
>
>                 Key: SPARK-22427
>                 URL: https://issues.apache.org/jira/browse/SPARK-22427
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.2.0
>        Environment: CentOS Linux 3.10.0-327.el7.x86_64
>                     java 1.8.0.111
>                     spark 2.2.0
>           Reporter: lyt
>           Priority: Normal
[jira] [Assigned] (SPARK-22254) clean up the implementation of `growToSize` in CompactBuffer
[ https://issues.apache.org/jira/browse/SPARK-22254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22254:
------------------------------------
Assignee: Apache Spark

> clean up the implementation of `growToSize` in CompactBuffer
> ------------------------------------------------------------
>
>                 Key: SPARK-22254
>                 URL: https://issues.apache.org/jira/browse/SPARK-22254
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.1
>           Reporter: Feng Liu
>           Assignee: Apache Spark
>           Priority: Major
>
> Two issues:
> 1. arrayMax should be `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH`.
> 2. I believe some `-2` terms were introduced because `Integer.MAX_VALUE` was used previously. We should make the calculation of newArrayLen concise.
[jira] [Assigned] (SPARK-22254) clean up the implementation of `growToSize` in CompactBuffer
[ https://issues.apache.org/jira/browse/SPARK-22254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22254:
------------------------------------
Assignee: (was: Apache Spark)

> clean up the implementation of `growToSize` in CompactBuffer
> ------------------------------------------------------------
>
>                 Key: SPARK-22254
>                 URL: https://issues.apache.org/jira/browse/SPARK-22254
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.1
>           Reporter: Feng Liu
>           Priority: Major
>
> Two issues:
> 1. arrayMax should be `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH`.
> 2. I believe some `-2` terms were introduced because `Integer.MAX_VALUE` was used previously. We should make the calculation of newArrayLen concise.
[jira] [Commented] (SPARK-22254) clean up the implementation of `growToSize` in CompactBuffer
[ https://issues.apache.org/jira/browse/SPARK-22254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237035#comment-16237035 ]

Apache Spark commented on SPARK-22254:
--------------------------------------

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19650

> clean up the implementation of `growToSize` in CompactBuffer
> ------------------------------------------------------------
>
>                 Key: SPARK-22254
>                 URL: https://issues.apache.org/jira/browse/SPARK-22254
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.1
>           Reporter: Feng Liu
>           Priority: Major
>
> Two issues:
> 1. arrayMax should be `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH`.
> 2. I believe some `-2` terms were introduced because `Integer.MAX_VALUE` was used previously. We should make the calculation of newArrayLen concise.
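[Editor's sketch] The growth policy the ticket describes can be outlined in plain Python. This is an illustrative model, not Spark's actual implementation: capacity doubles until it covers the requested size and is clamped at the maximum safe JVM array length (Spark's `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH`, assumed here to be Integer.MAX_VALUE - 15).

```python
# Assumed value of ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH
# (Integer.MAX_VALUE - 15); not taken from this thread.
MAX_ROUNDED_ARRAY_LENGTH = 2**31 - 1 - 15

def grow_to_size(cur_capacity: int, needed: int) -> int:
    """Return the new capacity: doubled until >= needed, clamped at the
    maximum safe array length; raise if the request itself exceeds it."""
    if needed > MAX_ROUNDED_ARRAY_LENGTH:
        raise OverflowError("requested size exceeds max array length")
    new_len = max(cur_capacity, 1)
    while new_len < needed:
        new_len *= 2
    return min(new_len, MAX_ROUNDED_ARRAY_LENGTH)

# Usage: growing a 64-slot buffer to hold 100 elements doubles to 128;
# near the cap, doubling is clamped instead of overflowing.
assert grow_to_size(64, 100) == 128
assert grow_to_size(2**30, 2**30 + 1) == MAX_ROUNDED_ARRAY_LENGTH
```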
[jira] [Commented] (SPARK-21791) ORC should support column names with dot
[ https://issues.apache.org/jira/browse/SPARK-21791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237095#comment-16237095 ]

Apache Spark commented on SPARK-21791:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19651

> ORC should support column names with dot
> ----------------------------------------
>
>                 Key: SPARK-21791
>                 URL: https://issues.apache.org/jira/browse/SPARK-21791
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.1.0, 2.2.0
>           Reporter: Dongjoon Hyun
>           Priority: Major
>
> *PARQUET*
> {code}
> scala> Seq(Some(1), None).toDF("col.dots").write.parquet("/tmp/parquet_dot")
> scala> spark.read.parquet("/tmp/parquet_dot").show
> +--------+
> |col.dots|
> +--------+
> |       1|
> |    null|
> +--------+
> {code}
> *ORC*
> {code}
> scala> Seq(Some(1), None).toDF("col.dots").write.orc("/tmp/orc_dot")
> scala> spark.read.orc("/tmp/orc_dot").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '.' expecting ':'(line 1, pos 10)
> == SQL ==
> struct<col.dots:int>
> ----------^^^
> {code}
[jira] [Commented] (SPARK-20682) Add new ORCFileFormat based on Apache ORC
[ https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237093#comment-16237093 ]

Apache Spark commented on SPARK-20682:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19651

> Add new ORCFileFormat based on Apache ORC
> -----------------------------------------
>
>                 Key: SPARK-20682
>                 URL: https://issues.apache.org/jira/browse/SPARK-20682
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1, 2.2.0
>           Reporter: Dongjoon Hyun
>           Priority: Major
>
> Since SPARK-2883, Apache Spark has supported Apache ORC inside the `sql/hive` module with a Hive dependency. This issue aims to add a new and faster ORC data source inside `sql/core` and to replace the old ORC data source eventually. In this issue, the latest Apache ORC 1.4.0 (released yesterday) is used.
> There are four key benefits:
> - Speed: use both Spark `ColumnarBatch` and ORC `RowBatch` together, which is faster than the current implementation in Spark.
> - Stability: Apache ORC 1.4.0 has many fixes, and we can depend on the ORC community more.
> - Usability: users can use the `ORC` data source without the hive module, i.e. without `-Phive`.
> - Maintainability: reduce the Hive dependency, so the old legacy code can be removed later.
[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe
[ https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237094#comment-16237094 ]

Apache Spark commented on SPARK-15474:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19651

> ORC data source fails to write and read back empty dataframe
> -------------------------------------------------------------
>
>                 Key: SPARK-15474
>                 URL: https://issues.apache.org/jira/browse/SPARK-15474
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.1.1, 2.2.0
>           Reporter: Hyukjin Kwon
>           Priority: Major
>
> Currently, the ORC data source fails to write and read back empty data.
> The code below:
> {code}
> val emptyDf = spark.range(10).limit(0)
> emptyDf.write
>   .format("orc")
>   .save(path.getCanonicalPath)
> val copyEmptyDf = spark.read
>   .format("orc")
>   .load(path.getCanonicalPath)
> copyEmptyDf.show()
> {code}
> throws the exception below:
> {code}
> Unable to infer schema for ORC at /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. It must be specified manually;
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. It must be specified manually;
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140)
>   at org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892)
>   at org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884)
>   at org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
> {code}
> Note that this is a different case from the data below:
> {code}
> val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
> {code}
> In this case, no writer is initialised or created (no calls of {{WriterContainer.writeRows()}}).
> For Parquet and JSON it works, but ORC does not.
[jira] [Updated] (SPARK-20682) Add new ORCFileFormat based on Apache ORC
[ https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-20682:
----------------------------------
Summary: Add new ORCFileFormat based on Apache ORC (was: Support a new faster ORC data source based on Apache ORC)

> Add new ORCFileFormat based on Apache ORC
> -----------------------------------------
>
>                 Key: SPARK-20682
>                 URL: https://issues.apache.org/jira/browse/SPARK-20682
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1, 2.2.0
>           Reporter: Dongjoon Hyun
>           Priority: Major
[jira] [Updated] (SPARK-22433) Linear regression R^2 train/test terminology related
[ https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Teng Peng updated SPARK-22433:
------------------------------
Description:
Traditional statistics is traditional statistics: its goals, framework, and terminology are not the same as ML's. However, in the linear-regression-related components this distinction is not clear, which shows up as:

1. regressionMetric + regressionEvaluator:
* R2 shouldn't be there.
* A better name would be "regressionPredictionMetric".

2. LinearRessionSuite:
* Shouldn't test R2 and residuals on test data.
* There is no train set or test set in this setting.

3. Terminology: there is no "linear regression with L1 regularization". Linear regression is linear; once a penalty term is added, it is no longer plain linear regression. Just call it "LASSO" or "ElasticNet".

There are more, and I am working on correcting them. They are not breaking anything, but it does not feel good to see the basic distinction blurred.

was:
Traditional statistics is traditional statistics: its goals, framework, and terminology are not the same as ML's. However, in the linear-regression-related components this distinction is not clear, which shows up as:

1. regressionMetric + regressionEvaluator:
* R2 shouldn't be there.
* A better name would be "regressionPredictionMetric".

2. LinearregRessionSuite:
* Shouldn't test R2 and residuals on test data.
* There is no train set or test set in this setting.

3. Terminology: there is no "linear regression with L1 regularization". Linear regression is linear; once a penalty term is added, it is no longer plain linear regression. Just call it "LASSO" or "ElasticNet".

There are more, and I am working on correcting them. They are not breaking anything, but it does not feel good to see the basic distinction blurred.

> Linear regression R^2 train/test terminology related
> -----------------------------------------------------
>
>                 Key: SPARK-22433
>                 URL: https://issues.apache.org/jira/browse/SPARK-22433
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0
>           Reporter: Teng Peng
>           Priority: Minor
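[Editor's sketch] The ticket's point that R^2 should not be evaluated on test data can be shown numerically. In the plain-Python sketch below (synthetic numbers, not from Spark), R^2 is defined relative to the variance of the data it is computed on, so on held-out data it can even be negative, which is one reason it reads as a training-fit diagnostic rather than a generic prediction metric.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# A constant model fitted to the training targets' mean:
train_y = [1.0, 2.0, 3.0]
model = sum(train_y) / len(train_y)  # predicts 2.0 everywhere

# On the training data, R^2 is exactly 0 by construction.
assert r2(train_y, [model] * 3) == 0.0

# On shifted held-out targets, the same model gets a negative R^2.
test_y = [10.0, 10.5, 11.0]
assert r2(test_y, [model] * 3) < 0
```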
[jira] [Updated] (SPARK-22433) Linear regression R^2 train/test terminology related
[ https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Teng Peng updated SPARK-22433:
------------------------------
Description:
Traditional statistics is traditional statistics: its goals, framework, and terminology are not the same as ML's. However, in the linear-regression-related components this distinction is not clear, which shows up as:

1. regressionMetric + regressionEvaluator:
* R2 shouldn't be there.
* A better name would be "regressionPredictionMetric".

2. LinearRegressionSuite:
* Shouldn't test R2 and residuals on test data.
* There is no train set or test set in this setting.

3. Terminology: there is no "linear regression with L1 regularization". Linear regression is linear; once a penalty term is added, it is no longer plain linear regression. Just call it "LASSO" or "ElasticNet".

There are more, and I am working on correcting them. They are not breaking anything, but it does not feel good to see the basic distinction blurred.

was:
Traditional statistics is traditional statistics: its goals, framework, and terminology are not the same as ML's. However, in the linear-regression-related components this distinction is not clear, which shows up as:

1. regressionMetric + regressionEvaluator:
* R2 shouldn't be there.
* A better name would be "regressionPredictionMetric".

2. LinearRessionSuite:
* Shouldn't test R2 and residuals on test data.
* There is no train set or test set in this setting.

3. Terminology: there is no "linear regression with L1 regularization". Linear regression is linear; once a penalty term is added, it is no longer plain linear regression. Just call it "LASSO" or "ElasticNet".

There are more, and I am working on correcting them. They are not breaking anything, but it does not feel good to see the basic distinction blurred.

> Linear regression R^2 train/test terminology related
> -----------------------------------------------------
>
>                 Key: SPARK-22433
>                 URL: https://issues.apache.org/jira/browse/SPARK-22433
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0
>           Reporter: Teng Peng
>           Priority: Minor
[jira] [Updated] (SPARK-22433) Linear regression R^2 train/test terminology related
[ https://issues.apache.org/jira/browse/SPARK-22433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Teng Peng updated SPARK-22433: -- Description: Traditional statistics is traditional statistics. Their goal, framework, and terminologies are not the same as ML. However, in linear regression related components, this distinction is not clear, which is reflected: 1. regressionMetric + regressionEvaluator : * R2 shouldn't be there. * A better name "regressionPredictionMetric". 2. LinearregRessionSuite: * Shouldn't test R2 and residuals on test data. * There is no train set and test set in this setting. 3. Terminology: there is no "linear regression with L1 regularization". Linear regression is linear. Adding a penalty term, then it is no longer linear. Just call it "LASSO", "ElasticNet". There are more. I am working on correcting them. They are not breaking anything, but it does not make one feel good to see the basic distinction is blurred. was: Traditional statistics is traditional statistics. Their goal, framework, and terminologies are not the same as ML. However, in linear regression related components, this distinction is not clear, which is reflected: 1. regressionMetric + regressionEvaluator : * R2 shouldn't be there. * A better name "regressionPredictionMetric". 2. LinearregRessionSuite: * Shouldn't test R2 and residuals on test data. * There is no train set and test set in this setting. 3. Terminology: there is no "linear regression with L1 regularization". Linear regression is linear. Adding a penalty term, then it is no longer linear. Just call it "LASSO", "ElasticNet". There are more. I am working on correcting them. They are not breaking anything, but it does not make one feel good to see the basic distinction is blurred. 
> Linear regression R^2 train/test terminology related > - > > Key: SPARK-22433 > URL: https://issues.apache.org/jira/browse/SPARK-22433 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Teng Peng >Priority: Minor > > Traditional statistics is traditional statistics. Their goal, framework, and > terminologies are not the same as ML. However, in linear regression related > components, this distinction is not clear, which is reflected: > 1. regressionMetric + regressionEvaluator : > * R2 shouldn't be there. > * A better name "regressionPredictionMetric". > 2. LinearRegressionSuite: > * Shouldn't test R2 and residuals on test data. > * There is no train set and test set in this setting. > 3. Terminology: there is no "linear regression with L1 regularization". > Linear regression is linear. Adding a penalty term, then it is no longer > linear. Just call it "LASSO", "ElasticNet". > There are more. I am working on correcting them. > They are not breaking anything, but it does not make one feel good to see the > basic distinction is blurred. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22433) Linear regression R^2 train/test terminology related
Teng Peng created SPARK-22433: - Summary: Linear regression R^2 train/test terminology related Key: SPARK-22433 URL: https://issues.apache.org/jira/browse/SPARK-22433 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0 Reporter: Teng Peng Priority: Minor Traditional statistics is traditional statistics. Their goal, framework, and terminologies are not the same as ML. However, in linear regression related components, this distinction is not clear, which is reflected: 1. regressionMetric + regressionEvaluator : * R2 shouldn't be there. * A better name "regressionPredictionMetric". 2. LinearRegressionSuite: * Shouldn't test R2 and residuals on test data. * There is no train set and test set in this setting. 3. Terminology: there is no "linear regression with L1 regularization". Linear regression is linear. Adding a penalty term, then it is no longer linear. Just call it "LASSO", "ElasticNet". There are more. I am working on correcting them. They are not breaking anything, but it does not make one feel good to see the basic distinction is blurred. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
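The statistical point above can be made concrete with a small, self-contained sketch (plain Scala with hypothetical data, not Spark code): R^2 compares residual error against variation around the mean of the observed responses, which is why it reads as an in-sample fit statistic rather than a held-out prediction metric.

```scala
// Minimal sketch (hypothetical data): R^2 = 1 - SS_res / SS_tot,
// where SS_tot is taken around the mean of the observed y values.
object R2Sketch {
  def r2(y: Seq[Double], yHat: Seq[Double]): Double = {
    val mean  = y.sum / y.length
    val ssTot = y.map(v => (v - mean) * (v - mean)).sum
    val ssRes = y.zip(yHat).map { case (v, p) => (v - p) * (v - p) }.sum
    1.0 - ssRes / ssTot
  }
}
```

For a perfect in-sample fit `r2` returns 1.0, and for a model that only predicts the training mean it returns 0.0; on a genuinely held-out set the statistic can even go negative, which is part of why the issue argues it does not belong among prediction metrics.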
[jira] [Commented] (SPARK-22405) Enrich the event information and add new event of ExternalCatalogEvent
[ https://issues.apache.org/jira/browse/SPARK-22405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236950#comment-16236950 ] Apache Spark commented on SPARK-22405: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/19649 > Enrich the event information and add new event of ExternalCatalogEvent > -- > > Key: SPARK-22405 > URL: https://issues.apache.org/jira/browse/SPARK-22405 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Priority: Minor > > We're building a data lineage tool in which we need to monitor the metadata > changes in {{ExternalCatalog}}; the current {{ExternalCatalog}} already provides > several useful events like "CreateDatabaseEvent" for a custom SparkListener to > use. But the information provided by such events is not rich enough; for > example, {{CreateTablePreEvent}} only provides the "database" name and "table" > name, not all the table metadata, which makes it hard for users to get all the > useful table-related information. > So here we propose to add a new {{ExternalCatalogEvent}} and enrich the > current existing events for all the catalog-related updates. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22405) Enrich the event information and add new event of ExternalCatalogEvent
[ https://issues.apache.org/jira/browse/SPARK-22405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22405: Assignee: (was: Apache Spark) > Enrich the event information and add new event of ExternalCatalogEvent > -- > > Key: SPARK-22405 > URL: https://issues.apache.org/jira/browse/SPARK-22405 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Priority: Minor > > We're building a data lineage tool in which we need to monitor the metadata > changes in {{ExternalCatalog}}; the current {{ExternalCatalog}} already provides > several useful events like "CreateDatabaseEvent" for a custom SparkListener to > use. But the information provided by such events is not rich enough; for > example, {{CreateTablePreEvent}} only provides the "database" name and "table" > name, not all the table metadata, which makes it hard for users to get all the > useful table-related information. > So here we propose to add a new {{ExternalCatalogEvent}} and enrich the > current existing events for all the catalog-related updates. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22405) Enrich the event information and add new event of ExternalCatalogEvent
[ https://issues.apache.org/jira/browse/SPARK-22405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22405: Assignee: Apache Spark > Enrich the event information and add new event of ExternalCatalogEvent > -- > > Key: SPARK-22405 > URL: https://issues.apache.org/jira/browse/SPARK-22405 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Assignee: Apache Spark >Priority: Minor > > We're building a data lineage tool in which we need to monitor the metadata > changes in {{ExternalCatalog}}; the current {{ExternalCatalog}} already provides > several useful events like "CreateDatabaseEvent" for a custom SparkListener to > use. But the information provided by such events is not rich enough; for > example, {{CreateTablePreEvent}} only provides the "database" name and "table" > name, not all the table metadata, which makes it hard for users to get all the > useful table-related information. > So here we propose to add a new {{ExternalCatalogEvent}} and enrich the > current existing events for all the catalog-related updates. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22426) Spark AM launching containers on node where External spark shuffle service failed to initialize
[ https://issues.apache.org/jira/browse/SPARK-22426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236939#comment-16236939 ] Saisai Shao commented on SPARK-22426: - This kind of scenario was handled in SPARK-13669 via the blacklist mechanism. > Spark AM launching containers on node where External spark shuffle service > failed to initialize > --- > > Key: SPARK-22426 > URL: https://issues.apache.org/jira/browse/SPARK-22426 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph >Priority: Major > > When the Spark External Shuffle Service on a NodeManager fails, the remote > executors will fail while fetching the data from the executors launched on > this Node. The Spark ApplicationMaster should not launch containers on this Node, > or should not use the external shuffle service. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14516) Clustering evaluator
[ https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236915#comment-16236915 ] Apache Spark commented on SPARK-14516: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/19648 > Clustering evaluator > > > Key: SPARK-14516 > URL: https://issues.apache.org/jira/browse/SPARK-14516 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: zhengruifeng >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.0 > > > MLlib does not have any general-purpose clustering metrics with a ground > truth. > In > [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics), > there are several kinds of metrics for this. > It may be meaningful to add some clustering metrics into MLlib. > This should be added as a {{ClusteringEvaluator}} class extending > {{Evaluator}} in spark.ml. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
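As an illustration of the kind of metric such an evaluator could expose, here is a minimal, self-contained sketch of the mean silhouette coefficient in plain Scala (made-up 2-D points and cluster ids; Spark's eventual {{ClusteringEvaluator}} API and its distributed implementation differ):

```scala
// Sketch of the mean silhouette coefficient for 2-D points.
// For each point: a = mean distance to its own cluster's other members,
// b = lowest mean distance to any other cluster; s = (b - a) / max(a, b).
// Assumes at least two clusters.
object SilhouetteSketch {
  def dist(a: (Double, Double), b: (Double, Double)): Double =
    math.sqrt(math.pow(a._1 - b._1, 2) + math.pow(a._2 - b._2, 2))

  // points: (coordinates, assigned cluster id)
  def silhouette(points: Seq[((Double, Double), Int)]): Double = {
    val byCluster: Map[Int, Seq[(Double, Double)]] =
      points.groupBy(_._2).map { case (c, ps) => c -> ps.map(_._1) }
    val scores = points.map { case (p, c) =>
      val own = byCluster(c).filterNot(_ == p)
      val a = if (own.isEmpty) 0.0 else own.map(dist(p, _)).sum / own.size
      val b = byCluster.collect {
        case (c2, ps) if c2 != c => ps.map(dist(p, _)).sum / ps.size
      }.min
      (b - a) / math.max(a, b)
    }
    scores.sum / scores.size
  }
}
```

Scores near 1 indicate tight, well-separated clusters; scores near 0 or below indicate overlapping assignments, which makes the metric usable without ground-truth labels.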
[jira] [Assigned] (SPARK-21087) CrossValidator, TrainValidationSplit should collect all models when fitting: Scala API
[ https://issues.apache.org/jira/browse/SPARK-21087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-21087: - Assignee: Weichen Xu > CrossValidator, TrainValidationSplit should collect all models when fitting: > Scala API > -- > > Key: SPARK-21087 > URL: https://issues.apache.org/jira/browse/SPARK-21087 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Major > > Add a parameter controlling whether to collect the full model list during > CrossValidator/TrainValidationSplit training (default is off, to avoid the > change causing OOM). > Add a method in CrossValidatorModel/TrainValidationSplitModel allowing the > user to get the model list. > CrossValidatorModelWriter adds an "option" allowing the user to control whether > to persist the model list to disk. > Note: when persisting the model list, use indices as the sub-model path. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21087) CrossValidator, TrainValidationSplit should collect all models when fitting: Scala API
[ https://issues.apache.org/jira/browse/SPARK-21087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21087: -- Shepherd: Joseph K. Bradley > CrossValidator, TrainValidationSplit should collect all models when fitting: > Scala API > -- > > Key: SPARK-21087 > URL: https://issues.apache.org/jira/browse/SPARK-21087 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Major > > Add a parameter controlling whether to collect the full model list during > CrossValidator/TrainValidationSplit training (default is off, to avoid the > change causing OOM). > Add a method in CrossValidatorModel/TrainValidationSplitModel allowing the > user to get the model list. > CrossValidatorModelWriter adds an "option" allowing the user to control whether > to persist the model list to disk. > Note: when persisting the model list, use indices as the sub-model path. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
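A minimal sketch of the proposed API shape (the class and the "training" logic below are stand-ins, not Spark's actual implementation): sub-models are retained only when the caller opts in, so the default memory footprint is unchanged.

```scala
// Hypothetical stand-in for a trained sub-model: the parameter it was
// trained with and its validation score.
case class Model(param: Double, score: Double)

// Sketch of an opt-in collectSubModels flag on a cross-validator.
class CrossValidatorSketch(val collectSubModels: Boolean = false) {
  var subModels: Option[Seq[Model]] = None // populated only when opted in

  def fit(paramGrid: Seq[Double]): Model = {
    // Stand-in "training": score each candidate parameter.
    val trained = paramGrid.map(p => Model(p, score = -math.abs(p - 0.5)))
    if (collectSubModels) subModels = Some(trained) // may be memory-heavy
    trained.maxBy(_.score)                          // best model, as today
  }
}
```

Keeping the flag off by default matches the OOM concern in the description: collecting one model per grid point multiplies driver memory use by the grid size.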
[jira] [Assigned] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22211: Assignee: (was: Apache Spark) > LimitPushDown optimization for FullOuterJoin generates wrong results > > > Key: SPARK-22211 > URL: https://issues.apache.org/jira/browse/SPARK-22211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: on community.cloude.databrick.com > Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11) >Reporter: Benyi Wang >Priority: Major > > LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may > generate a wrong result: > Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 > is selected, but at right side we have 100K rows including 999, the result > will be > - one row is (999, 999) > - the rest rows are (null, xxx) > Once you call show(), the row (999,999) has only 1/10th chance to be > selected by CollectLimit. > The actual optimization might be, > - push down limit > - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin. 
> Here is my notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html > {code:java} > import scala.util.Random._ > val dl = shuffle(1 to 10).toDF("id") > val dr = shuffle(1 to 10).toDF("id") > println("data frame dl:") > dl.explain > println("data frame dr:") > dr.explain > val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1) > j.explain > j.show(false) > {code} > {code} > data frame dl: > == Physical Plan == > LocalTableScan [id#10] > data frame dr: > == Physical Plan == > LocalTableScan [id#16] > == Physical Plan == > CollectLimit 1 > +- SortMergeJoin [id#10], [id#16], FullOuter >:- *Sort [id#10 ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#10, 200) >: +- *LocalLimit 1 >:+- LocalTableScan [id#10] >+- *Sort [id#16 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16, 200) > +- LocalTableScan [id#16] > import scala.util.Random._ > dl: org.apache.spark.sql.DataFrame = [id: int] > dr: org.apache.spark.sql.DataFrame = [id: int] > j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int] > ++---+ > |id |id | > ++---+ > |null|148| > ++---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
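The bug can be reproduced without Spark at all. The plain-collections sketch below (a hypothetical helper, not Catalyst code) shows why pushing a limit below a full outer join is unsound: limiting the left side first strands the remaining right-side rows as null-padded output, which is exactly the `|null|148|` row seen above.

```scala
// Pure-collections model of the bug: join-then-limit vs limit-then-join.
object LimitPushdownSketch {
  // Toy full outer join on equal values (assumes distinct elements per side).
  def fullOuterJoin(l: Seq[Int], r: Seq[Int]): Seq[(Option[Int], Option[Int])] = {
    val matched = l.filter(r.contains).map(i => (Option(i), Option(i)))
    val lOnly   = l.filterNot(r.contains).map(i => (Option(i), Option.empty[Int]))
    val rOnly   = r.filterNot(l.contains).map(i => (Option.empty[Int], Option(i)))
    matched ++ lOnly ++ rOnly
  }

  val left  = Seq(1, 2, 3, 4, 5)
  val right = Seq(1, 2, 3, 4, 5)

  // Correct plan: join first, then limit — every id pairs up, so any
  // single row we keep is a matched (x, x) pair.
  val correct = fullOuterJoin(left, right).take(1)

  // Buggy "optimization": limit the left side first, then join — the
  // other right-side ids survive only as (null, x) rows.
  val buggy = fullOuterJoin(left.take(1), right)
}
```

With the limit pushed down, four of the five output rows are null-padded, so a `CollectLimit 1` on top is far more likely to return a `(null, x)` row than the matched pair.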
[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236861#comment-16236861 ] Apache Spark commented on SPARK-22211: -- User 'henryr' has created a pull request for this issue: https://github.com/apache/spark/pull/19647 > LimitPushDown optimization for FullOuterJoin generates wrong results > > > Key: SPARK-22211 > URL: https://issues.apache.org/jira/browse/SPARK-22211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: on community.cloude.databrick.com > Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11) >Reporter: Benyi Wang >Priority: Major > > LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may > generate a wrong result: > Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 > is selected, but at right side we have 100K rows including 999, the result > will be > - one row is (999, 999) > - the rest rows are (null, xxx) > Once you call show(), the row (999,999) has only 1/10th chance to be > selected by CollectLimit. > The actual optimization might be, > - push down limit > - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin. 
> Here is my notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html > {code:java} > import scala.util.Random._ > val dl = shuffle(1 to 10).toDF("id") > val dr = shuffle(1 to 10).toDF("id") > println("data frame dl:") > dl.explain > println("data frame dr:") > dr.explain > val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1) > j.explain > j.show(false) > {code} > {code} > data frame dl: > == Physical Plan == > LocalTableScan [id#10] > data frame dr: > == Physical Plan == > LocalTableScan [id#16] > == Physical Plan == > CollectLimit 1 > +- SortMergeJoin [id#10], [id#16], FullOuter >:- *Sort [id#10 ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#10, 200) >: +- *LocalLimit 1 >:+- LocalTableScan [id#10] >+- *Sort [id#16 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16, 200) > +- LocalTableScan [id#16] > import scala.util.Random._ > dl: org.apache.spark.sql.DataFrame = [id: int] > dr: org.apache.spark.sql.DataFrame = [id: int] > j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int] > ++---+ > |id |id | > ++---+ > |null|148| > ++---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22211: Assignee: Apache Spark > LimitPushDown optimization for FullOuterJoin generates wrong results > > > Key: SPARK-22211 > URL: https://issues.apache.org/jira/browse/SPARK-22211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: on community.cloude.databrick.com > Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11) >Reporter: Benyi Wang >Assignee: Apache Spark >Priority: Major > > LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may > generate a wrong result: > Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 > is selected, but at right side we have 100K rows including 999, the result > will be > - one row is (999, 999) > - the rest rows are (null, xxx) > Once you call show(), the row (999,999) has only 1/10th chance to be > selected by CollectLimit. > The actual optimization might be, > - push down limit > - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin. 
> Here is my notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html > {code:java} > import scala.util.Random._ > val dl = shuffle(1 to 10).toDF("id") > val dr = shuffle(1 to 10).toDF("id") > println("data frame dl:") > dl.explain > println("data frame dr:") > dr.explain > val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1) > j.explain > j.show(false) > {code} > {code} > data frame dl: > == Physical Plan == > LocalTableScan [id#10] > data frame dr: > == Physical Plan == > LocalTableScan [id#16] > == Physical Plan == > CollectLimit 1 > +- SortMergeJoin [id#10], [id#16], FullOuter >:- *Sort [id#10 ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#10, 200) >: +- *LocalLimit 1 >:+- LocalTableScan [id#10] >+- *Sort [id#16 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16, 200) > +- LocalTableScan [id#16] > import scala.util.Random._ > dl: org.apache.spark.sql.DataFrame = [id: int] > dr: org.apache.spark.sql.DataFrame = [id: int] > j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int] > ++---+ > |id |id | > ++---+ > |null|148| > ++---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22429) Streaming checkpointing code does not retry after failure due to NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-22429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-22429: - Component/s: (was: Structured Streaming) DStreams > Streaming checkpointing code does not retry after failure due to > NullPointerException > - > > Key: SPARK-22429 > URL: https://issues.apache.org/jira/browse/SPARK-22429 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.3, 2.2.0 >Reporter: Tristan Stevens > > CheckpointWriteHandler has a built in retry mechanism. However > SPARK-14930/SPARK-13693 put in a fix to de-allocate the fs object, yet > initialises it in the wrong place for the while loop, and therefore on > attempt 2 it fails with a NPE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
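A self-contained sketch of the fix described above (all names are made up; this is not Spark's actual CheckpointWriteHandler): because the filesystem handle is released after a failed attempt, it must be (re)initialized inside the retry loop, not once before it — otherwise attempt 2 dereferences null.

```scala
// Sketch of the retry pattern. FakeFs stands in for the Hadoop FileSystem
// handle; pendingFailures is a test knob for how many writes fail first.
object RetrySketch {
  var pendingFailures = 0

  class FakeFs {
    def write(): Unit =
      if (pendingFailures > 0) { pendingFailures -= 1; throw new RuntimeException("transient") }
  }

  // Returns the attempt number that succeeded, or -1 if all attempts failed.
  def checkpoint(maxAttempts: Int): Int = {
    var attempts = 0
    var fs: FakeFs = null
    while (attempts < maxAttempts) {
      attempts += 1
      try {
        if (fs == null) fs = new FakeFs() // the fix: (re)initialize inside the loop
        fs.write()
        return attempts
      } catch {
        case _: Exception => fs = null    // handle released after a failed attempt
      }
    }
    -1
  }
}
```

Moving the `new FakeFs()` line above the `while` reproduces the reported shape of the bug: the first failure nulls the handle and the second pass crashes with an NPE instead of retrying.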
[jira] [Commented] (SPARK-22147) BlockId.hashCode allocates a StringBuilder/String on each call
[ https://issues.apache.org/jira/browse/SPARK-22147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236789#comment-16236789 ] Apache Spark commented on SPARK-22147: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/19646 > BlockId.hashCode allocates a StringBuilder/String on each call > -- > > Key: SPARK-22147 > URL: https://issues.apache.org/jira/browse/SPARK-22147 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 2.2.0 >Reporter: Sergei Lebedev >Assignee: Sergei Lebedev >Priority: Minor > Fix For: 2.3.0 > > > The base class {{BlockId}} > [defines|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockId.scala#L44] > {{hashCode}} and {{equals}} for all its subclasses in terms of {{name}}. > This makes the definitions of different ID types [very > concise|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockId.scala#L52]. > The downside, however, is redundant allocations. While I don't think this > could be the major issue, it is still a bit disappointing to increase GC > pressure on the driver for nothing. For our machine learning workloads, we've > seen as much as 10% of all allocations on the driver coming from > {{BlockId.hashCode}} calls done for > [BlockManagerMasterEndpoint.blockLocations|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L54]. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
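One way to avoid the repeated allocation, sketched in plain Scala (this mirrors the problem shape but uses made-up class names and is not necessarily Spark's eventual fix): cache the hash with a `lazy val` so the name string is built at most once per instance instead of on every `hashCode` call.

```scala
// The problem shape: the base class derives hashCode from a formatted
// name, so every hashCode call allocates a StringBuilder and a String.
abstract class BlockIdSketch {
  def name: String
  override def hashCode: Int = name.hashCode // allocates `name` each call
}

// One remedy: a subclass overrides the def with a lazy val, paying the
// string allocation at most once per instance.
case class RDDBlockIdSketch(rddId: Int, splitIndex: Int) extends BlockIdSketch {
  def name: String = "rdd_" + rddId + "_" + splitIndex
  override lazy val hashCode: Int = name.hashCode // cached after first use
}
```

This matters on the driver because structures like `blockLocations` hash the same ids repeatedly; a per-instance cache removes those allocations without changing hash semantics.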
[jira] [Updated] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
[ https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-22306: Fix Version/s: 2.3.0 > INFER_AND_SAVE overwrites important metadata in Parquet Metastore table > --- > > Key: SPARK-22306 > URL: https://issues.apache.org/jira/browse/SPARK-22306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Hive 2.3.0 (PostgreSQL metastore, stored as Parquet) > Spark 2.2.0 >Reporter: David Malinge >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.2.1, 2.3.0 > > > I noticed some critical changes on my hive tables and realized that they were > caused by a simple select on SparkSQL. Looking at the logs, I found out that > this select was actually performing an update on the database "Saving > case-sensitive schema for table". > I then found out that Spark 2.2.0 introduces a new default value for > spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE > The issue is that this update changes critical metadata of the table, in > particular: > - changes the owner to the current user > - removes bucketing metadata (BUCKETING_COLS, SDS) > - removes sorting metadata (SORT_COLS) > Switching the property to: NEVER_INFER prevents the issue. > Also, note that the damage can be fixed manually in Hive with e.g.: > {code:sql} > alter table [table_name] > clustered by ([col1], [col2]) > sorted by ([colA], [colB]) > into [n] buckets > {code} > *REPRODUCE (branch-2.2)* > In Spark 2.1.x (branch-2.1), NEVER_INFER is used. Spark 2.3 (master) branch > is good due to SPARK-17729. This is a regression on Spark 2.2 only. By > default, Parquet Hive table is affected and only Hive may suffer from this. > {code} > hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) > INTO 10 BUCKETS STORED AS PARQUET; > hive> INSERT INTO t VALUES('a','b'); > hive> DESC FORMATTED t; > ... 
> Num Buckets: 10 > Bucket Columns: [a, b] > Sort Columns: [Order(col:a, order:1), Order(col:b, order:1)] > scala> sql("SELECT * FROM t").show(false) > hive> DESC FORMATTED t; > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22432) Allow long creation site to be logged for RDDs
Michael Mior created SPARK-22432: Summary: Allow long creation site to be logged for RDDs Key: SPARK-22432 URL: https://issues.apache.org/jira/browse/SPARK-22432 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.0 Reporter: Michael Mior Would be interested in adding an option to store the long version of the RDD call site in the {{RDDInfo}} structure as opposed to the short one. This would allow the long version to appear in the event logs and the Spark UI and would be useful for debugging. I'm happy to submit a patch for this. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22431) Creating Permanent view with illegal type
Herman van Hovell created SPARK-22431: - Summary: Creating Permanent view with illegal type Key: SPARK-22431 URL: https://issues.apache.org/jira/browse/SPARK-22431 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Herman van Hovell Priority: Major It is possible in Spark SQL to create a permanent view that uses an nested field with an illegal name. For example if we create the following view: {noformat} create view x as select struct('a' as `$q`, 1 as b) q {noformat} A simple select fails with the following exception: {noformat} select * from x; org.apache.spark.SparkException: Cannot recognize hive type string: struct<$q:string,b:int> at org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378) ... {noformat} Dropping the view isn't possible either: {noformat} drop view x; org.apache.spark.SparkException: Cannot recognize hive type string: struct<$q:string,b:int> at org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378) ... {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
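A sketch of the kind of up-front validation that could surface this earlier (the identifier rule below is an assumption for illustration, not Hive's exact grammar): reject nested field names that Hive's type-string parser cannot round-trip at view-creation time, instead of failing later on SELECT or DROP.

```scala
// Hypothetical check: a field name like "$q" survives view creation but
// later breaks fromHiveColumn's parse of "struct<$q:string,b:int>".
// The regex here is an illustrative assumption, not Hive's real grammar.
object FieldNameCheckSketch {
  def isParseSafe(fieldName: String): Boolean =
    fieldName.matches("[a-zA-Z_][a-zA-Z0-9_]*")

  // Validate every field of a struct type before persisting the view.
  def validateStruct(fields: Seq[String]): Either[String, Unit] =
    fields.find(f => !isParseSafe(f)) match {
      case Some(bad) => Left(s"illegal nested field name: $bad")
      case None      => Right(())
    }
}
```

Failing fast at CREATE VIEW would also avoid the worst part of the report: a view that cannot even be dropped once created.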
[jira] [Created] (SPARK-22430) Unknown tag warnings when building R docs with Roxygen 6.0.1
Joel Croteau created SPARK-22430: Summary: Unknown tag warnings when building R docs with Roxygen 6.0.1 Key: SPARK-22430 URL: https://issues.apache.org/jira/browse/SPARK-22430 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 2.3.0 Environment: Roxygen 6.0.1 Reporter: Joel Croteau When building R docs using create-rd.sh with Roxygen 6.0.1, a large number of unknown tag warnings are generated: {noformat} Warning: @export [schema.R#33]: unknown tag Warning: @export [schema.R#53]: unknown tag Warning: @export [schema.R#63]: unknown tag Warning: @export [schema.R#80]: unknown tag Warning: @export [schema.R#123]: unknown tag Warning: @export [schema.R#141]: unknown tag Warning: @export [schema.R#216]: unknown tag Warning: @export [generics.R#388]: unknown tag Warning: @export [generics.R#403]: unknown tag Warning: @export [generics.R#407]: unknown tag Warning: @export [generics.R#414]: unknown tag Warning: @export [generics.R#418]: unknown tag Warning: @export [generics.R#422]: unknown tag Warning: @export [generics.R#428]: unknown tag Warning: @export [generics.R#432]: unknown tag Warning: @export [generics.R#438]: unknown tag Warning: @export [generics.R#442]: unknown tag Warning: @export [generics.R#446]: unknown tag Warning: @export [generics.R#450]: unknown tag Warning: @export [generics.R#454]: unknown tag Warning: @export [generics.R#459]: unknown tag Warning: @export [generics.R#467]: unknown tag Warning: @export [generics.R#475]: unknown tag Warning: @export [generics.R#479]: unknown tag Warning: @export [generics.R#483]: unknown tag Warning: @export [generics.R#487]: unknown tag Warning: @export [generics.R#498]: unknown tag Warning: @export [generics.R#502]: unknown tag Warning: @export [generics.R#506]: unknown tag Warning: @export [generics.R#512]: unknown tag Warning: @export [generics.R#518]: unknown tag Warning: @export [generics.R#526]: unknown tag Warning: @export [generics.R#530]: unknown tag Warning: @export [generics.R#534]: 
unknown tag Warning: @export [generics.R#538]: unknown tag Warning: @export [generics.R#542]: unknown tag Warning: @export [generics.R#549]: unknown tag Warning: @export [generics.R#556]: unknown tag Warning: @export [generics.R#560]: unknown tag Warning: @export [generics.R#567]: unknown tag Warning: @export [generics.R#571]: unknown tag Warning: @export [generics.R#575]: unknown tag Warning: @export [generics.R#579]: unknown tag Warning: @export [generics.R#583]: unknown tag Warning: @export [generics.R#587]: unknown tag Warning: @export [generics.R#591]: unknown tag Warning: @export [generics.R#595]: unknown tag Warning: @export [generics.R#599]: unknown tag Warning: @export [generics.R#603]: unknown tag Warning: @export [generics.R#607]: unknown tag Warning: @export [generics.R#611]: unknown tag Warning: @export [generics.R#615]: unknown tag Warning: @export [generics.R#619]: unknown tag Warning: @export [generics.R#623]: unknown tag Warning: @export [generics.R#627]: unknown tag Warning: @export [generics.R#631]: unknown tag Warning: @export [generics.R#635]: unknown tag Warning: @export [generics.R#639]: unknown tag Warning: @export [generics.R#643]: unknown tag Warning: @export [generics.R#647]: unknown tag Warning: @export [generics.R#654]: unknown tag Warning: @export [generics.R#658]: unknown tag Warning: @export [generics.R#663]: unknown tag Warning: @export [generics.R#667]: unknown tag Warning: @export [generics.R#672]: unknown tag Warning: @export [generics.R#676]: unknown tag Warning: @export [generics.R#680]: unknown tag Warning: @export [generics.R#684]: unknown tag Warning: @export [generics.R#690]: unknown tag Warning: @export [generics.R#696]: unknown tag Warning: @export [generics.R#702]: unknown tag Warning: @export [generics.R#706]: unknown tag Warning: @export [generics.R#710]: unknown tag Warning: @export [generics.R#716]: unknown tag Warning: @export [generics.R#720]: unknown tag Warning: @export [generics.R#726]: unknown tag Warning: 
@export [generics.R#730]: unknown tag Warning: @export [generics.R#734]: unknown tag Warning: @export [generics.R#738]: unknown tag Warning: @export [generics.R#742]: unknown tag Warning: @export [generics.R#750]: unknown tag Warning: @export [generics.R#754]: unknown tag Warning: @export [generics.R#758]: unknown tag Warning: @export [generics.R#766]: unknown tag Warning: @export [generics.R#770]: unknown tag Warning: @export [generics.R#774]: unknown tag Warning: @export [generics.R#778]: unknown tag Warning: @export [generics.R#782]: unknown tag Warning: @export [generics.R#786]: unknown tag Warning: @export [generics.R#790]: unknown tag Warning: @export [generics.R#794]: unknown tag Warning: @export [generics.R#799]: unknown tag Warning: @export [generics.R#803]: unknown tag Warning: @export [generics.R#807]: unknown tag Warning: @export [g
[jira] [Commented] (SPARK-22429) Streaming checkpointing code does not retry after failure due to NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-22429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236492#comment-16236492 ] Tristan Stevens commented on SPARK-22429: - [~srowen] I've raised a PR against branch-2.2. master would not compile for me (before I made changes), but the patch should apply cleanly on there too. > Streaming checkpointing code does not retry after failure due to > NullPointerException > - > > Key: SPARK-22429 > URL: https://issues.apache.org/jira/browse/SPARK-22429 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 1.6.3, 2.2.0 >Reporter: Tristan Stevens > > CheckpointWriteHandler has a built in retry mechanism. However > SPARK-14930/SPARK-13693 put in a fix to de-allocate the fs object, yet > initialises it in the wrong place for the while loop, and therefore on > attempt 2 it fails with a NPE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22429) Streaming checkpointing code does not retry after failure due to NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-22429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22429: Assignee: Apache Spark > Streaming checkpointing code does not retry after failure due to > NullPointerException > - > > Key: SPARK-22429 > URL: https://issues.apache.org/jira/browse/SPARK-22429 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 1.6.3, 2.2.0 >Reporter: Tristan Stevens >Assignee: Apache Spark > > CheckpointWriteHandler has a built in retry mechanism. However > SPARK-14930/SPARK-13693 put in a fix to de-allocate the fs object, yet > initialises it in the wrong place for the while loop, and therefore on > attempt 2 it fails with a NPE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22429) Streaming checkpointing code does not retry after failure due to NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-22429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22429: Assignee: (was: Apache Spark) > Streaming checkpointing code does not retry after failure due to > NullPointerException > - > > Key: SPARK-22429 > URL: https://issues.apache.org/jira/browse/SPARK-22429 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 1.6.3, 2.2.0 >Reporter: Tristan Stevens > > CheckpointWriteHandler has a built in retry mechanism. However > SPARK-14930/SPARK-13693 put in a fix to de-allocate the fs object, yet > initialises it in the wrong place for the while loop, and therefore on > attempt 2 it fails with a NPE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22429) Streaming checkpointing code does not retry after failure due to NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-22429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236489#comment-16236489 ] Apache Spark commented on SPARK-22429: -- User 'tmgstevens' has created a pull request for this issue: https://github.com/apache/spark/pull/19645 > Streaming checkpointing code does not retry after failure due to > NullPointerException > - > > Key: SPARK-22429 > URL: https://issues.apache.org/jira/browse/SPARK-22429 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 1.6.3, 2.2.0 >Reporter: Tristan Stevens > > CheckpointWriteHandler has a built in retry mechanism. However > SPARK-14930/SPARK-13693 put in a fix to de-allocate the fs object, yet > initialises it in the wrong place for the while loop, and therefore on > attempt 2 it fails with a NPE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22401) Missing 2.1.2 tag in git
[ https://issues.apache.org/jira/browse/SPARK-22401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk reassigned SPARK-22401: --- Assignee: holdenk > Missing 2.1.2 tag in git > > > Key: SPARK-22401 > URL: https://issues.apache.org/jira/browse/SPARK-22401 > Project: Spark > Issue Type: Bug > Components: Build, Deploy >Affects Versions: 2.1.2 >Reporter: Brian Barker >Assignee: holdenk >Priority: Minor > Fix For: 2.1.2 > > > We only saw a 2.1.2-rc4 tag in git, no official release. The releases web > page shows 2.1.2 was released on October 9. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22401) Missing 2.1.2 tag in git
[ https://issues.apache.org/jira/browse/SPARK-22401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-22401. - Resolution: Fixed > Missing 2.1.2 tag in git > > > Key: SPARK-22401 > URL: https://issues.apache.org/jira/browse/SPARK-22401 > Project: Spark > Issue Type: Bug > Components: Build, Deploy >Affects Versions: 2.1.2 >Reporter: Brian Barker >Assignee: holdenk >Priority: Minor > Fix For: 2.1.2 > > > We only saw a 2.1.2-rc4 tag in git, no official release. The releases web > page shows 2.1.2 was released on October 9. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22401) Missing 2.1.2 tag in git
[ https://issues.apache.org/jira/browse/SPARK-22401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-22401: - Fix Version/s: 2.1.2 > Missing 2.1.2 tag in git > > > Key: SPARK-22401 > URL: https://issues.apache.org/jira/browse/SPARK-22401 > Project: Spark > Issue Type: Bug > Components: Build, Deploy >Affects Versions: 2.1.2 >Reporter: Brian Barker >Priority: Minor > Fix For: 2.1.2 > > > We only saw a 2.1.2-rc4 tag in git, no official release. The releases web > page shows 2.1.2 was released on October 9. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22401) Missing 2.1.2 tag in git
[ https://issues.apache.org/jira/browse/SPARK-22401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236407#comment-16236407 ] Holden Karau commented on SPARK-22401: -- Pushed, looking at the scripts they are all for tagging the RCs. > Missing 2.1.2 tag in git > > > Key: SPARK-22401 > URL: https://issues.apache.org/jira/browse/SPARK-22401 > Project: Spark > Issue Type: Bug > Components: Build, Deploy >Affects Versions: 2.1.2 >Reporter: Brian Barker >Priority: Minor > Fix For: 2.1.2 > > > We only saw a 2.1.2-rc4 tag in git, no official release. The releases web > page shows 2.1.2 was released in October 9. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20807) Add compression/decompression of data to ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-20807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-20807. -- Resolution: Won't Fix > Add compression/decompression of data to ColumnVector > - > > Key: SPARK-20807 > URL: https://issues.apache.org/jira/browse/SPARK-20807 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > While the current {{CachedBatch}} can compress data by using one of multiple > compression schemes, {{ColumnVector}} cannot compress data. It is mandatory > for the table cache. > This JIRA adds compression/decompression to {{ColumnVector}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21505) A dynamic join operator to improve the join reliability
[ https://issues.apache.org/jira/browse/SPARK-21505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236365#comment-16236365 ] Zhan Zhang commented on SPARK-21505: Any comments on this feature? Do you think the design is OK? If so, we are going to submit a PR. > A dynamic join operator to improve the join reliability > --- > > Key: SPARK-21505 > URL: https://issues.apache.org/jira/browse/SPARK-21505 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0, 2.3.0, 3.0.0 >Reporter: Lin >Priority: Major > Labels: features > > As we know, hash join is more efficient than sort-merge join. But today hash > join is not so widely used because it may fail with an OutOfMemory (OOM) error > due to limited memory resources, data skew, statistics mis-estimation and so > on. For example, if we apply shuffle hash join on an unevenly distributed > dataset, some partitions might be so large that we cannot build a Hash table > for that particular partition, causing an OOM error. When OOM happens, current > Spark throws an Exception, resulting in job failure. On the > other hand, if sort-merge join is used, there will be shuffle, sorting and > extra spill, degrading the join. Considering the efficiency > of hash join, we propose a fallback mechanism to dynamically use hash > join or sort-merge join at runtime, at the task level, to provide a more reliable > join operation. > This new dynamic join operator internally implements the logic of HashJoin, > Iterator Reconstruct, Sort, and MergeJoin. We show the process of this > dynamic join method as follows: > HashJoin: We start by building a Hash table on one side of the join partitions. > If the Hash table is built successfully, this is the same as the current > ShuffledHashJoin operator. > Sort: If we fail to build the Hash table due to the large partition size, we do > SortMergeJoin only on this partition.
But we need to rebuild the partition first. When OOM > happens, a Hash table corresponding to a partial part of this partition has > already been built successfully (e.g. the first 4000 rows of the RDD), and the iterator of > this partition is now pointing to the 4001st row of the partition. We reuse this > hash table to reconstruct the iterator for the first 4000 rows and > concatenate it with the remaining rows of this partition so that we can rebuild the > partition completely. On this re-built partition, we apply sorting based on > key values. > MergeJoin: After getting two sorted Iterators, we perform a regular merge join > on them and emit the records to downstream operators. > Iterator Reconstruct: The BytesToBytesMap has to be spilled to disk to release > memory for other operators, such as Sort, Join, etc. In addition, it has > to be converted to an Iterator, so that it can be concatenated with the remaining > items in the original iterator that was used to build the hash table. > Metadata Population: Necessary metadata, such as sorting keys, join type, > etc., has to be populated, so that it can be used by the potential Sort and > MergeJoin operators. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
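The fallback flow proposed above (try the hash path; on a memory failure, sort both sides and merge-join instead) can be pictured with a small sketch. This is illustrative plain Python with hypothetical names, not Spark's actual operators; a simple size budget stands in for the OOM condition:

```python
# Illustrative sketch of the dynamic join: try to build a hash table for the
# partition; if that fails (MemoryError stands in for OOM), fall back to
# sorting both inputs and merge-joining them. Inputs are (key, value) pairs.

def hash_join(build, probe, budget):
    """Shuffled-hash-join path: build a hash table on one side."""
    if len(build) > budget:
        raise MemoryError("partition too large to hash")  # stand-in for OOM
    table = {}
    for k, v in build:
        table.setdefault(k, []).append(v)
    return [(k, v, w) for k, w in probe for v in table.get(k, [])]

def sort_merge_join(left, right):
    """Fallback path: sort both inputs by key, then merge."""
    left, right = sorted(left), sorted(right)
    out, i = [], 0
    for k, v in left:
        while i < len(right) and right[i][0] < k:
            i += 1                      # advance to the first candidate key
        j = i
        while j < len(right) and right[j][0] == k:
            out.append((k, v, right[j][1]))
            j += 1
    return out

def dynamic_join(left, right, budget):
    """Prefer the hash path; fall back to sort-merge when it cannot fit."""
    try:
        return hash_join(left, right, budget)
    except MemoryError:
        return sort_merge_join(left, right)
```

Both paths produce the same rows (possibly in a different order), which is what makes a per-task runtime fallback safe.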
[jira] [Assigned] (SPARK-22243) streaming job failed to restart from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-22243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-22243: Assignee: StephenZou > streaming job failed to restart from checkpoint > --- > > Key: SPARK-22243 > URL: https://issues.apache.org/jira/browse/SPARK-22243 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0, 2.2.0 >Reporter: StephenZou >Assignee: StephenZou >Priority: Major > Fix For: 2.3.0 > > Attachments: CheckpointTest.scala > > > My spark-defaults.conf has an item related to the issue; I upload all jars in > Spark's jars folder to the HDFS path: > spark.yarn.jars hdfs:///spark/cache/spark2.2/* > The streaming job failed to restart from checkpoint, and ApplicationMaster throws > "Error: Could not find or load main class > org.apache.spark.deploy.yarn.ExecutorLauncher". The problem is always > reproducible. > I examined the SparkConf object recovered from the checkpoint, and found > spark.yarn.jars set to empty, which means none of the jars exist on the AM side. The > solution is that spark.yarn.jars should be reloaded from the properties file when > recovering from a checkpoint. > Attached is a demo to reproduce the issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22243) streaming job failed to restart from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-22243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-22243. -- Resolution: Fixed Fix Version/s: 2.3.0 > streaming job failed to restart from checkpoint > --- > > Key: SPARK-22243 > URL: https://issues.apache.org/jira/browse/SPARK-22243 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0, 2.2.0 >Reporter: StephenZou >Priority: Major > Fix For: 2.3.0 > > Attachments: CheckpointTest.scala > > > My spark-defaults.conf has an item related to the issue; I upload all jars in > Spark's jars folder to the HDFS path: > spark.yarn.jars hdfs:///spark/cache/spark2.2/* > The streaming job failed to restart from checkpoint, and ApplicationMaster throws > "Error: Could not find or load main class > org.apache.spark.deploy.yarn.ExecutorLauncher". The problem is always > reproducible. > I examined the SparkConf object recovered from the checkpoint, and found > spark.yarn.jars set to empty, which means none of the jars exist on the AM side. The > solution is that spark.yarn.jars should be reloaded from the properties file when > recovering from a checkpoint. > Attached is a demo to reproduce the issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
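The fix described in this report amounts to re-applying deploy-time settings over the configuration recovered from the checkpoint. A minimal sketch in plain Python with hypothetical names (not Spark's actual recovery code):

```python
# Sketch: when a conf is recovered from a checkpoint, re-apply deploy-time
# properties (such as spark.yarn.jars) from spark-defaults.conf so they are
# not lost. Confs are modeled as plain dicts here.

def recover_conf(checkpointed, defaults, reload_keys=("spark.yarn.jars",)):
    """Return the checkpointed conf with selected deploy-time keys restored."""
    conf = dict(checkpointed)
    for key in reload_keys:
        if not conf.get(key) and key in defaults:
            conf[key] = defaults[key]  # restore entries lost in the checkpoint
    return conf
```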
[jira] [Commented] (SPARK-22429) Streaming checkpointing code does not retry after failure due to NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-22429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236258#comment-16236258 ] Sean Owen commented on SPARK-22429: --- Sounds straightforward -- feel free to open a pull request. > Streaming checkpointing code does not retry after failure due to > NullPointerException > - > > Key: SPARK-22429 > URL: https://issues.apache.org/jira/browse/SPARK-22429 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 1.6.3, 2.2.0 >Reporter: Tristan Stevens > > CheckpointWriteHandler has a built in retry mechanism. However > SPARK-14930/SPARK-13693 put in a fix to de-allocate the fs object, yet > initialises it in the wrong place for the while loop, and therefore on > attempt 2 it fails with a NPE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22416) Move OrcOptions from `sql/hive` to `sql/core`
[ https://issues.apache.org/jira/browse/SPARK-22416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22416. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19636 [https://github.com/apache/spark/pull/19636] > Move OrcOptions from `sql/hive` to `sql/core` > - > > Key: SPARK-22416 > URL: https://issues.apache.org/jira/browse/SPARK-22416 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > Fix For: 2.3.0 > > > According to the > [discussion|https://github.com/apache/spark/pull/19571#issuecomment-339472976] > on SPARK-15474, we will add new OrcFileFormat in `sql/core` module. > For that, `OrcOptions` should be visible like `private[sql]` in `sql/core` > module, too. Previously, it was `private[orc]` in `sql/hive`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22416) Move OrcOptions from `sql/hive` to `sql/core`
[ https://issues.apache.org/jira/browse/SPARK-22416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22416: --- Assignee: Dongjoon Hyun > Move OrcOptions from `sql/hive` to `sql/core` > - > > Key: SPARK-22416 > URL: https://issues.apache.org/jira/browse/SPARK-22416 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.3.0 > > > According to the > [discussion|https://github.com/apache/spark/pull/19571#issuecomment-339472976] > on SPARK-15474, we will add new OrcFileFormat in `sql/core` module. > For that, `OrcOptions` should be visible like `private[sql]` in `sql/core` > module, too. Previously, it was `private[orc]` in `sql/hive`. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22254) clean up the implementation of `growToSize` in CompactBuffer
[ https://issues.apache.org/jira/browse/SPARK-22254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236105#comment-16236105 ] Kazuaki Ishizaki commented on SPARK-22254: -- I started working on this, and will submit a PR within a few days. > clean up the implementation of `growToSize` in CompactBuffer > > > Key: SPARK-22254 > URL: https://issues.apache.org/jira/browse/SPARK-22254 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Feng Liu >Priority: Major > > Two issues: > 1. the arrayMax should be `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH` > 2. I believe some `-2` were introduced because `Integer.MAX_VALUE` was used > previously. We should make the calculation of newArrayLen concise. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
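The cleanup being proposed can be pictured with a small sketch (illustrative Python, not the Scala CompactBuffer code): once the cap is a pre-rounded maximum array length, the growth calculation reduces to "double, but at least `needed`, clamped to the cap", with no ad-hoc `-2` adjustments. The cap value below is a stand-in, not Spark's exact constant:

```python
# Sketch of a concise growToSize capacity calculation. ARRAY_MAX is an
# illustrative stand-in for a pre-rounded maximum JVM array length.

ARRAY_MAX = 2**31 - 16  # hypothetical rounded cap, slightly under Integer.MAX_VALUE

def new_array_len(capacity, needed, array_max=ARRAY_MAX):
    """Return a new capacity >= needed: doubled, but clamped to array_max."""
    if needed > array_max:
        raise ValueError("requested size exceeds the maximum array length")
    # Doubling amortizes growth cost; max() handles jumps past one doubling.
    return min(max(capacity * 2, needed), array_max)
```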
[jira] [Commented] (SPARK-22344) Prevent R CMD check from using /tmp
[ https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236103#comment-16236103 ] Felix Cheung commented on SPARK-22344: -- Yes to both. If SPARK_HOME is set before calling install.spark then we are not installing it. Boy it's getting complicated. > Prevent R CMD check from using /tmp > --- > > Key: SPARK-22344 > URL: https://issues.apache.org/jira/browse/SPARK-22344 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0 >Reporter: Shivaram Venkataraman >Priority: Major > > When R CMD check is run on the SparkR package it leaves behind files in /tmp > which is a violation of CRAN policy. We should instead write to Rtmpdir. > Notes from CRAN are below > {code} > Checking this leaves behind dirs >hive/$USER >$USER > and files named like >b4f6459b-0624-4100-8358-7aa7afbda757_resources > in /tmp, in violation of the CRAN Policy. > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22429) Streaming checkpointing code does not retry after failure due to NullPointerException
Tristan Stevens created SPARK-22429: --- Summary: Streaming checkpointing code does not retry after failure due to NullPointerException Key: SPARK-22429 URL: https://issues.apache.org/jira/browse/SPARK-22429 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.2.0, 1.6.3 Reporter: Tristan Stevens CheckpointWriteHandler has a built in retry mechanism. However SPARK-14930/SPARK-13693 put in a fix to de-allocate the fs object, yet initialises it in the wrong place for the while loop, and therefore on attempt 2 it fails with a NPE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
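The failure mode described in this report can be illustrated with a minimal, language-agnostic sketch. This is plain Python with hypothetical names, not Spark's actual CheckpointWriteHandler: the filesystem handle is acquired once, before the retry loop, and set to None after a failed attempt, so the second attempt dereferences a null handle instead of retrying with a fresh one.

```python
# Sketch of the retry bug: a handle released on failure is never re-acquired,
# so the retry dereferences None (TypeError here plays the role of the NPE).

attempts_log = {"writes": 0}

def flaky_write(fs):
    """Simulated checkpoint write: fails transiently on the first call."""
    if fs is None:
        raise TypeError("fs is None")          # analogue of the reported NPE
    attempts_log["writes"] += 1
    if attempts_log["writes"] == 1:
        raise IOError("transient HDFS failure")
    return "checkpoint written"

def checkpoint_buggy(max_attempts=3):
    fs = "fs-handle"                           # initialised outside the loop
    for _ in range(max_attempts):
        try:
            return flaky_write(fs)
        except IOError:
            fs = None                          # de-allocated, never re-acquired

def checkpoint_fixed(max_attempts=3):
    for _ in range(max_attempts):
        fs = "fs-handle"                       # re-acquired on every attempt
        try:
            return flaky_write(fs)
        except IOError:
            fs = None
    raise RuntimeError("all attempts failed")
```

In the buggy variant the second attempt crashes instead of retrying; moving the handle acquisition inside the loop restores the intended retry behaviour.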
[jira] [Resolved] (SPARK-22329) Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default
[ https://issues.apache.org/jira/browse/SPARK-22329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-22329. --- Resolution: Won't Fix > Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default > -- > > Key: SPARK-22329 > URL: https://issues.apache.org/jira/browse/SPARK-22329 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Priority: Critical > > In Spark 2.2.0, `spark.sql.hive.caseSensitiveInferenceMode` has a critical > issue. > - SPARK-19611 uses `INFER_AND_SAVE` at 2.2.0 since Spark 2.1.0 breaks some > Hive tables backed by case-sensitive data files. > bq. This situation will occur for any Hive table that wasn't created by Spark > or that was created prior to Spark 2.1.0. If a user attempts to run a query > over such a table containing a case-sensitive field name in the query > projection or in the query filter, the query will return 0 results in every > case. > - However, SPARK-22306 reports this also corrupts the Hive Metastore schema by > removing bucketing information (BUCKETING_COLS, SORT_COLS) and changing the owner. > - Since Spark 2.3.0 supports Bucketing, BUCKETING_COLS and SORT_COLS look > okay at least. However, we need to figure out the issue of changing owners. > Also, we cannot backport the bucketing patch into `branch-2.2`. We need more > tests before releasing 2.3.0. > The Hive Metastore is a shared resource and Spark should not corrupt it by > default. This issue proposes to recover that option back to `NEVER_INFER` like > Spark 2.2.0 by default. Users can take the risk of enabling `INFER_AND_SAVE` by > themselves. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
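Since this was resolved as Won't Fix, users who share a Hive Metastore can pin the safe behaviour themselves. A spark-defaults.conf fragment (the key and value names are taken from the report above):

```
# spark-defaults.conf -- opt out of schema inference explicitly, so that a
# plain SELECT cannot rewrite Hive Metastore metadata as described above
spark.sql.hive.caseSensitiveInferenceMode  NEVER_INFER
```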
[jira] [Resolved] (SPARK-22369) PySpark: Document methods of spark.catalog interface
[ https://issues.apache.org/jira/browse/SPARK-22369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-22369. - Resolution: Fixed Assignee: Hyukjin Kwon Fix Version/s: 2.3.0 > PySpark: Document methods of spark.catalog interface > > > Key: SPARK-22369 > URL: https://issues.apache.org/jira/browse/SPARK-22369 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.3.0 > > > The following methods from the {{spark.catalog}} interface are not documented. > {code:java} > $ pyspark > >>> dir(spark.catalog) > ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', > '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', > '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', > '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', > '__str__', '__subclasshook__', '__weakref__', '_jcatalog', '_jsparkSession', > '_reset', '_sparkSession', 'cacheTable', 'clearCache', 'createExternalTable', > 'createTable', 'currentDatabase', 'dropGlobalTempView', 'dropTempView', > 'isCached', 'listColumns', 'listDatabases', 'listFunctions', 'listTables', > 'recoverPartitions', 'refreshByPath', 'refreshTable', 'registerFunction', > 'setCurrentDatabase', 'uncacheTable'] > {code} > As a user I would like to have these methods documented on > http://spark.apache.org/docs/latest/api/python/pyspark.sql.html . Old methods > of the SQLContext (e.g. {{pyspark.sql.SQLContext.cacheTable()}} vs. > {{pyspark.sql.SparkSession.catalog.cacheTable()}} or > {{pyspark.sql.HiveContext.refreshTable()}} vs. > {{pyspark.sql.SparkSession.catalog.refreshTable()}} ) should point to the new > method. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22419) Hive and Hive Thriftserver jars missing from "without hadoop" build
[ https://issues.apache.org/jira/browse/SPARK-22419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235783#comment-16235783 ] Sean Owen commented on SPARK-22419: --- Yes, it's useful for future reference. Spark should work fine with 2.6 and later. Honestly, the existence of a 2.6/2.7 build is vestigial at this point. You should not need your own build, in the main. Making your own build might help version conflicts, but really you're looking at a log4j config issue in that case. > Hive and Hive Thriftserver jars missing from "without hadoop" build > --- > > Key: SPARK-22419 > URL: https://issues.apache.org/jira/browse/SPARK-22419 > Project: Spark > Issue Type: Question > Components: Build >Affects Versions: 2.1.1 >Reporter: Adam Kramer >Priority: Minor > > The "without hadoop" binary distribution does not have hive-related libraries > in the jars directory. This may be due to Hive being tied to major releases > of Hadoop. My project requires using Hadoop 2.8, so "without hadoop" version > seemed the best option. Should I use the make-distribution.sh instead? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22419) Hive and Hive Thriftserver jars missing from "without hadoop" build
[ https://issues.apache.org/jira/browse/SPARK-22419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235774#comment-16235774 ] Adam Kramer edited comment on SPARK-22419 at 11/2/17 2:01 PM: -- I'll assume it's on purpose for my stated reasons above. Apologies for not posting to the mailing list, but I have a feeling this could act as a good web reference from search; I rarely get results from the mailing list while troubleshooting in Google. Also, the documentation for using Spark with upgraded versions of Hadoop (e.g. 2.8) is definitely lacking or at best confusing (i.e. a binary version including a version of Hadoop libs can still be configured to use another version of Hadoop by following instructions from the "without hadoop" wiki page). I suspect those instructions are old, but when using SPARK_DIST_CLASSPATH to override the Hadoop libraries you run into things like log4j.properties files being hijacked by the Hadoop version, which changes your application logging altogether. My guess is that it's something that likely worked well a while ago or in a very specific situation, and thus requires a lot of trial and error. was (Author: adamjk): I'll assume it's on purpose for my stated reasons above. Apologies for not posting to the mailing list, but I have a feeling this could act as a good web reference from search; I rarely get results from the mailing list while troubleshooting in Google. Also, the documentation for using Spark with upgraded versions of Hadoop (e.g. 2.8) is definitely lacking or at best confusing (i.e. a binary version including a version of Hadoop libs can still be configured to use another version of Hadoop by following instructions from the "without hadoop" wiki page). I suspect those instructions are old, but when using SPARK_DIST_CLASSPATH to override the Hadoop libraries you run into things like log4j.properties files being hijacked by the Hadoop version, which changes your application logging altogether.
My guess is that it's something that likely worked well a while ago or in a very specific situation, and requires a lot of investigation. > Hive and Hive Thriftserver jars missing from "without hadoop" build > --- > > Key: SPARK-22419 > URL: https://issues.apache.org/jira/browse/SPARK-22419 > Project: Spark > Issue Type: Question > Components: Build >Affects Versions: 2.1.1 >Reporter: Adam Kramer >Priority: Minor > > The "without hadoop" binary distribution does not have hive-related libraries > in the jars directory. This may be due to Hive being tied to major releases > of Hadoop. My project requires using Hadoop 2.8, so "without hadoop" version > seemed the best option. Should I use the make-distribution.sh instead? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22419) Hive and Hive Thriftserver jars missing from "without hadoop" build
[ https://issues.apache.org/jira/browse/SPARK-22419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235774#comment-16235774 ] Adam Kramer commented on SPARK-22419: - I'll assume it's on purpose for my stated reasons above. Apologies for not posting to the mailing list, but I have a feeling this could act as a good web reference from search; I rarely get results from the mailing list while troubleshooting in Google. Also, the documentation for using Spark with upgraded versions of Hadoop (e.g. 2.8) is definitely lacking or at best confusing (i.e. a binary version including a version of Hadoop libs can still be configured to use another version of Hadoop by following instructions from the "without hadoop" wiki page). I suspect those instructions are old, but when using SPARK_DIST_CLASSPATH to override the Hadoop libraries you run into things like log4j.properties files being hijacked by the Hadoop version, which changes your application logging altogether. My guess is that it's something that likely worked well a while ago or in a very specific situation, and requires a lot of investigation. > Hive and Hive Thriftserver jars missing from "without hadoop" build > --- > > Key: SPARK-22419 > URL: https://issues.apache.org/jira/browse/SPARK-22419 > Project: Spark > Issue Type: Question > Components: Build >Affects Versions: 2.1.1 >Reporter: Adam Kramer >Priority: Minor > > The "without hadoop" binary distribution does not have hive-related libraries > in the jars directory. This may be due to Hive being tied to major releases > of Hadoop. My project requires using Hadoop 2.8, so "without hadoop" version > seemed the best option. Should I use the make-distribution.sh instead? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
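For reference, the "without hadoop" build is wired to a user-provided Hadoop via SPARK_DIST_CLASSPATH. The commonly documented spark-env.sh fragment is below; note that everything `hadoop classpath` returns, including any config directories carrying a log4j.properties, ends up on Spark's classpath, which is how the logging hijack described in the comment above can occur:

```
# conf/spark-env.sh -- point a "without hadoop" build at an existing
# Hadoop installation (this puts Hadoop's jars, plus any config files
# on its classpath, onto Spark's classpath)
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
```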
[jira] [Commented] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
[ https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235725#comment-16235725 ] Apache Spark commented on SPARK-22306: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/19644 > INFER_AND_SAVE overwrites important metadata in Parquet Metastore table > --- > > Key: SPARK-22306 > URL: https://issues.apache.org/jira/browse/SPARK-22306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet) > Spark 2.2.0 >Reporter: David Malinge >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.2.1 > > > I noticed some critical changes on my hive tables and realized that they were > caused by a simple select on SparkSQL. Looking at the logs, I found out that > this select was actually performing an update on the database "Saving > case-sensitive schema for table". > I then found out that Spark 2.2.0 introduces a new default value for > spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE > The issue is that this update changes critical metadata of the table, in > particular: > - changes the owner to the current user > - removes bucketing metadata (BUCKETING_COLS, SDS) > - removes sorting metadata (SORT_COLS) > Switching the property to: NEVER_INFER prevents the issue. > Also, note that the damage can be fixed manually in Hive with e.g.: > {code:sql} > alter table [table_name] > clustered by ([col1], [col2]) > sorted by ([colA], [colB]) > into [n] buckets > {code} > *REPRODUCE (branch-2.2)* > In Spark 2.1.x (branch-2.1), NEVER_INFER is used. Spark 2.3 (master) branch > is good due to SPARK-17729. This is a regression on Spark 2.2 only. By > default, Parquet Hive table is affected and only Hive may suffer from this. 
> {code} > hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) > INTO 10 BUCKETS STORED AS PARQUET; > hive> INSERT INTO t VALUES('a','b'); > hive> DESC FORMATTED t; > ... > Num Buckets: 10 > Bucket Columns: [a, b] > Sort Columns: [Order(col:a, order:1), Order(col:b, order:1)] > scala> sql("SELECT * FROM t").show(false) > hive> DESC FORMATTED t; > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
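Until the 2.2.1 fix is available, the issue description above says switching the inference mode prevents the destructive write-back. A sketch of that setting (NEVER_INFER matches the 2.1.x behavior; values other than the property name and the modes named in this issue are not taken from this thread):

```properties
# spark-defaults.conf -- stop Spark 2.2.0 from saving inferred
# case-sensitive schemas back to the Hive metastore, which is the
# update that drops the bucketing and sorting metadata
spark.sql.hive.caseSensitiveInferenceMode  NEVER_INFER
```

INFER_ONLY is the third mode of this property; it also avoids the metastore write, at the cost of re-inferring the schema in each session.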
[jira] [Resolved] (SPARK-22145) Issues with driver re-starting on mesos (supervise)
[ https://issues.apache.org/jira/browse/SPARK-22145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22145. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19374 [https://github.com/apache/spark/pull/19374] > Issues with driver re-starting on mesos (supervise) > --- > > Key: SPARK-22145 > URL: https://issues.apache.org/jira/browse/SPARK-22145 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Major > Fix For: 2.3.0 > > > There are two issues with driver re-starting on mesos using the supervise > flag: > - We need to add spark.mesos.driver.frameworkId to the reloaded properties > for checkpointing, otherwise the new frameworkId propagated by the dispatcher > will be overwritten by the checkpointed data. > - Unique driver task ids are not used by the dispatcher: > https://issues.apache.org/jira/browse/MESOS-4737 > https://issues.apache.org/jira/browse/MESOS-3070 > This issue is the same in principle as in the case with standalone mode where > the master needs to re-launch drivers with a new appId (driverId) to deal > with network partitions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22145) Issues with driver re-starting on mesos (supervise)
[ https://issues.apache.org/jira/browse/SPARK-22145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-22145: - Assignee: Stavros Kontopoulos > Issues with driver re-starting on mesos (supervise) > --- > > Key: SPARK-22145 > URL: https://issues.apache.org/jira/browse/SPARK-22145 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Assignee: Stavros Kontopoulos >Priority: Major > Fix For: 2.3.0 > > > There are two issues with driver re-starting on mesos using the supervise > flag: > - We need to add spark.mesos.driver.frameworkId to the reloaded properties > for checkpointing, otherwise the new frameworkId propagated by the dispatcher > will be overwritten by the checkpointed data. > - Unique driver task ids are not used by the dispatcher: > https://issues.apache.org/jira/browse/MESOS-4737 > https://issues.apache.org/jira/browse/MESOS-3070 > This issue is the same in principle as in the case with standalone mode where > the master needs to re-launch drivers with a new appId (driverId) to deal > with network partitions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21725) spark thriftserver insert overwrite table partition select
[ https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido resolved SPARK-21725. - Resolution: Not A Bug > spark thriftserver insert overwrite table partition select > --- > > Key: SPARK-21725 > URL: https://issues.apache.org/jira/browse/SPARK-21725 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 > Environment: centos 6.7 spark 2.1 jdk8 >Reporter: xinzhang >Priority: Major > Labels: spark-sql > > use thriftserver create table with partitions. > session 1: > SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) > partitioned by (pt string) stored as parquet; > --ok > !exit > session 2: > SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) > partitioned by (pt string) stored as parquet; > --ok > !exit > session 3: > --connect the thriftserver > SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 > partition(pt='1') select count(1) count from tmp_11; > --ok > !exit > session 4(do it again): > --connect the thriftserver > SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 > partition(pt='1') select count(1) count from tmp_11; > --error > !exit > - > 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing > query, currentState RUNNING, > java.lang.reflect.InvocationTargetException > .. > .. 
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move > source > hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053 > 512282-2/-ext-1/part-0 to destination > hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0 > at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644) > at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711) > at > org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403) > at > org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324) > ... 45 more > Caused by: java.io.IOException: Filesystem closed > > - > The doc describing Parquet tables is here: > http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files > Hive metastore Parquet table conversion > When reading from and writing to Hive metastore Parquet tables, Spark SQL > will try to use its own Parquet support instead of Hive SerDe for better > performance. This behavior is controlled by the > spark.sql.hive.convertMetastoreParquet configuration, and is turned on by > default. > I am confused: the problem appears with partitioned tables but everything is > fine with non-partitioned tables. Does this mean Spark does not use its own > Parquet support? Could someone suggest how I can avoid the issue? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
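Since the quoted documentation says the built-in Parquet path is controlled by spark.sql.hive.convertMetastoreParquet, one thing a reader could try is disabling the conversion so Hive's SerDe handles the partitioned writes. Whether this avoids the "Filesystem closed" error above is not confirmed anywhere in this thread; it is only a sketch of the setting the docs mention.

```properties
# spark-defaults.conf -- make Spark SQL use the Hive SerDe instead of its
# built-in Parquet support for Hive metastore Parquet tables (default: true)
spark.sql.hive.convertMetastoreParquet  false
```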
[jira] [Commented] (SPARK-22398) Partition directories with leading 0s cause wrong results
[ https://issues.apache.org/jira/browse/SPARK-22398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235710#comment-16235710 ] Marco Gaido commented on SPARK-22398: - [~viirya] I see your point. Thanks for your answer. > Partition directories with leading 0s cause wrong results > - > > Key: SPARK-22398 > URL: https://issues.apache.org/jira/browse/SPARK-22398 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bogdan Raducanu >Priority: Major > > Repro case: > {code} > spark.range(8).selectExpr("'0' || cast(id as string) as id", "id as > b").write.mode("overwrite").partitionBy("id").parquet("/tmp/bug1") > spark.read.parquet("/tmp/bug1").where("id in ('01')").show > +---+---+ > | b| id| > +---+---+ > +---+---+ > spark.read.parquet("/tmp/bug1").where("id = '01'").show > +---+---+ > | b| id| > +---+---+ > | 1| 1| > +---+---+ > {code} > I think somewhere there is some special handling of this case for equals but > not the same for IN. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22398) Partition directories with leading 0s cause wrong results
[ https://issues.apache.org/jira/browse/SPARK-22398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235708#comment-16235708 ] Marco Gaido commented on SPARK-22398: - [~hyukjin.kwon] I think there are two points here: 1) partitions with leading 0s are interpreted as integers (I think this is wrong behavior, but it can be avoided by disabling type inference) 2) IN type coercion with literals behaves differently from type coercion in other parts. Given the title of the JIRA, I thought the best option was to track 1 here and open a new JIRA with a relevant title for 2. > Partition directories with leading 0s cause wrong results > - > > Key: SPARK-22398 > URL: https://issues.apache.org/jira/browse/SPARK-22398 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bogdan Raducanu >Priority: Major > > Repro case: > {code} > spark.range(8).selectExpr("'0' || cast(id as string) as id", "id as > b").write.mode("overwrite").partitionBy("id").parquet("/tmp/bug1") > spark.read.parquet("/tmp/bug1").where("id in ('01')").show > +---+---+ > | b| id| > +---+---+ > +---+---+ > spark.read.parquet("/tmp/bug1").where("id = '01'").show > +---+---+ > | b| id| > +---+---+ > | 1| 1| > +---+---+ > {code} > I think somewhere there is some special handling of this case for equals but > not the same for IN. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
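The type inference mentioned in point 1 is partition-column type inference, which can be switched off so a directory value like "01" stays a string. A sketch of the relevant setting (this addresses only the inference point, not the IN type-coercion point):

```properties
# spark-defaults.conf -- keep partition directory values such as "01" as
# strings instead of inferring numeric partition column types (default: true)
spark.sql.sources.partitionColumnTypeInference.enabled  false
```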
[jira] [Resolved] (SPARK-22408) RelationalGroupedDataset's distinct pivot value calculation launches unnecessary stages
[ https://issues.apache.org/jira/browse/SPARK-22408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-22408. - Resolution: Fixed Assignee: Patrick Woody Fix Version/s: 2.3.0 > RelationalGroupedDataset's distinct pivot value calculation launches > unnecessary stages > --- > > Key: SPARK-22408 > URL: https://issues.apache.org/jira/browse/SPARK-22408 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Patrick Woody >Assignee: Patrick Woody > Fix For: 2.3.0 > > > When calculating the distinct values for a pivot in RelationalGroupedDataset > (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala#L322), > we sort before doing a take(maxValues + 1). > We should be able to improve this by adding a global limit before the sort, > which should reduce the work of the sort, and by simply doing a collect to > avoid launching multiple stages as part of the take. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11421) Add the ability to add a jar to the current class loader
[ https://issues.apache.org/jira/browse/SPARK-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235649#comment-16235649 ] Apache Spark commented on SPARK-11421: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/19643 > Add the ability to add a jar to the current class loader > > > Key: SPARK-11421 > URL: https://issues.apache.org/jira/browse/SPARK-11421 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: holdenk >Priority: Minor > > addJar adds jars for future operations, but could also add to the current > class loader; this would be really useful most likely in Python & R, where > some included Python code may wish to add some jars. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
[ https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22306. - Resolution: Fixed Fix Version/s: 2.2.1 Issue resolved by pull request 19622 [https://github.com/apache/spark/pull/19622] > INFER_AND_SAVE overwrites important metadata in Parquet Metastore table > --- > > Key: SPARK-22306 > URL: https://issues.apache.org/jira/browse/SPARK-22306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet) > Spark 2.2.0 >Reporter: David Malinge >Priority: Critical > Fix For: 2.2.1 > > > I noticed some critical changes on my hive tables and realized that they were > caused by a simple select on SparkSQL. Looking at the logs, I found out that > this select was actually performing an update on the database "Saving > case-sensitive schema for table". > I then found out that Spark 2.2.0 introduces a new default value for > spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE > The issue is that this update changes critical metadata of the table, in > particular: > - changes the owner to the current user > - removes bucketing metadata (BUCKETING_COLS, SDS) > - removes sorting metadata (SORT_COLS) > Switching the property to: NEVER_INFER prevents the issue. > Also, note that the damage can be fix manually in Hive with e.g.: > {code:sql} > alter table [table_name] > clustered by ([col1], [col2]) > sorted by ([colA], [colB]) > into [n] buckets > {code} > *REPRODUCE (branch-2.2)* > In Spark 2.1.x (branch-2.1), NEVER_INFER is used. Spark 2.3 (master) branch > is good due to SPARK-17729. This is a regression on Spark 2.2 only. By > default, Parquet Hive table is affected and only Hive may suffer from this. 
> {code} > hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) > INTO 10 BUCKETS STORED AS PARQUET; > hive> INSERT INTO t VALUES('a','b'); > hive> DESC FORMATTED t; > ... > Num Buckets: 10 > Bucket Columns: [a, b] > Sort Columns: [Order(col:a, order:1), Order(col:b, order:1)] > scala> sql("SELECT * FROM t").show(false) > hive> DESC FORMATTED t; > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22306) INFER_AND_SAVE overwrites important metadata in Parquet Metastore table
[ https://issues.apache.org/jira/browse/SPARK-22306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22306: --- Assignee: Wenchen Fan > INFER_AND_SAVE overwrites important metadata in Parquet Metastore table > --- > > Key: SPARK-22306 > URL: https://issues.apache.org/jira/browse/SPARK-22306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Hive 2.3.0 (PostgresQL metastore, stored as Parquet) > Spark 2.2.0 >Reporter: David Malinge >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.2.1 > > > I noticed some critical changes on my hive tables and realized that they were > caused by a simple select on SparkSQL. Looking at the logs, I found out that > this select was actually performing an update on the database "Saving > case-sensitive schema for table". > I then found out that Spark 2.2.0 introduces a new default value for > spark.sql.hive.caseSensitiveInferenceMode (see SPARK-20888): INFER_AND_SAVE > The issue is that this update changes critical metadata of the table, in > particular: > - changes the owner to the current user > - removes bucketing metadata (BUCKETING_COLS, SDS) > - removes sorting metadata (SORT_COLS) > Switching the property to: NEVER_INFER prevents the issue. > Also, note that the damage can be fix manually in Hive with e.g.: > {code:sql} > alter table [table_name] > clustered by ([col1], [col2]) > sorted by ([colA], [colB]) > into [n] buckets > {code} > *REPRODUCE (branch-2.2)* > In Spark 2.1.x (branch-2.1), NEVER_INFER is used. Spark 2.3 (master) branch > is good due to SPARK-17729. This is a regression on Spark 2.2 only. By > default, Parquet Hive table is affected and only Hive may suffer from this. > {code} > hive> CREATE TABLE t(a string, b string) CLUSTERED BY (a, b) SORTED BY (a, b) > INTO 10 BUCKETS STORED AS PARQUET; > hive> INSERT INTO t VALUES('a','b'); > hive> DESC FORMATTED t; > ... 
> Num Buckets: 10 > Bucket Columns: [a, b] > Sort Columns: [Order(col:a, order:1), Order(col:b, order:1)] > scala> sql("SELECT * FROM t").show(false) > hive> DESC FORMATTED t; > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22428) Document spark properties for configuring the ContextCleaner
[ https://issues.apache.org/jira/browse/SPARK-22428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235559#comment-16235559 ] Sean Owen commented on SPARK-22428: --- It's probably OK to do so, but not all properties are meant to be guaranteed and documented as an API. > Document spark properties for configuring the ContextCleaner > > > Key: SPARK-22428 > URL: https://issues.apache.org/jira/browse/SPARK-22428 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Andreas Maier >Priority: Minor > > The spark properties for configuring the ContextCleaner as described on > https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-service-contextcleaner.html > are not documented in the official documentation at > https://spark.apache.org/docs/latest/configuration.html#available-properties > . > As a user I would like to have the following spark properties documented in > the official documentation: > {code:java} > spark.cleaner.periodicGC.interval > spark.cleaner.referenceTracking > spark.cleaner.referenceTracking.blocking > spark.cleaner.referenceTracking.blocking.shuffle > spark.cleaner.referenceTracking.cleanCheckpoints > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22428) Document spark properties for configuring the ContextCleaner
Andreas Maier created SPARK-22428: - Summary: Document spark properties for configuring the ContextCleaner Key: SPARK-22428 URL: https://issues.apache.org/jira/browse/SPARK-22428 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.2.0 Reporter: Andreas Maier Priority: Minor The spark properties for configuring the ContextCleaner as described on https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-service-contextcleaner.html are not documented in the official documentation at https://spark.apache.org/docs/latest/configuration.html#available-properties . As a user I would like to have the following spark properties documented in the official documentation: {code:java} spark.cleaner.periodicGC.interval spark.cleaner.referenceTracking spark.cleaner.referenceTracking.blocking spark.cleaner.referenceTracking.blocking.shuffle spark.cleaner.referenceTracking.cleanCheckpoints {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
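For reference, the requested properties with their defaults as reported by the linked gitbook page; the values below are taken from that third-party source rather than the official documentation, so treat them as unverified until documented upstream:

```properties
# spark-defaults.conf -- ContextCleaner settings with reported defaults
spark.cleaner.periodicGC.interval                   30min
spark.cleaner.referenceTracking                     true
spark.cleaner.referenceTracking.blocking            true
spark.cleaner.referenceTracking.blocking.shuffle    false
spark.cleaner.referenceTracking.cleanCheckpoints    false
```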
[jira] [Commented] (SPARK-22410) Excessive spill for Pyspark UDF when a row has shrunk
[ https://issues.apache.org/jira/browse/SPARK-22410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235522#comment-16235522 ] Apache Spark commented on SPARK-22410: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/19642 > Excessive spill for Pyspark UDF when a row has shrunk > - > > Key: SPARK-22410 > URL: https://issues.apache.org/jira/browse/SPARK-22410 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Reproduced on up-to-date master >Reporter: Clément Stenac >Priority: Minor > > Hi, > The following code processes 900KB of data and outputs around 2MB of data. > However, to process it, Spark needs to spill roughly 12 GB of data. > {code:python} > from pyspark.sql import SparkSession > from pyspark.sql.functions import * > from pyspark.sql.types import * > import json > ss = SparkSession.builder.getOrCreate() > # Create a few lines of data (5 lines). > # Each line is made of a string, and an array of 1 strings > # Total size of data is around 900 KB > lines_of_file = [ "this is a line" for x in xrange(1) ] > file_obj = [ "this_is_a_foldername/this_is_a_filename", lines_of_file ] > data = [ file_obj for x in xrange(5) ] > # Make a two-columns dataframe out of it > small_df = ss.sparkContext.parallelize(data).map(lambda x : (x[0], > x[1])).toDF(["file", "lines"]) > # We then explode the array, so we now have 5 rows in the dataframe, with > 2 columns, the 2nd > # column now has only "this is a line" as content > exploded = small_df.select("file", explode("lines")) > print("Exploded") > print(exploded.explain()) > # Now, just process it with a trivial Pyspark UDF that touches the first > column > # (the one which was not an array) > def split_key(s): > return s.split("/")[1] > split_key_udf = udf(split_key, StringType()) > with_filename = exploded.withColumn("filename", split_key_udf("file")) > # As expected, explain plan is very 
simple (BatchEval -> Explode -> Project > -> ScanExisting) > print(with_filename.explain()) > # Getting the head will spill around 12 GB of data > print(with_filename.head()) > {code} > The spill happens in the HybridRowQueue that is used to merge the part that > went through the Python worker and the part that didn't. > The problem comes from the fact that when it is added to the HybridRowQueue, > the UnsafeRow has a totalSizeInBytes of ~24 (seen by adding debug message > in HybridRowQueue), whereas, since it's after the explode, the actual size of > the row should be in the ~60 bytes range. > My understanding is that the row has retained the size it consumed *prior* to > the explode (at that time, the size of each of the 5 rows was indeed ~24 > bytes. > A workaround is to do exploded.cache() before calling the UDF. The fact of > going through the InMemoryColumnarTableScan "resets" the wrongful size of the > UnsafeRow. > Thanks! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22410) Excessive spill for Pyspark UDF when a row has shrunk
[ https://issues.apache.org/jira/browse/SPARK-22410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22410: Assignee: Apache Spark > Excessive spill for Pyspark UDF when a row has shrunk > - > > Key: SPARK-22410 > URL: https://issues.apache.org/jira/browse/SPARK-22410 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Reproduced on up-to-date master >Reporter: Clément Stenac >Assignee: Apache Spark >Priority: Minor > > Hi, > The following code processes 900KB of data and outputs around 2MB of data. > However, to process it, Spark needs to spill roughly 12 GB of data. > {code:python} > from pyspark.sql import SparkSession > from pyspark.sql.functions import * > from pyspark.sql.types import * > import json > ss = SparkSession.builder.getOrCreate() > # Create a few lines of data (5 lines). > # Each line is made of a string, and an array of 1 strings > # Total size of data is around 900 KB > lines_of_file = [ "this is a line" for x in xrange(1) ] > file_obj = [ "this_is_a_foldername/this_is_a_filename", lines_of_file ] > data = [ file_obj for x in xrange(5) ] > # Make a two-columns dataframe out of it > small_df = ss.sparkContext.parallelize(data).map(lambda x : (x[0], > x[1])).toDF(["file", "lines"]) > # We then explode the array, so we now have 5 rows in the dataframe, with > 2 columns, the 2nd > # column now has only "this is a line" as content > exploded = small_df.select("file", explode("lines")) > print("Exploded") > print(exploded.explain()) > # Now, just process it with a trivial Pyspark UDF that touches the first > column > # (the one which was not an array) > def split_key(s): > return s.split("/")[1] > split_key_udf = udf(split_key, StringType()) > with_filename = exploded.withColumn("filename", split_key_udf("file")) > # As expected, explain plan is very simple (BatchEval -> Explode -> Project > -> ScanExisting) > print(with_filename.explain()) > # Getting the 
head will spill around 12 GB of data > print(with_filename.head()) > {code} > The spill happens in the HybridRowQueue that is used to merge the part that > went through the Python worker and the part that didn't. > The problem comes from the fact that when it is added to the HybridRowQueue, > the UnsafeRow has a totalSizeInBytes of ~24 (seen by adding debug message > in HybridRowQueue), whereas, since it's after the explode, the actual size of > the row should be in the ~60 bytes range. > My understanding is that the row has retained the size it consumed *prior* to > the explode (at that time, the size of each of the 5 rows was indeed ~24 > bytes. > A workaround is to do exploded.cache() before calling the UDF. The fact of > going through the InMemoryColumnarTableScan "resets" the wrongful size of the > UnsafeRow. > Thanks! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22410) Excessive spill for Pyspark UDF when a row has shrunk
[ https://issues.apache.org/jira/browse/SPARK-22410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22410: Assignee: (was: Apache Spark) > Excessive spill for Pyspark UDF when a row has shrunk > - > > Key: SPARK-22410 > URL: https://issues.apache.org/jira/browse/SPARK-22410 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Reproduced on up-to-date master >Reporter: Clément Stenac >Priority: Minor > > Hi, > The following code processes 900KB of data and outputs around 2MB of data. > However, to process it, Spark needs to spill roughly 12 GB of data. > {code:python} > from pyspark.sql import SparkSession > from pyspark.sql.functions import * > from pyspark.sql.types import * > import json > ss = SparkSession.builder.getOrCreate() > # Create a few lines of data (5 lines). > # Each line is made of a string, and an array of 1 strings > # Total size of data is around 900 KB > lines_of_file = [ "this is a line" for x in xrange(1) ] > file_obj = [ "this_is_a_foldername/this_is_a_filename", lines_of_file ] > data = [ file_obj for x in xrange(5) ] > # Make a two-columns dataframe out of it > small_df = ss.sparkContext.parallelize(data).map(lambda x : (x[0], > x[1])).toDF(["file", "lines"]) > # We then explode the array, so we now have 5 rows in the dataframe, with > 2 columns, the 2nd > # column now has only "this is a line" as content > exploded = small_df.select("file", explode("lines")) > print("Exploded") > print(exploded.explain()) > # Now, just process it with a trivial Pyspark UDF that touches the first > column > # (the one which was not an array) > def split_key(s): > return s.split("/")[1] > split_key_udf = udf(split_key, StringType()) > with_filename = exploded.withColumn("filename", split_key_udf("file")) > # As expected, explain plan is very simple (BatchEval -> Explode -> Project > -> ScanExisting) > print(with_filename.explain()) > # Getting the head will spill 
around 12 GB of data > print(with_filename.head()) > {code} > The spill happens in the HybridRowQueue that is used to merge the part that > went through the Python worker and the part that didn't. > The problem comes from the fact that when it is added to the HybridRowQueue, > the UnsafeRow has a totalSizeInBytes of ~24 (seen by adding debug message > in HybridRowQueue), whereas, since it's after the explode, the actual size of > the row should be in the ~60 bytes range. > My understanding is that the row has retained the size it consumed *prior* to > the explode (at that time, the size of each of the 5 rows was indeed ~24 > bytes. > A workaround is to do exploded.cache() before calling the UDF. The fact of > going through the InMemoryColumnarTableScan "resets" the wrongful size of the > UnsafeRow. > Thanks! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22426) Spark AM launching containers on node where External spark shuffle service failed to initialize
[ https://issues.apache.org/jira/browse/SPARK-22426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235500#comment-16235500 ] Prabhu Joseph commented on SPARK-22426: --- The node and the NodeManager process are fine; the external Spark shuffle service failed to initialize on that NodeManager for some reason, e.g. SPARK-17433 or SPARK-15519 > Spark AM launching containers on node where External spark shuffle service > failed to initialize > --- > > Key: SPARK-22426 > URL: https://issues.apache.org/jira/browse/SPARK-22426 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph >Priority: Major > > When Spark External Shuffle Service on a NodeManager fails, the remote > executors will fail while fetching the data from the executors launched on > this Node. Spark ApplicationMaster should not launch containers on this Node > or not use external shuffle service. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22426) Spark AM launching containers on node where External spark shuffle service failed to initialize
[ https://issues.apache.org/jira/browse/SPARK-22426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235461#comment-16235461 ] Sean Owen commented on SPARK-22426: --- If the node has failed, YARN already can't or won't launch anything on that NodeManager. Are you saying something slightly different? > Spark AM launching containers on node where External spark shuffle service > failed to initialize > --- > > Key: SPARK-22426 > URL: https://issues.apache.org/jira/browse/SPARK-22426 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph >Priority: Major > > When Spark External Shuffle Service on a NodeManager fails, the remote > executors will fail while fetching the data from the executors launched on > this Node. Spark ApplicationMaster should not launch containers on this Node > or not use external shuffle service. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22427) StackOverFlowError when using FPGrowth
lyt created SPARK-22427: --- Summary: StackOverFlowError when using FPGrowth Key: SPARK-22427 URL: https://issues.apache.org/jira/browse/SPARK-22427 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.2.0 Environment: Centos Linux 3.10.0-327.el7.x86_64 java 1.8.0.111 spark 2.2.0 Reporter: lyt Priority: Normal code part: val path = jobConfig.getString("hdfspath") val vectordata = sc.sparkContext.textFile(path) val finaldata = sc.createDataset(vectordata.map(obj => { obj.split(" ") }).filter(arr => arr.length > 0)).toDF("items") val fpg = new FPGrowth() fpg.setMinSupport(minSupport).setItemsCol("items").setMinConfidence(minConfidence) val train = fpg.fit(finaldata) print(train.freqItemsets.count()) print(train.associationRules.count()) train.save("/tmp/FPGModel") And encountered following exception: Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.collect(RDD.scala:935) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278) at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2430) at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2429) at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836) at org.apache.spark.sql.Dataset.count(Dataset.scala:2429) at DataMining.FPGrowth$.runJob(FPGrowth.scala:116) at DataMining.testFPG$.main(FPGrowth.scala:36) at DataMining.testFPG.main(FPGrowth.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at 
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.StackOverflowError at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:616) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:36) at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33) at com.esotericsoftware.kr
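The `Caused by: java.lang.StackOverflowError` inside Kryo's recursive `writeClassAndObject` suggests the default JVM thread stack is too small for the serialized structure. A hedged sketch of the commonly suggested mitigation (raising the stack size, as proposed in the thread; this is an assumption, not a confirmed fix for this report):

```scala
// Raise the JVM thread stack size for driver and executors. The driver option
// must be passed at submit time (the driver JVM is already running once
// application code executes), e.g.:
//
//   spark-submit --conf spark.driver.extraJavaOptions=-Xss10m \
//                --conf spark.executor.extraJavaOptions=-Xss10m ...
//
// The executor option can also be set programmatically before the context starts:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-Xss10m") // 10 MB stacks; size is illustrative

val spark = SparkSession.builder().config(conf).getOrCreate()
```

If the error persists, the recursion depth may scale with the data, and a larger `-Xss` value only defers the failure.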
[jira] [Created] (SPARK-22426) Spark AM launching containers on node where External spark shuffle service failed to initialize
Prabhu Joseph created SPARK-22426: - Summary: Spark AM launching containers on node where External spark shuffle service failed to initialize Key: SPARK-22426 URL: https://issues.apache.org/jira/browse/SPARK-22426 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.6.3 Reporter: Prabhu Joseph Priority: Major When Spark External Shuffle Service on a NodeManager fails, the remote executors will fail while fetching the data from the executors launched on this Node. Spark ApplicationMaster should not launch containers on this Node or not use external shuffle service. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22426) Spark AM launching containers on node where External spark shuffle service failed to initialize
[ https://issues.apache.org/jira/browse/SPARK-22426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated SPARK-22426: -- Component/s: YARN > Spark AM launching containers on node where External spark shuffle service > failed to initialize > --- > > Key: SPARK-22426 > URL: https://issues.apache.org/jira/browse/SPARK-22426 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph >Priority: Major > > When Spark External Shuffle Service on a NodeManager fails, the remote > executors will fail while fetching the data from the executors launched on > this Node. Spark ApplicationMaster should not launch containers on this Node > or not use external shuffle service. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21911) Parallel Model Evaluation for ML Tuning: PySpark
[ https://issues.apache.org/jira/browse/SPARK-21911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235444#comment-16235444 ] Apache Spark commented on SPARK-21911: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/19641 > Parallel Model Evaluation for ML Tuning: PySpark > > > Key: SPARK-21911 > URL: https://issues.apache.org/jira/browse/SPARK-21911 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 2.3.0 > > > Add parallelism support for ML tuning in pyspark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22102) Reusing CliSessionState didn't set correct METASTOREWAREHOUSE
[ https://issues.apache.org/jira/browse/SPARK-22102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-22102. - Resolution: Cannot Reproduce Master branch cannot reproduce > Reusing CliSessionState didn't set correct METASTOREWAREHOUSE > - > > Key: SPARK-22102 > URL: https://issues.apache.org/jira/browse/SPARK-22102 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Yuming Wang >Priority: Major > > It shows the warehouse dir is > {{file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse}}, > but actually the warehouse dir is {{/user/hive/warehouse}} when create table. > {noformat} > [root@wangyuming01 spark-2.3.0-SNAPSHOT-bin-2.6.5]# bin/spark-sql > 17/09/22 21:32:40 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > log4j:WARN No appenders could be found for logger > (org.apache.hadoop.conf.Configuration). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more > info. 
> Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 17/09/22 21:32:45 INFO SparkContext: Running Spark version 2.3.0-SNAPSHOT > 17/09/22 21:32:45 INFO SparkContext: Submitted application: > SparkSQL::192.168.77.55 > 17/09/22 21:32:45 INFO SecurityManager: Changing view acls to: root > 17/09/22 21:32:45 INFO SecurityManager: Changing modify acls to: root > 17/09/22 21:32:45 INFO SecurityManager: Changing view acls groups to: > 17/09/22 21:32:45 INFO SecurityManager: Changing modify acls groups to: > 17/09/22 21:32:45 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(root); groups > with view permissions: Set(); users with modify permissions: Set(root); > groups with modify permissions: Set() > 17/09/22 21:32:45 INFO Utils: Successfully started service 'sparkDriver' on > port 43676. > 17/09/22 21:32:45 INFO SparkEnv: Registering MapOutputTracker > 17/09/22 21:32:45 INFO SparkEnv: Registering BlockManagerMaster > 17/09/22 21:32:45 INFO BlockManagerMasterEndpoint: Using > org.apache.spark.storage.DefaultTopologyMapper for getting topology > information > 17/09/22 21:32:45 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint > up > 17/09/22 21:32:45 INFO DiskBlockManager: Created local directory at > /tmp/blockmgr-f536509f-4e3e-4e08-ae7b-8d9499f8e4a4 > 17/09/22 21:32:45 INFO MemoryStore: MemoryStore started with capacity 366.3 MB > 17/09/22 21:32:45 INFO SparkEnv: Registering OutputCommitCoordinator > 17/09/22 21:32:45 WARN Utils: Service 'SparkUI' could not bind on port 4040. > Attempting port 4041. > 17/09/22 21:32:45 INFO Utils: Successfully started service 'SparkUI' on port > 4041. 
> 17/09/22 21:32:45 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at > http://wangyuming01:4041 > 17/09/22 21:32:45 INFO Executor: Starting executor ID driver on host localhost > 17/09/22 21:32:45 INFO Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44426. > 17/09/22 21:32:45 INFO NettyBlockTransferService: Server created on > wangyuming01:44426 > 17/09/22 21:32:45 INFO BlockManager: Using > org.apache.spark.storage.RandomBlockReplicationPolicy for block replication > policy > 17/09/22 21:32:45 INFO BlockManagerMaster: Registering BlockManager > BlockManagerId(driver, wangyuming01, 44426, None) > 17/09/22 21:32:45 INFO BlockManagerMasterEndpoint: Registering block manager > wangyuming01:44426 with 366.3 MB RAM, BlockManagerId(driver, wangyuming01, > 44426, None) > 17/09/22 21:32:45 INFO BlockManagerMaster: Registered BlockManager > BlockManagerId(driver, wangyuming01, 44426, None) > 17/09/22 21:32:45 INFO BlockManager: Initialized BlockManager: > BlockManagerId(driver, wangyuming01, 44426, None) > 17/09/22 21:32:45 INFO SharedState: Setting hive.metastore.warehouse.dir > ('null') to the value of spark.sql.warehouse.dir > ('file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse'). > 17/09/22 21:32:45 INFO SharedState: Warehouse path is > 'file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse'. > 17/09/22 21:32:46 INFO HiveUtils: Initializing HiveMetastoreConnection > version 1.2.1 using Spark classes. > 17/09/22 21:32:46 INFO HiveClientImpl: Warehouse location for Hive client > (version 1.2.2) is > file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse > 17/09/22 21:32:46 INFO metastore: Mestastore configuration > hive.metastore.warehouse.dir changed from /user/hive/warehouse to > file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse > 17/09/22 21:
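As the log above shows, `SharedState` overrides `hive.metastore.warehouse.dir` with the default local `spark-warehouse` path when `spark.sql.warehouse.dir` is unset. A possible mitigation (an assumption, not verified against this report) is to set the warehouse directory explicitly before the session is created:

```scala
// Sketch: pin spark.sql.warehouse.dir so SharedState does not fall back to the
// local spark-warehouse default. The path below mirrors the Hive default from
// the report and is illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("warehouse-dir-example")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()
```

The same setting can be passed to the CLI, e.g. `bin/spark-sql --conf spark.sql.warehouse.dir=/user/hive/warehouse`.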
[jira] [Reopened] (SPARK-22102) Reusing CliSessionState didn't set correct METASTOREWAREHOUSE
[ https://issues.apache.org/jira/browse/SPARK-22102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reopened SPARK-22102: - > Reusing CliSessionState didn't set correct METASTOREWAREHOUSE > - > > Key: SPARK-22102 > URL: https://issues.apache.org/jira/browse/SPARK-22102 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Yuming Wang >Priority: Major > > It shows the warehouse dir is > {{file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse}}, > but actually the warehouse dir is {{/user/hive/warehouse}} when create table. > {noformat} > [root@wangyuming01 spark-2.3.0-SNAPSHOT-bin-2.6.5]# bin/spark-sql > 17/09/22 21:32:40 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > log4j:WARN No appenders could be found for logger > (org.apache.hadoop.conf.Configuration). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more > info. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 17/09/22 21:32:45 INFO SparkContext: Running Spark version 2.3.0-SNAPSHOT > 17/09/22 21:32:45 INFO SparkContext: Submitted application: > SparkSQL::192.168.77.55 > 17/09/22 21:32:45 INFO SecurityManager: Changing view acls to: root > 17/09/22 21:32:45 INFO SecurityManager: Changing modify acls to: root > 17/09/22 21:32:45 INFO SecurityManager: Changing view acls groups to: > 17/09/22 21:32:45 INFO SecurityManager: Changing modify acls groups to: > 17/09/22 21:32:45 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(root); groups > with view permissions: Set(); users with modify permissions: Set(root); > groups with modify permissions: Set() > 17/09/22 21:32:45 INFO Utils: Successfully started service 'sparkDriver' on > port 43676. 
> 17/09/22 21:32:45 INFO SparkEnv: Registering MapOutputTracker > 17/09/22 21:32:45 INFO SparkEnv: Registering BlockManagerMaster > 17/09/22 21:32:45 INFO BlockManagerMasterEndpoint: Using > org.apache.spark.storage.DefaultTopologyMapper for getting topology > information > 17/09/22 21:32:45 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint > up > 17/09/22 21:32:45 INFO DiskBlockManager: Created local directory at > /tmp/blockmgr-f536509f-4e3e-4e08-ae7b-8d9499f8e4a4 > 17/09/22 21:32:45 INFO MemoryStore: MemoryStore started with capacity 366.3 MB > 17/09/22 21:32:45 INFO SparkEnv: Registering OutputCommitCoordinator > 17/09/22 21:32:45 WARN Utils: Service 'SparkUI' could not bind on port 4040. > Attempting port 4041. > 17/09/22 21:32:45 INFO Utils: Successfully started service 'SparkUI' on port > 4041. > 17/09/22 21:32:45 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at > http://wangyuming01:4041 > 17/09/22 21:32:45 INFO Executor: Starting executor ID driver on host localhost > 17/09/22 21:32:45 INFO Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44426. 
> 17/09/22 21:32:45 INFO NettyBlockTransferService: Server created on > wangyuming01:44426 > 17/09/22 21:32:45 INFO BlockManager: Using > org.apache.spark.storage.RandomBlockReplicationPolicy for block replication > policy > 17/09/22 21:32:45 INFO BlockManagerMaster: Registering BlockManager > BlockManagerId(driver, wangyuming01, 44426, None) > 17/09/22 21:32:45 INFO BlockManagerMasterEndpoint: Registering block manager > wangyuming01:44426 with 366.3 MB RAM, BlockManagerId(driver, wangyuming01, > 44426, None) > 17/09/22 21:32:45 INFO BlockManagerMaster: Registered BlockManager > BlockManagerId(driver, wangyuming01, 44426, None) > 17/09/22 21:32:45 INFO BlockManager: Initialized BlockManager: > BlockManagerId(driver, wangyuming01, 44426, None) > 17/09/22 21:32:45 INFO SharedState: Setting hive.metastore.warehouse.dir > ('null') to the value of spark.sql.warehouse.dir > ('file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse'). > 17/09/22 21:32:45 INFO SharedState: Warehouse path is > 'file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse'. > 17/09/22 21:32:46 INFO HiveUtils: Initializing HiveMetastoreConnection > version 1.2.1 using Spark classes. > 17/09/22 21:32:46 INFO HiveClientImpl: Warehouse location for Hive client > (version 1.2.2) is > file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse > 17/09/22 21:32:46 INFO metastore: Mestastore configuration > hive.metastore.warehouse.dir changed from /user/hive/warehouse to > file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse > 17/09/22 21:32:46 INFO HiveMetaStore: 0: Shutting down the object store... >
[jira] [Resolved] (SPARK-22102) Reusing CliSessionState didn't set correct METASTOREWAREHOUSE
[ https://issues.apache.org/jira/browse/SPARK-22102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-22102. - Resolution: Fixed > Reusing CliSessionState didn't set correct METASTOREWAREHOUSE > - > > Key: SPARK-22102 > URL: https://issues.apache.org/jira/browse/SPARK-22102 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Yuming Wang >Priority: Major > > It shows the warehouse dir is > {{file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse}}, > but actually the warehouse dir is {{/user/hive/warehouse}} when create table. > {noformat} > [root@wangyuming01 spark-2.3.0-SNAPSHOT-bin-2.6.5]# bin/spark-sql > 17/09/22 21:32:40 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > log4j:WARN No appenders could be found for logger > (org.apache.hadoop.conf.Configuration). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more > info. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 17/09/22 21:32:45 INFO SparkContext: Running Spark version 2.3.0-SNAPSHOT > 17/09/22 21:32:45 INFO SparkContext: Submitted application: > SparkSQL::192.168.77.55 > 17/09/22 21:32:45 INFO SecurityManager: Changing view acls to: root > 17/09/22 21:32:45 INFO SecurityManager: Changing modify acls to: root > 17/09/22 21:32:45 INFO SecurityManager: Changing view acls groups to: > 17/09/22 21:32:45 INFO SecurityManager: Changing modify acls groups to: > 17/09/22 21:32:45 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(root); groups > with view permissions: Set(); users with modify permissions: Set(root); > groups with modify permissions: Set() > 17/09/22 21:32:45 INFO Utils: Successfully started service 'sparkDriver' on > port 43676. 
> 17/09/22 21:32:45 INFO SparkEnv: Registering MapOutputTracker > 17/09/22 21:32:45 INFO SparkEnv: Registering BlockManagerMaster > 17/09/22 21:32:45 INFO BlockManagerMasterEndpoint: Using > org.apache.spark.storage.DefaultTopologyMapper for getting topology > information > 17/09/22 21:32:45 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint > up > 17/09/22 21:32:45 INFO DiskBlockManager: Created local directory at > /tmp/blockmgr-f536509f-4e3e-4e08-ae7b-8d9499f8e4a4 > 17/09/22 21:32:45 INFO MemoryStore: MemoryStore started with capacity 366.3 MB > 17/09/22 21:32:45 INFO SparkEnv: Registering OutputCommitCoordinator > 17/09/22 21:32:45 WARN Utils: Service 'SparkUI' could not bind on port 4040. > Attempting port 4041. > 17/09/22 21:32:45 INFO Utils: Successfully started service 'SparkUI' on port > 4041. > 17/09/22 21:32:45 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at > http://wangyuming01:4041 > 17/09/22 21:32:45 INFO Executor: Starting executor ID driver on host localhost > 17/09/22 21:32:45 INFO Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44426. 
> 17/09/22 21:32:45 INFO NettyBlockTransferService: Server created on > wangyuming01:44426 > 17/09/22 21:32:45 INFO BlockManager: Using > org.apache.spark.storage.RandomBlockReplicationPolicy for block replication > policy > 17/09/22 21:32:45 INFO BlockManagerMaster: Registering BlockManager > BlockManagerId(driver, wangyuming01, 44426, None) > 17/09/22 21:32:45 INFO BlockManagerMasterEndpoint: Registering block manager > wangyuming01:44426 with 366.3 MB RAM, BlockManagerId(driver, wangyuming01, > 44426, None) > 17/09/22 21:32:45 INFO BlockManagerMaster: Registered BlockManager > BlockManagerId(driver, wangyuming01, 44426, None) > 17/09/22 21:32:45 INFO BlockManager: Initialized BlockManager: > BlockManagerId(driver, wangyuming01, 44426, None) > 17/09/22 21:32:45 INFO SharedState: Setting hive.metastore.warehouse.dir > ('null') to the value of spark.sql.warehouse.dir > ('file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse'). > 17/09/22 21:32:45 INFO SharedState: Warehouse path is > 'file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse'. > 17/09/22 21:32:46 INFO HiveUtils: Initializing HiveMetastoreConnection > version 1.2.1 using Spark classes. > 17/09/22 21:32:46 INFO HiveClientImpl: Warehouse location for Hive client > (version 1.2.2) is > file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse > 17/09/22 21:32:46 INFO metastore: Mestastore configuration > hive.metastore.warehouse.dir changed from /user/hive/warehouse to > file:/root/create/spark/spark-2.3.0-SNAPSHOT-bin-2.6.5/spark-warehouse > 17/09/22 21:32:46 INFO HiveMetaStore: 0: Shutting down
[jira] [Commented] (SPARK-16986) "Started" time, "Completed" time and "Last Updated" time in history server UI are not user local time
[ https://issues.apache.org/jira/browse/SPARK-16986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235406#comment-16235406 ] Apache Spark commented on SPARK-16986: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/19640 > "Started" time, "Completed" time and "Last Updated" time in history server UI > are not user local time > - > > Key: SPARK-16986 > URL: https://issues.apache.org/jira/browse/SPARK-16986 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Weiqing Yang >Priority: Minor > > Currently, "Started" time, "Completed" time and "Last Updated" time in > history server UI are GMT. They should be the user local time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22425) add output files information to EventLogger
Long Tian created SPARK-22425: - Summary: add output files information to EventLogger Key: SPARK-22425 URL: https://issues.apache.org/jira/browse/SPARK-22425 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.0 Reporter: Long Tian Priority: Normal We can get all the input files from *EventLogger* when *spark.eventLog.enabled* is *true*, but there is no output file information. Is it possible to add output file information to *EventLogger*? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
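For context, a minimal sketch of the configuration the reporter refers to; the log directory path is illustrative, and per the reporter the resulting event log carries input-file information but no output-file equivalent:

```scala
// Enable event logging so EventLoggingListener records job/stage/task events
// to a persistent directory (readable later by the history server).
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///spark-history") // illustrative path

val spark = SparkSession.builder().config(conf).getOrCreate()
```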
[jira] [Comment Edited] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235384#comment-16235384 ] chengning edited comment on SPARK-22424 at 11/2/17 8:41 AM: sorry, my picture not display, I post it again. [^1.jpg], as shows in this picture, the batch "2017/11/01 16:40:55" not finished was (Author: chengning): sorry, my picture not display, I post it again. [^1.jpg] > Task not finished for a long time in monitor UI, but I found it finished in > logs > > > Key: SPARK-22424 > URL: https://issues.apache.org/jira/browse/SPARK-22424 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: chengning >Priority: Blocking > Attachments: 1.jpg, 1.png, C33oL.jpg > > > Task not finished for a long time in monitor UI, but I found it finished in > logs > Thanks a lot. > !https://i.stack.imgur.com/C33oL.jpg! > !C33oL.jpg|thumbnail! > executor log: > 17/09/29 17:32:28 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 213492 > 17/09/29 17:32:28 INFO executor.Executor: Running task 52.0 in stage 2468.0 > (TID 213492) > 17/09/29 17:32:28 INFO storage.ShuffleBlockFetcherIterator: Getting 30 > non-empty blocks out of 30 blocks > 17/09/29 17:32:28 INFO storage.ShuffleBlockFetcherIterator: Started 29 remote > fetches in 1 ms > 17:32:28.447: tcPartition=7 ms > 17/09/29 17:32:28 INFO executor.Executor: Finished task 52.0 in stage 2468.0 > (TID 213492). 
2755 bytes result sent to driver > driver log:: > 17/09/29 17:32:28 INFO scheduler.TaskSetManager: Starting task 52.0 in stage > 2468.0 (TID 213492, HMGQXD2, executor 1, partition 52, PROCESS_LOCAL, 6386 > bytes) > 17/09/29 17:32:28 INFO scheduler.TaskSetManager: Finished task 52.0 in stage > 2468.0 (TID 213492) in 24 ms on HMGQXD2 (executor 1) (53/200) > 17/09/29 17:32:28 INFO cluster.YarnScheduler: Removed TaskSet 2468.0, whose > tasks have all completed, from pool > 17/09/29 17:32:28 INFO scheduler.DAGScheduler: ResultStage 2468 > (foreachPartition at Counter2.java:152) finished in 0.255 s > 17/09/29 17:32:28 INFO scheduler.DAGScheduler: Job 1647 finished: > foreachPartition at Counter2.java:152, took 0.415256 s -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235384#comment-16235384 ] chengning edited comment on SPARK-22424 at 11/2/17 8:31 AM: sorry, my picture not display, I post it again. [^1.jpg] was (Author: chengning): !1.jpg|thumbnail! sorry, my picture not display, I post it again. > Task not finished for a long time in monitor UI, but I found it finished in > logs > > > Key: SPARK-22424 > URL: https://issues.apache.org/jira/browse/SPARK-22424 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: chengning >Priority: Blocking > Attachments: 1.jpg, 1.png, C33oL.jpg > > > Task not finished for a long time in monitor UI, but I found it finished in > logs > Thanks a lot. > !https://i.stack.imgur.com/C33oL.jpg! > !C33oL.jpg|thumbnail! > executor log: > 17/09/29 17:32:28 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 213492 > 17/09/29 17:32:28 INFO executor.Executor: Running task 52.0 in stage 2468.0 > (TID 213492) > 17/09/29 17:32:28 INFO storage.ShuffleBlockFetcherIterator: Getting 30 > non-empty blocks out of 30 blocks > 17/09/29 17:32:28 INFO storage.ShuffleBlockFetcherIterator: Started 29 remote > fetches in 1 ms > 17:32:28.447: tcPartition=7 ms > 17/09/29 17:32:28 INFO executor.Executor: Finished task 52.0 in stage 2468.0 > (TID 213492). 
2755 bytes result sent to driver > driver log:: > 17/09/29 17:32:28 INFO scheduler.TaskSetManager: Starting task 52.0 in stage > 2468.0 (TID 213492, HMGQXD2, executor 1, partition 52, PROCESS_LOCAL, 6386 > bytes) > 17/09/29 17:32:28 INFO scheduler.TaskSetManager: Finished task 52.0 in stage > 2468.0 (TID 213492) in 24 ms on HMGQXD2 (executor 1) (53/200) > 17/09/29 17:32:28 INFO cluster.YarnScheduler: Removed TaskSet 2468.0, whose > tasks have all completed, from pool > 17/09/29 17:32:28 INFO scheduler.DAGScheduler: ResultStage 2468 > (foreachPartition at Counter2.java:152) finished in 0.255 s > 17/09/29 17:32:28 INFO scheduler.DAGScheduler: Job 1647 finished: > foreachPartition at Counter2.java:152, took 0.415256 s -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235384#comment-16235384 ] chengning commented on SPARK-22424: --- !1.jpg|thumbnail! sorry, my picture not display, I post it again. > Task not finished for a long time in monitor UI, but I found it finished in > logs > > > Key: SPARK-22424 > URL: https://issues.apache.org/jira/browse/SPARK-22424 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: chengning >Priority: Blocking > Attachments: 1.jpg, 1.png, C33oL.jpg > > > Task not finished for a long time in monitor UI, but I found it finished in > logs > Thanks a lot. > !https://i.stack.imgur.com/C33oL.jpg! > !C33oL.jpg|thumbnail! > executor log: > 17/09/29 17:32:28 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 213492 > 17/09/29 17:32:28 INFO executor.Executor: Running task 52.0 in stage 2468.0 > (TID 213492) > 17/09/29 17:32:28 INFO storage.ShuffleBlockFetcherIterator: Getting 30 > non-empty blocks out of 30 blocks > 17/09/29 17:32:28 INFO storage.ShuffleBlockFetcherIterator: Started 29 remote > fetches in 1 ms > 17:32:28.447: tcPartition=7 ms > 17/09/29 17:32:28 INFO executor.Executor: Finished task 52.0 in stage 2468.0 > (TID 213492). 
2755 bytes result sent to driver > driver log:: > 17/09/29 17:32:28 INFO scheduler.TaskSetManager: Starting task 52.0 in stage > 2468.0 (TID 213492, HMGQXD2, executor 1, partition 52, PROCESS_LOCAL, 6386 > bytes) > 17/09/29 17:32:28 INFO scheduler.TaskSetManager: Finished task 52.0 in stage > 2468.0 (TID 213492) in 24 ms on HMGQXD2 (executor 1) (53/200) > 17/09/29 17:32:28 INFO cluster.YarnScheduler: Removed TaskSet 2468.0, whose > tasks have all completed, from pool > 17/09/29 17:32:28 INFO scheduler.DAGScheduler: ResultStage 2468 > (foreachPartition at Counter2.java:152) finished in 0.255 s > 17/09/29 17:32:28 INFO scheduler.DAGScheduler: Job 1647 finished: > foreachPartition at Counter2.java:152, took 0.415256 s -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chengning updated SPARK-22424: -- Attachment: 1.jpg
[jira] [Comment Edited] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235379#comment-16235379 ] chengning edited comment on SPARK-22424 at 11/2/17 8:25 AM: I have another picture that shows it clearly: !1.png|thumbnail!
executor:
17/11/01 16:40:55 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 640603
17/11/01 16:40:55 INFO executor.Executor: Running task 3.0 in stage 8218.0 (TID 640603)
17/11/01 16:40:55 INFO spark.MapOutputTrackerWorker: Updating epoch to 2319 and clearing cache
17/11/01 16:40:55 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 8218
17/11/01 16:40:55 INFO memory.MemoryStore: Block broadcast_8218_piece0 stored as bytes in memory (estimated size 15.2 KB, free 2.2 GB)
17/11/01 16:40:55 INFO broadcast.TorrentBroadcast: Reading broadcast variable 8218 took 6 ms
17/11/01 16:40:55 INFO memory.MemoryStore: Block broadcast_8218 stored as values in memory (estimated size 31.5 KB, free 2.2 GB)
17/11/01 16:40:55 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 2318, fetching them
17/11/01 16:40:55 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@10.110.155.57:33084)
17/11/01 16:40:55 INFO spark.MapOutputTrackerWorker: Got the output locations
17/11/01 16:40:55 INFO storage.ShuffleBlockFetcherIterator: Getting 28 non-empty blocks out of 30 blocks
17/11/01 16:40:55 INFO storage.ShuffleBlockFetcherIterator: Started 27 remote fetches in 3 ms
17/11/01 16:40:55 INFO codegen.CodeGenerator: Code generated in 21.652093 ms
17/11/01 16:40:55 INFO executor.Executor: Finished task 3.0 in stage 8218.0 (TID 640603). 3554 bytes result sent to driver
17/11/01 16:40:55 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 8218.0 (TID 640603, Letv6CU621YYPS, executor 12, partition 3, PROCESS_LOCAL, 6324 bytes)
17/11/01 16:40:55 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 8218.0 (TID 640603) in 167 ms on Letv6CU621YYPS (executor 12) (16/200)
17/11/01 16:40:55 ERROR scheduler.LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
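The LiveListenerBus error above means the driver is dropping SparkListener events, which is exactly how a task can show as unfinished in the monitor UI even though the executor and driver logs show it completed: the UI is fed by those dropped events. In Spark 2.1/2.2 the listener bus queue capacity is controlled by `spark.scheduler.listenerbus.eventqueue.size` (default 10000). A sketch of raising it at submit time; the application class, jar name, and chosen value are illustrative, not from this issue:

```shell
# Hypothetical submit command; only the --conf line is the point here.
# A larger queue costs driver memory; if events are still dropped, a slow
# SparkListener is the real bottleneck.
spark-submit \
  --class com.example.Counter2 \
  --conf spark.scheduler.listenerbus.eventqueue.size=100000 \
  counter-app.jar
```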
[jira] [Commented] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235381#comment-16235381 ] Sean Owen commented on SPARK-22424: --- I'm not following. You're circling different tasks. But again the one you mention in the logs shows as completed in both places.
[jira] [Commented] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235379#comment-16235379 ] chengning commented on SPARK-22424: --- I have another picture that shows it clearly: !1.png|thumbnail!
executor:
17/11/01 16:40:55 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 640603
17/11/01 16:40:55 INFO executor.Executor: Running task 3.0 in stage 8218.0 (TID 640603)
17/11/01 16:40:55 INFO spark.MapOutputTrackerWorker: Updating epoch to 2319 and clearing cache
17/11/01 16:40:55 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 8218
17/11/01 16:40:55 INFO memory.MemoryStore: Block broadcast_8218_piece0 stored as bytes in memory (estimated size 15.2 KB, free 2.2 GB)
17/11/01 16:40:55 INFO broadcast.TorrentBroadcast: Reading broadcast variable 8218 took 6 ms
17/11/01 16:40:55 INFO memory.MemoryStore: Block broadcast_8218 stored as values in memory (estimated size 31.5 KB, free 2.2 GB)
17/11/01 16:40:55 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 2318, fetching them
17/11/01 16:40:55 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@10.110.155.57:33084)
17/11/01 16:40:55 INFO spark.MapOutputTrackerWorker: Got the output locations
17/11/01 16:40:55 INFO storage.ShuffleBlockFetcherIterator: Getting 28 non-empty blocks out of 30 blocks
17/11/01 16:40:55 INFO storage.ShuffleBlockFetcherIterator: Started 27 remote fetches in 3 ms
17/11/01 16:40:55 INFO codegen.CodeGenerator: Code generated in 21.652093 ms
17/11/01 16:40:55 INFO executor.Executor: Finished task 3.0 in stage 8218.0 (TID 640603). 3554 bytes result sent to driver
17/11/01 16:40:55 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 8218.0 (TID 640603, Letv6CU621YYPS, executor 12, partition 3, PROCESS_LOCAL, 6324 bytes)
17/11/01 16:40:55 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 8218.0 (TID 640603) in 167 ms on Letv6CU621YYPS (executor 12) (16/200)
17/11/01 16:40:55 ERROR scheduler.LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
[jira] [Updated] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chengning updated SPARK-22424: -- Attachment: 1.png
[jira] [Commented] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235365#comment-16235365 ] chengning commented on SPARK-22424: --- Oh, I see that the state really is SUCCESS, but the Event Timeline shows it as not executed; I guess that is why the batch "2017/09/29 17:32:28" did not finish.
[jira] [Updated] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chengning updated SPARK-22424: -- Description:
Task not finished for a long time in monitor UI, but I found it finished in logs
Thanks a lot.
!https://i.stack.imgur.com/C33oL.jpg!
!C33oL.jpg|thumbnail!
executor log:
17/09/29 17:32:28 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 213492
17/09/29 17:32:28 INFO executor.Executor: Running task 52.0 in stage 2468.0 (TID 213492)
17/09/29 17:32:28 INFO storage.ShuffleBlockFetcherIterator: Getting 30 non-empty blocks out of 30 blocks
17/09/29 17:32:28 INFO storage.ShuffleBlockFetcherIterator: Started 29 remote fetches in 1 ms
17:32:28.447: tcPartition=7 ms
17/09/29 17:32:28 INFO executor.Executor: Finished task 52.0 in stage 2468.0 (TID 213492). 2755 bytes result sent to driver
driver log:
17/09/29 17:32:28 INFO scheduler.TaskSetManager: Starting task 52.0 in stage 2468.0 (TID 213492, HMGQXD2, executor 1, partition 52, PROCESS_LOCAL, 6386 bytes)
17/09/29 17:32:28 INFO scheduler.TaskSetManager: Finished task 52.0 in stage 2468.0 (TID 213492) in 24 ms on HMGQXD2 (executor 1) (53/200)
17/09/29 17:32:28 INFO cluster.YarnScheduler: Removed TaskSet 2468.0, whose tasks have all completed, from pool
17/09/29 17:32:28 INFO scheduler.DAGScheduler: ResultStage 2468 (foreachPartition at Counter2.java:152) finished in 0.255 s
17/09/29 17:32:28 INFO scheduler.DAGScheduler: Job 1647 finished: foreachPartition at Counter2.java:152, took 0.415256 s
was: (identical, minus the !C33oL.jpg|thumbnail! line)
[jira] [Updated] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chengning updated SPARK-22424: -- Attachment: C33oL.jpg
[jira] [Commented] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
[ https://issues.apache.org/jira/browse/SPARK-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235344#comment-16235344 ] Sean Owen commented on SPARK-22424: --- This shows task 52 finished in both logs and UI.
[jira] [Created] (SPARK-22424) Task not finished for a long time in monitor UI, but I found it finished in logs
chengning created SPARK-22424:
Summary: Task not finished for a long time in monitor UI, but I found it finished in logs
Key: SPARK-22424
URL: https://issues.apache.org/jira/browse/SPARK-22424
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.1.0
Reporter: chengning
Priority: Blocking
[jira] [Updated] (SPARK-22423) Scala test source files like TestHiveSingleton.scala should be in scala source root
[ https://issues.apache.org/jira/browse/SPARK-22423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22423: -- Summary: Scala test source files like TestHiveSingleton.scala should be in scala source root (was: The TestHiveSingleton.scala file should be in scala directory) There are several files in the wrong tree. Could you try fixing all of these?
./mllib/src/test/java/org/apache/spark/ml/util/IdentifiableSuite.scala
./streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala
./streaming/src/test/java/org/apache/spark/streaming/api/java/JavaStreamingListenerWrapperSuite.scala
./sql/hive/src/test/java/org/apache/spark/sql/hive/test/TestHiveSingleton.scala
> Scala test source files like TestHiveSingleton.scala should be in scala source root
> ----
> Key: SPARK-22423
> URL: https://issues.apache.org/jira/browse/SPARK-22423
> Project: Spark
> Issue Type: Test
> Components: Tests
> Affects Versions: 2.2.0
> Reporter: xubo245
> Priority: Minor
>
> The TestHiveSingleton.scala file should be in the scala directory, not in the java directory.
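A list like the one above can be produced mechanically; here is a sketch, run from the repository root, of finding Scala sources that live under a `java/` test-source tree, plus a commented-out `git mv` (the destination path is the parallel `scala/` tree, assumed from the standard Maven layout):

```shell
# Find Scala sources sitting under a java/ test-source root.
find . -path '*/src/test/java/*' -name '*.scala'

# Moving one of them into the parallel scala/ tree, preserving git history:
# git mv sql/hive/src/test/java/org/apache/spark/sql/hive/test/TestHiveSingleton.scala \
#        sql/hive/src/test/scala/org/apache/spark/sql/hive/test/TestHiveSingleton.scala
```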
[jira] [Assigned] (SPARK-22423) The TestHiveSingleton.scala file should be in scala directory
[ https://issues.apache.org/jira/browse/SPARK-22423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22423: Assignee: Apache Spark
[jira] [Commented] (SPARK-22423) The TestHiveSingleton.scala file should be in scala directory
[ https://issues.apache.org/jira/browse/SPARK-22423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235336#comment-16235336 ] Apache Spark commented on SPARK-22423: -- User 'xubo245' has created a pull request for this issue: https://github.com/apache/spark/pull/19639
[jira] [Assigned] (SPARK-22423) The TestHiveSingleton.scala file should be in scala directory
[ https://issues.apache.org/jira/browse/SPARK-22423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22423: Assignee: (was: Apache Spark)
[jira] [Resolved] (SPARK-22419) Hive and Hive Thriftserver jars missing from "without hadoop" build
[ https://issues.apache.org/jira/browse/SPARK-22419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22419. --- Resolution: Not A Problem Fix Version/s: (was: 2.1.1) This is on purpose anyway, and questions should go to the mailing list.
> Hive and Hive Thriftserver jars missing from "without hadoop" build
> ----
> Key: SPARK-22419
> URL: https://issues.apache.org/jira/browse/SPARK-22419
> Project: Spark
> Issue Type: Question
> Components: Build
> Affects Versions: 2.1.1
> Reporter: Adam Kramer
> Priority: Minor
>
> The "without hadoop" binary distribution does not have Hive-related libraries in the jars directory. This may be due to Hive being tied to major releases of Hadoop. My project requires Hadoop 2.8, so the "without hadoop" version seemed the best option. Should I use make-distribution.sh instead?
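For the make-distribution.sh route the reporter asks about, a build can omit the bundled Hadoop jars while still including the Hive and Thrift server modules by enabling the corresponding Maven profiles. A sketch, run from a Spark source checkout; the profile names come from the Spark build, while the distribution name is illustrative:

```shell
# Build a Spark distribution without bundled Hadoop jars but with Hive support.
./dev/make-distribution.sh --name hadoop-provided-hive --tgz \
  -Phive -Phive-thriftserver -Phadoop-provided
```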