[jira] [Commented] (SPARK-24288) Enable preventing predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559339#comment-16559339 ]

Xiao Li commented on SPARK-24288:
---------------------------------

[~hyukjin.kwon] I updated the JIRA description.

> Enable preventing predicate pushdown
> ------------------------------------
>
>                 Key: SPARK-24288
>                 URL: https://issues.apache.org/jira/browse/SPARK-24288
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Tomasz Gawęda
>            Assignee: Maryann Xue
>            Priority: Major
>             Fix For: 2.4.0
>
>         Attachments: SPARK-24288.simple.patch
>
> Issue discussed on the mailing list:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html
>
> While working with the JDBC data source I saw that many "or" clauses with
> non-equality operators cause huge performance degradation of the SQL query
> sent to the database (DB2). For example:
> {code}
> val df = spark.read.format("jdbc").(other options to parallelize load).load()
> df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > 100)").show()
> // in the real application the pushed predicates were many lines long, with many ANDs and ORs
> {code}
> If I use cache() before the where, there is no predicate pushdown of this
> "where" clause. However, in a production system caching many sources is a
> waste of memory (especially if the pipeline is long and I must cache many
> times). There are also a few more workarounds, but it would be great if Spark
> supported letting the user prevent predicate pushdown.
>
> For example: df.withAnalysisBarrier().where(...) ?
>
> Note that this should not be a global configuration option. If I read two
> DataFrames, df1 and df2, I would like to specify that df1 should not have
> some predicates pushed down (though some may be), while df2 should have all
> predicates pushed down, even if the target query joins df1 and df2. As far as I
> understand the Spark optimizer, if we use functions like `withAnalysisBarrier`
> and put AnalysisBarrier explicitly in the logical plan, then predicates won't
> be pushed down into that particular DataFrame while pushdown will still be
> possible on the second one.
>
> Update: The solution is to add a JDBC option "pushDownPredicate" (default
> true) to allow/disallow predicate push-down in the JDBC data source.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
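The effect of the `pushDownPredicate` option can be modeled outside Spark. Below is a toy Python sketch (the function and variable names are hypothetical, not Spark's JDBC internals): with push-down enabled the predicate is evaluated "at the source", and with it disabled the source returns every row and the engine filters afterwards, which is exactly the trade the reporter wants for complex OR-heavy predicates.

```python
# Toy model of the pushDownPredicate gate (hypothetical names, not Spark code):
# when push-down is disabled, the "database" returns all rows and the engine
# applies the predicate itself, so no complex WHERE clause reaches the source.

rows = [{"x": 50}, {"x": 150}, {"x": 300}]

def scan(predicate, push_down_predicate=True):
    """Simulate a JDBC scan; return (rows fetched from the source, final rows)."""
    if push_down_predicate:
        fetched = [r for r in rows if predicate(r)]   # predicate runs "in the database"
        return fetched, fetched
    fetched = list(rows)                              # full scan; predicate runs in the engine
    return fetched, [r for r in fetched if predicate(r)]

pred = lambda r: r["x"] > 100
fetched_pd, result_pd = scan(pred, push_down_predicate=True)
fetched_np, result_np = scan(pred, push_down_predicate=False)
assert result_pd == result_np              # same answer either way
assert len(fetched_np) > len(fetched_pd)   # without push-down the source does more I/O
```

Either way the query result is identical; only where the filtering work happens (and how ugly the generated SQL gets) changes, which is why a per-DataFrame option rather than a global flag was requested.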
[jira] [Updated] (SPARK-24288) Enable preventing predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-24288:
----------------------------
    Description: the description quoted in the comment above, with this
paragraph appended:

Update: The solution is to add a JDBC option "pushDownPredicate" (default
true) to allow/disallow predicate push-down in the JDBC data source.
[jira] [Resolved] (SPARK-24288) Enable preventing predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-24288.
-----------------------------
       Resolution: Fixed
         Assignee: Maryann Xue
    Fix Version/s: 2.4.0
[jira] [Resolved] (SPARK-24865) Remove AnalysisBarrier
[ https://issues.apache.org/jira/browse/SPARK-24865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-24865.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

Issue resolved by pull request 21822
https://github.com/apache/spark/pull/21822

> Remove AnalysisBarrier
> ----------------------
>
>                 Key: SPARK-24865
>                 URL: https://issues.apache.org/jira/browse/SPARK-24865
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>            Priority: Major
>             Fix For: 2.4.0
>
> AnalysisBarrier was introduced in SPARK-20392 to improve analysis speed
> (don't re-analyze nodes that have already been analyzed).
>
> Before AnalysisBarrier, we already had some infrastructure in place, with
> analysis-specific functions (resolveOperators and resolveExpressions). These
> functions do not recursively traverse down subplans that are already analyzed
> (tracked with a mutable boolean flag, _analyzed). The issue with the old
> system was that developers started using transformDown, which does a top-down
> traversal of the plan tree, because there was no top-down resolution
> function, and as a result analyzer performance became pretty bad.
>
> In order to fix the issue in SPARK-20392, AnalysisBarrier was introduced as a
> special node, and for this special node transform/transformUp/transformDown
> don't traverse down. However, the introduction of this special node caused a
> lot more trouble than it solved. This implicit node breaks assumptions and
> code in a few places, and it's hard to know when an analysis barrier would
> exist and when it wouldn't. Just a simple search for AnalysisBarrier in PR
> discussions demonstrates it is a source of bugs and additional complexity.
>
> Instead, I think a much simpler fix to the original issue is to introduce
> resolveOperatorsDown, and change all places that call transformDown in the
> analyzer to use that. We can also ban accidental uses of the various
> transform* methods by using a linter (which can lint only specific packages),
> or in test mode inspect the stack trace and fail explicitly if transform* is
> called in the analyzer.
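The traversal difference the proposal relies on can be sketched in a few lines. This is a hypothetical Python reduction, not Catalyst code: `transform_down` visits every node of the plan tree in pre-order, while `resolve_operators_down` prunes any subtree already flagged as analyzed, which is what restores the old performance characteristics without an explicit barrier node.

```python
# Sketch (hypothetical, heavily simplified) of transformDown vs. the proposed
# resolveOperatorsDown: both are pre-order traversals, but the resolve* variant
# does not descend into subtrees already marked analyzed.

class Node:
    def __init__(self, name, children=(), analyzed=False):
        self.name, self.children, self.analyzed = name, list(children), analyzed

def transform_down(node, visit, log):
    log.append(node.name)      # always visits, even re-analyzing finished subtrees
    visit(node)
    for c in node.children:
        transform_down(c, visit, log)

def resolve_operators_down(node, visit, log):
    if node.analyzed:          # skip the whole subtree: resolved in an earlier pass
        return
    log.append(node.name)
    visit(node)
    for c in node.children:
        resolve_operators_down(c, visit, log)

plan = Node("Project", [Node("Join", [Node("Scan1", analyzed=True), Node("Scan2")])])
a, b = [], []
transform_down(plan, lambda n: None, a)
resolve_operators_down(plan, lambda n: None, b)
assert a == ["Project", "Join", "Scan1", "Scan2"]
assert b == ["Project", "Join", "Scan2"]   # analyzed subtree skipped
```

The skip lives in the traversal function rather than in a special plan node, so no rule in the analyzer can be surprised by an extra node appearing in the tree.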
[jira] [Assigned] (SPARK-24865) Remove AnalysisBarrier
[ https://issues.apache.org/jira/browse/SPARK-24865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-24865:
-----------------------------------
    Assignee: Reynold Xin
[jira] [Commented] (SPARK-11083) insert overwrite table failed when beeline reconnect
[ https://issues.apache.org/jira/browse/SPARK-11083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559297#comment-16559297 ]

readme_kylin commented on SPARK-11083:
--------------------------------------

Is anyone working on this issue? Spark 2.1.0 Thrift Server:
{code}
java.lang.reflect.InvocationTargetException
  at sun.reflect.GeneratedMethodAccessor122.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:716)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:672)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:672)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:672)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
  at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:230)
  at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:229)
  at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:272)
  at org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:671)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:741)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:739)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:739)
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95)
  at org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:739)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:323)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:170)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:347)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
  at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)
  at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:220)
  at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:163)
  at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:160)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
  at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:173)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs
{code}

> insert overwrite table failed when beeline reconnect
> ----------------------------------------------------
>
>                 Key: SPARK-11083
>                 URL: https://issues.apache.org/jira/browse/SPARK-11083
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>        Environment: Spark: master branch
>                     Hadoop: 2.7.1
>                     JDK: 1.8.0_60
>            Reporter: Weizhong
>            Assignee: Davies Liu
>            Priority: Major
>
> 1. Start Thriftserver
> 2. Use
[jira] [Resolved] (SPARK-24929) Merge script swallow KeyboardInterrupt
[ https://issues.apache.org/jira/browse/SPARK-24929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-24929.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

Issue resolved by pull request 21880
https://github.com/apache/spark/pull/21880

> Merge script swallow KeyboardInterrupt
> --------------------------------------
>
>                 Key: SPARK-24929
>                 URL: https://issues.apache.org/jira/browse/SPARK-24929
>             Project: Spark
>          Issue Type: Improvement
>          Components: Project Infra
>    Affects Versions: 2.4.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Trivial
>             Fix For: 2.4.0
>
> If I try to get out of the JIRA-assignee loop with Command+C
> (KeyboardInterrupt), I am unable to, as below:
> {code}
> Error assigning JIRA, try again (or leave blank and fix manually)
> JIRA is unassigned, choose assignee
> [0] todd.chen (Reporter)
> Enter number of user, or userid, to assign to (blank to leave unassigned):Traceback (most recent call last):
>   File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee
>     "Enter number of user, or userid, to assign to (blank to leave unassigned):")
> KeyboardInterrupt
> Error assigning JIRA, try again (or leave blank and fix manually)
> JIRA is unassigned, choose assignee
> [0] todd.chen (Reporter)
> Enter number of user, or userid, to assign to (blank to leave unassigned):Traceback (most recent call last):
>   File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee
>     "Enter number of user, or userid, to assign to (blank to leave unassigned):")
> KeyboardInterrupt
> {code}
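One plausible shape of this bug (a hypothetical reduction, not the actual `merge_spark_pr.py` code) is a retry loop whose exception handler is too broad: a bare `except:` catches `KeyboardInterrupt` along with the errors it was meant to retry, so Ctrl+C can never break out. Narrowing the handler to `Exception` leaves `BaseException` subclasses such as `KeyboardInterrupt` free to propagate.

```python
# Hypothetical reduction of the swallowed-interrupt pattern. The retry loop is
# meant to survive bad input, but a bare except also eats KeyboardInterrupt.

def retry_swallowing(action):
    for _ in range(3):
        try:
            return action()
        except:                 # swallows KeyboardInterrupt too -- the bug
            pass

def retry_fixed(action):
    for _ in range(3):
        try:
            return action()
        except Exception:       # KeyboardInterrupt is not an Exception subclass
            pass

def raises_interrupt():
    raise KeyboardInterrupt

assert retry_swallowing(raises_interrupt) is None   # interrupt silently eaten
interrupted = False
try:
    retry_fixed(raises_interrupt)
except KeyboardInterrupt:
    interrupted = True
assert interrupted                                  # Ctrl+C now escapes the loop
```

The same effect can be had by keeping a broad handler but re-raising `KeyboardInterrupt` explicitly; either way the interrupt must not be treated as a retryable error.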
[jira] [Assigned] (SPARK-24929) Merge script swallow KeyboardInterrupt
[ https://issues.apache.org/jira/browse/SPARK-24929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-24929:
------------------------------------
    Assignee: Hyukjin Kwon
[jira] [Resolved] (SPARK-24829) In Spark Thrift Server, CAST AS FLOAT inconsistent with spark-shell or spark-sql
[ https://issues.apache.org/jira/browse/SPARK-24829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-24829.
----------------------------------
       Resolution: Fixed
         Assignee: zuotingbing
    Fix Version/s: 2.4.0

Fixed in https://github.com/apache/spark/pull/21789

> In Spark Thrift Server, CAST AS FLOAT inconsistent with spark-shell or
> spark-sql
> --------------------------------------------------------------------------
>
>                 Key: SPARK-24829
>                 URL: https://issues.apache.org/jira/browse/SPARK-24829
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: zuotingbing
>            Assignee: zuotingbing
>            Priority: Major
>             Fix For: 2.4.0
>
>         Attachments: 2018-07-18_110944.png, 2018-07-18_11.png
>
> SELECT CAST('4.56' AS FLOAT)
> The result is 4.55942779541; it should be 4.56.
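The long decimal tail comes from 32-bit IEEE-754 representation: SQL `FLOAT` is a single-precision type and 4.56 has no exact binary form, so the nearest float32 sits slightly below 4.56; when a client widens that value back to a double and prints it, the tail appears. A small Python round-trip through 32 bits (a sketch of the representation issue, not of the Thrift Server fix itself) shows this:

```python
# Round-trip a value through a 32-bit IEEE-754 float to see what SQL FLOAT
# actually stores. 4.56 is not exactly representable in 32 bits, so the stored
# value differs from 4.56 in roughly the 8th significant digit.

import struct

def as_float32(x):
    """Pack a Python float (64-bit) into 4 bytes and unpack it again."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

v = as_float32(4.56)
assert v != 4.56              # not exactly representable in 32 bits
assert abs(v - 4.56) < 1e-6   # but very close; the inconsistency is in rendering
```

spark-shell and spark-sql round the value for display while the Thrift Server path did not, which is why the same cast looked different between clients.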
[jira] [Created] (SPARK-24942) Improve cluster resource management with jobs containing barrier stage
Jiang Xingbo created SPARK-24942:
------------------------------------

             Summary: Improve cluster resource management with jobs containing barrier stage
                 Key: SPARK-24942
                 URL: https://issues.apache.org/jira/browse/SPARK-24942
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: Jiang Xingbo

https://github.com/apache/spark/pull/21758#discussion_r205652317

We shall improve cluster resource management to address the following issues:
- With dynamic resource allocation enabled, it may happen that we acquire some
  executors (but not enough to launch all the tasks in a barrier stage) and
  later release them when the executor idle time expires, and then acquire
  again.
- There can be deadlock between two concurrent applications. Each application
  may acquire some resources, but not enough to launch all the tasks in a
  barrier stage. After hitting the idle timeout and releasing them, they may
  acquire resources again, but just continually trade resources between each
  other.
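Both problems above stem from partially acquiring resources for a stage whose tasks must all start together. The scheduling invariant can be stated in one line; this is a hypothetical sketch of the rule, not Spark's scheduler code:

```python
# Barrier semantics: all tasks of the stage start together or none do.
# Launching with fewer slots than tasks is what creates the idle-timeout churn
# and the cross-application resource trading described above. (Hypothetical
# reduction, not Spark scheduler code.)

def can_launch_barrier_stage(free_slots, num_tasks):
    """Gate a barrier stage on having a slot for every one of its tasks."""
    return free_slots >= num_tasks

assert can_launch_barrier_stage(free_slots=4, num_tasks=4)
assert not can_launch_barrier_stage(free_slots=3, num_tasks=4)
```

A real fix additionally has to decide when to hold partially acquired executors versus release them, which is the hard part the ticket is about.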
[jira] [Created] (SPARK-24941) Add RDDBarrier.coalesce() function
Jiang Xingbo created SPARK-24941:
------------------------------------

             Summary: Add RDDBarrier.coalesce() function
                 Key: SPARK-24941
                 URL: https://issues.apache.org/jira/browse/SPARK-24941
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: Jiang Xingbo

https://github.com/apache/spark/pull/21758#discussion_r204917245

The number of partitions from the input data can be unexpectedly large, e.g. if you do
{code}
sc.textFile(...).barrier().mapPartitions()
{code}
the number of input partitions is based on the HDFS input splits. We shall
provide a way in RDDBarrier to enable users to specify the number of tasks in
a barrier stage. Maybe something like RDDBarrier.coalesce(numPartitions: Int).
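What such a `coalesce(numPartitions)` has to do can be sketched outside Spark: deterministically fold an arbitrary number of input partitions into exactly `n` groups so the barrier stage has a predictable task count. The round-robin grouping below is a hypothetical illustration; Spark's actual coalescing also considers data locality.

```python
# Minimal sketch (hypothetical, outside Spark) of folding many input partitions
# into exactly num_partitions buckets so a barrier stage has a fixed size.

def coalesce(partitions, num_partitions):
    """Group `partitions` into `num_partitions` buckets, round-robin."""
    buckets = [[] for _ in range(num_partitions)]
    for i, p in enumerate(partitions):
        buckets[i % num_partitions].extend(p)
    return buckets

parts = [[1], [2], [3], [4], [5]]      # e.g. one partition per HDFS input split
out = coalesce(parts, 2)
assert len(out) == 2                   # barrier stage now has exactly 2 tasks
assert sorted(sum(out, [])) == [1, 2, 3, 4, 5]   # no records lost
```

The key property for barrier mode is the fixed `len(out)`: the scheduler can then reason about exactly how many slots the stage needs before launching it.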
[jira] [Resolved] (SPARK-24801) Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can waste a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-24801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Imran Rashid resolved SPARK-24801.
----------------------------------
       Resolution: Fixed
         Assignee: Misha Dmitriev
    Fix Version/s: 2.4.0

Fixed in commit
https://github.com/apache/spark/commit/094aa597155dfcbf41a2490c9e462415e3824901
from https://github.com/apache/spark/pull/21811

> Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can
> waste a lot of memory
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-24801
>                 URL: https://issues.apache.org/jira/browse/SPARK-24801
>             Project: Spark
>          Issue Type: Improvement
>          Components: YARN
>    Affects Versions: 2.3.0
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>            Priority: Major
>              Labels: memory-analysis
>             Fix For: 2.4.0
>
> I recently analyzed another YARN NM heap dump with jxray
> ([www.jxray.com|http://www.jxray.com]) and found that 81% of memory is
> wasted by empty (all-zero) byte[] arrays. Most of these arrays are
> referenced by {{org.apache.spark.network.util.ByteArrayWritableChannel.data}},
> and these in turn come from
> {{spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel}}. Here is
> the full reference chain that leads to the problematic arrays:
> {code:java}
> 2,597,946K (64.1%): byte[]: 40583 / 100% of empty 2,597,946K (64.1%)
> ↖org.apache.spark.network.util.ByteArrayWritableChannel.data
> ↖org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel
> ↖io.netty.channel.ChannelOutboundBuffer$Entry.msg
> ↖io.netty.channel.ChannelOutboundBuffer$Entry.{next}
> ↖io.netty.channel.ChannelOutboundBuffer.flushedEntry
> ↖io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer
> ↖io.netty.channel.socket.nio.NioSocketChannel.unsafe
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance{code}
>
> Checking the code of {{SaslEncryption$EncryptedMessage}}, I see that
> byteChannel is always initialized eagerly in the constructor:
> {code:java}
> this.byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize);{code}
> So I think to address the problem of empty byte[] arrays flooding the memory,
> we should initialize {{byteChannel}} lazily, upon first use. As far as I can
> see, it's used in only one method, {{private void nextChunk()}}.
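The proposed lazy initialization is a standard pattern. Below is a hypothetical Python reduction of the Java change (field and class names borrowed from the issue for illustration only): the buffer is allocated on first access instead of in the constructor, so messages that are never chunked never pay for a `maxOutboundBlockSize`-sized zeroed array.

```python
# Lazy-initialization sketch (hypothetical Python reduction of the Java fix):
# allocate the byte channel on first use rather than eagerly in the
# constructor, so idle messages hold no large zeroed buffer.

class EncryptedMessage:
    def __init__(self, max_outbound_block_size):
        self._size = max_outbound_block_size
        self._byte_channel = None        # was: eagerly bytearray(max_outbound_block_size)

    @property
    def byte_channel(self):
        if self._byte_channel is None:   # allocated only when a chunk is produced
            self._byte_channel = bytearray(self._size)
        return self._byte_channel

msg = EncryptedMessage(32 * 1024)
assert msg._byte_channel is None         # no allocation until first use
assert len(msg.byte_channel) == 32 * 1024
assert msg._byte_channel is not None     # allocated exactly on first access
```

Since the issue notes the buffer is used in only one method (`nextChunk()`), moving the allocation to that single call site is low-risk.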
[jira] [Assigned] (SPARK-24932) Allow update mode for streaming queries with join
[ https://issues.apache.org/jira/browse/SPARK-24932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24932:
------------------------------------
    Assignee: (was: Apache Spark)

> Allow update mode for streaming queries with join
> -------------------------------------------------
>
>                 Key: SPARK-24932
>                 URL: https://issues.apache.org/jira/browse/SPARK-24932
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.3.0, 2.3.1
>            Reporter: Eric Fu
>            Priority: Major
>
> In issue SPARK-19140 we supported update output mode for non-aggregation
> streaming queries. This should also be applied to streaming join to keep
> semantics consistent.
> PS. The streaming join feature was added after SPARK-19140.
[jira] [Commented] (SPARK-24932) Allow update mode for streaming queries with join
[ https://issues.apache.org/jira/browse/SPARK-24932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559181#comment-16559181 ]

Apache Spark commented on SPARK-24932:
--------------------------------------

User 'fuyufjh' has created a pull request for this issue:
https://github.com/apache/spark/pull/21890
[jira] [Updated] (SPARK-24932) Allow update mode for streaming queries with join
[ https://issues.apache.org/jira/browse/SPARK-24932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Fu updated SPARK-24932:
----------------------------
    Description: the description quoted above, with the following appended:

When using update _output_ mode the join will work exactly as in _append_
mode. However, this will, for example, allow the user to run an
aggregation-after-join query in update mode in order to get a more real-time
result output.
[jira] [Assigned] (SPARK-24932) Allow update mode for streaming queries with join
[ https://issues.apache.org/jira/browse/SPARK-24932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24932: Assignee: Apache Spark > Allow update mode for streaming queries with join > - > > Key: SPARK-24932 > URL: https://issues.apache.org/jira/browse/SPARK-24932 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1 >Reporter: Eric Fu >Assignee: Apache Spark >Priority: Major > > In issue SPARK-19140 we supported update output mode for non-aggregation > streaming queries. This should also be applied to streaming join to keep > semantic consistent. > PS. Streaming join feature is added after SPARK-19140. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4502) Spark SQL reads unnecessary nested fields from Parquet
[ https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559083#comment-16559083 ] Apache Spark commented on SPARK-4502: - User 'ajacques' has created a pull request for this issue: https://github.com/apache/spark/pull/21889 > Spark SQL reads unnecessary nested fields from Parquet > -- > > Key: SPARK-4502 > URL: https://issues.apache.org/jira/browse/SPARK-4502 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Liwen Sun >Priority: Critical > > When reading a field of a nested column from Parquet, Spark SQL reads and > assembles all the fields of that nested column. This is unnecessary, as > Parquet supports fine-grained field reads out of a nested column. This may > degrade performance significantly when a nested column has many fields. > For example, I loaded JSON tweets data into Spark SQL and ran the following > query: > {{SELECT User.contributors_enabled from Tweets;}} > User is a nested structure that has 38 primitive fields (for the Tweets schema, > see: https://dev.twitter.com/overview/api/tweets); here is the log message: > {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed > 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 > cell/ms}} > For comparison, I also ran: > {{SELECT User FROM Tweets;}} > And here is the log message: > {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed > 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}} > So both queries load 38 columns from Parquet, while the first query only > needs 1 column. I also measured the bytes read within Parquet. In these two > cases, the same number of bytes (99365194 bytes) were read.
[jira] [Updated] (SPARK-24940) Coalesce Hint for SQL Queries
[ https://issues.apache.org/jira/browse/SPARK-24940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-24940: --- Summary: Coalesce Hint for SQL Queries (was: Coalesce Hint for SQL) > Coalesce Hint for SQL Queries > - > > Key: SPARK-24940 > URL: https://issues.apache.org/jira/browse/SPARK-24940 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: John Zhuge >Priority: Major > > Many Spark SQL users in my company have asked for a way to control the number > of output files in Spark SQL. The users prefer not to use function > repartition\(n\) or coalesce(n, shuffle) that require them to write and > deploy Scala/Java/Python code. > > There are use cases to either reduce or increase the number. > > The DataFrame API has repartition/coalesce for a long time. However, we do > not have an equivalent functionality in SQL queries. We propose adding the > following Hive-style Coalesce hint to Spark SQL. > {noformat} > /*+ COALESCE(n, shuffle) */ > /*+ REPARTITION(n) */ > {noformat} > REPARTITION\(n\) is equal to COALESCE(n, shuffle=true). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24940) Coalesce Hint for SQL
[ https://issues.apache.org/jira/browse/SPARK-24940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-24940: --- Description: Many Spark SQL users in my company have asked for a way to control the number of output files in Spark SQL. The users prefer not to use function repartition\(n\) or coalesce(n, shuffle) that require them to write and deploy Scala/Java/Python code. There are use cases to either reduce or increase the number. The DataFrame API has repartition/coalesce for a long time. However, we do not have an equivalent functionality in SQL queries. We propose adding the following Hive-style Coalesce hint to Spark SQL. {noformat} /*+ COALESCE(n, shuffle) */ /*+ REPARTITION(n) */ {noformat} REPARTITION\(n\) is equal to COALESCE(n, shuffle=true). was: Many Spark SQL users in my company have asked for a way to control the number of output files in Spark SQL. The users prefer not to use function repartition(n) or coalesce(n, shuffle) that require them to write and deploy Scala/Java/Python code. There are use cases to either reduce or increase the number. The DataFrame API has repartition/coalesce for a long time. However, we do not have an equivalent functionality in SQL queries. We propose adding the following Hive-style Coalesce hint to Spark SQL. {noformat} /*+ COALESCE(n, shuffle) */ /*+ REPARTITION(n) */ {noformat} REPARTITION(n) is equal to COALESCE(n, shuffle=true). > Coalesce Hint for SQL > - > > Key: SPARK-24940 > URL: https://issues.apache.org/jira/browse/SPARK-24940 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: John Zhuge >Priority: Major > > Many Spark SQL users in my company have asked for a way to control the number > of output files in Spark SQL. The users prefer not to use function > repartition\(n\) or coalesce(n, shuffle) that require them to write and > deploy Scala/Java/Python code. > > There are use cases to either reduce or increase the number. 
> > The DataFrame API has had repartition/coalesce for a long time. However, we do > not have equivalent functionality in SQL queries. We propose adding the > following Hive-style Coalesce hint to Spark SQL. > {noformat} > /*+ COALESCE(n, shuffle) */ > /*+ REPARTITION(n) */ > {noformat} > REPARTITION\(n\) is equivalent to COALESCE(n, shuffle=true).
[jira] [Created] (SPARK-24940) Coalesce Hint for SQL
John Zhuge created SPARK-24940: -- Summary: Coalesce Hint for SQL Key: SPARK-24940 URL: https://issues.apache.org/jira/browse/SPARK-24940 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.1 Reporter: John Zhuge Many Spark SQL users in my company have asked for a way to control the number of output files in Spark SQL. The users prefer not to use the functions repartition(n) or coalesce(n, shuffle), which require them to write and deploy Scala/Java/Python code. There are use cases to either reduce or increase the number. The DataFrame API has had repartition/coalesce for a long time. However, we do not have equivalent functionality in SQL queries. We propose adding the following Hive-style Coalesce hint to Spark SQL. {noformat} /*+ COALESCE(n, shuffle) */ /*+ REPARTITION(n) */ {noformat} REPARTITION(n) is equivalent to COALESCE(n, shuffle=true).
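To make the proposal concrete, here is one way the hint might appear in a query. This is purely illustrative: the hint syntax is the proposal above, not released behavior, and the table names are hypothetical.

```sql
-- Hypothetical usage of the proposed hint: produce roughly 10 output files
-- without calling DataFrame.coalesce(10) from Scala/Java/Python code.
INSERT OVERWRITE TABLE dst
SELECT /*+ COALESCE(10) */ * FROM src;
```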
[jira] [Resolved] (SPARK-24919) Scala linter rule for sparkContext.hadoopConfiguration
[ https://issues.apache.org/jira/browse/SPARK-24919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24919. - Resolution: Fixed Assignee: Gengliang Wang Fix Version/s: 2.4.0 > Scala linter rule for sparkContext.hadoopConfiguration > -- > > Key: SPARK-24919 > URL: https://issues.apache.org/jira/browse/SPARK-24919 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > > In most cases, we should use spark.sessionState.newHadoopConf() instead of > sparkContext.hadoopConfiguration, so that the hadoop configurations specified > in Spark session > configuration will come into effect. > Add a rule matching spark.sparkContext.hadoopConfiguration or > spark.sqlContext.sparkContext.hadoopConfiguration to prevent the usage. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
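The rule described above amounts to flagging two specific call chains. A minimal sketch of such a check in plain Java (the real rule is a Scalastyle regex in Spark's build configuration; the class and method names here are illustrative):

```java
import java.util.regex.Pattern;

// Illustrative checker in the spirit of the linter rule described above:
// flag direct uses of sparkContext.hadoopConfiguration, which bypass
// Hadoop settings made in the Spark session configuration.
public class HadoopConfRule {
    private static final Pattern BANNED = Pattern.compile(
        "spark(\\.sqlContext)?\\.sparkContext\\.hadoopConfiguration");

    // Returns true when the line should be rejected by the linter.
    public static boolean violates(String line) {
        return BANNED.matcher(line).find();
    }
}
```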
[jira] [Commented] (SPARK-24253) DataSourceV2: Add DeleteSupport for delete and overwrite operations
[ https://issues.apache.org/jira/browse/SPARK-24253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558971#comment-16558971 ] Apache Spark commented on SPARK-24253: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/21888 > DataSourceV2: Add DeleteSupport for delete and overwrite operations > --- > > Key: SPARK-24253 > URL: https://issues.apache.org/jira/browse/SPARK-24253 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > Implementing delete and overwrite logical plans requires an API to delete > data from a data source. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-24937) Datasource partition table should load empty partitions
[ https://issues.apache.org/jira/browse/SPARK-24937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24937: Comment: was deleted (was: I'm working on.) > Datasource partition table should load empty partitions > --- > > Key: SPARK-24937 > URL: https://issues.apache.org/jira/browse/SPARK-24937 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:sql} > spark-sql> CREATE TABLE tbl AS SELECT 1; > spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > > USING parquet > > PARTITIONED BY (day, hour); > spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > spark-sql> SHOW PARTITIONS tbl1; > spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > > PARTITIONED BY (day STRING, hour STRING); > 18/07/26 22:49:20 WARN HiveMetaStore: Location: > file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external > table:tbl2 > spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 > 18/07/26 22:49:36 WARN log: Updated size to 0 > spark-sql> SHOW PARTITIONS tbl2; > day=2018-07-25/hour=01 > spark-sql> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24939) Support YARN Shared Cache in Spark
Jonathan Bender created SPARK-24939: --- Summary: Support YARN Shared Cache in Spark Key: SPARK-24939 URL: https://issues.apache.org/jira/browse/SPARK-24939 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 2.3.1 Reporter: Jonathan Bender https://issues.apache.org/jira/browse/YARN-1492 introduced support for the YARN Shared Cache, which, when configured, allows clients to cache submitted application resources (jars, archives) in HDFS and avoid having to re-upload them for successive jobs. MapReduce YARN applications support this feature; it would be great to add support for it in Spark as well.
[jira] [Commented] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558829#comment-16558829 ] Imran Rashid commented on SPARK-24918: -- The only thing I *really* needed was just to be able to instantiate some arbitrary class when the executor starts up. My instrumentation code could do the rest via reflection from there. But I might want more eventually, e.g. with task start & end events, I could imagine setting something up to periodically take stack traces only if there is a task running in stage X, or running for longer than Y seconds, etc. > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. It's hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so you don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your Spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding Spark). > I think one tricky part here is just deciding the API, and how it's versioned. > Does it just get created when the executor starts, and that's it? Or does it > get more specific events, like task start, task end, etc.? Would we ever add > more events?
It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but it should still be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver, as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, it's still good enough).
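The design space described in this ticket — a class instantiated at executor startup, possibly gaining task events later, with no-op defaults so compatibility survives new events — could look roughly like the following. This is a purely hypothetical sketch; the actual API was still undecided at this point in the discussion.

```java
// Hypothetical plugin interface matching the discussion above. Default
// no-op methods mean adding events later would not break implementations.
public interface ExecutorPlugin {
    // Called once when the executor starts; enough for the reflection-based
    // instrumentation use case described in the comment.
    default void init() {}
    // Possible richer events floated in the discussion:
    default void onTaskStart() {}
    default void onTaskEnd() {}
    // Called when the executor shuts down.
    default void shutdown() {}
}
```

An instrumentation plugin would then override only the hooks it needs and ignore the rest.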
[jira] [Updated] (SPARK-24801) Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can waste a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-24801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-24801: - Labels: memory-analysis (was: ) > Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can > waste a lot of memory > --- > > Key: SPARK-24801 > URL: https://issues.apache.org/jira/browse/SPARK-24801 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Misha Dmitriev >Priority: Major > Labels: memory-analysis > > I recently analyzed another Yarn NM heap dump with jxray > ([www.jxray.com),|http://www.jxray.com),/] and found that 81% of memory is > wasted by empty (all zeroes) byte[] arrays. Most of these arrays are > referenced by > {{org.apache.spark.network.util.ByteArrayWritableChannel.data}}, and these in > turn come from > {{spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel}}. Here is > the full reference chain that leads to the problematic arrays: > {code:java} > 2,597,946K (64.1%): byte[]: 40583 / 100% of empty 2,597,946K (64.1%) > ↖org.apache.spark.network.util.ByteArrayWritableChannel.data > ↖org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel > ↖io.netty.channel.ChannelOutboundBuffer$Entry.msg > ↖io.netty.channel.ChannelOutboundBuffer$Entry.{next} > ↖io.netty.channel.ChannelOutboundBuffer.flushedEntry > ↖io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer > ↖io.netty.channel.socket.nio.NioSocketChannel.unsafe > ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel > ↖{java.util.concurrent.ConcurrentHashMap}.values > ↖org.apache.spark.network.server.OneForOneStreamManager.streams > ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager > ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler > ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance{code} > > Checking the code of 
{{SaslEncryption$EncryptedMessage}}, I see that > byteChannel is always initialized eagerly in the constructor: > {code:java} > this.byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize);{code} > So I think to address the problem of empty byte[] arrays flooding the memory, > we should initialize {{byteChannel}} lazily, upon the first use. As far as I > can see, it's used only in one method, {{private void nextChunk()}}. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
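The lazy-initialization fix suggested above can be sketched in plain Java. The class and method names here are illustrative stand-ins, not Spark's actual {{SaslEncryption}} code:

```java
// Illustrative sketch: defer the byte[] allocation until the buffer is
// first needed, so messages that never reach nextChunk() cost no memory.
public class LazyByteChannel {
    private final int maxOutboundBlockSize;
    private byte[] data; // stays null until first use

    public LazyByteChannel(int maxOutboundBlockSize) {
        this.maxOutboundBlockSize = maxOutboundBlockSize;
    }

    public boolean isAllocated() {
        return data != null;
    }

    // The single call site (analogous to nextChunk()) triggers allocation.
    public byte[] buffer() {
        if (data == null) {
            data = new byte[maxOutboundBlockSize];
        }
        return data;
    }
}
```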
[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools
[ https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558798#comment-16558798 ] Imran Rashid commented on SPARK-24938: -- This should be an easy change to make; it's just a question of running a test. I keep meaning to do it, but have too many other things in flight, so anybody is welcome to do it. You could use SPARK-24918 and https://github.com/squito/spark-memory to check the memory usage before and after. > Understand usage of netty's onheap memory use, even with offheap pools > -- > > Key: SPARK-24938 > URL: https://issues.apache.org/jira/browse/SPARK-24938 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > We've observed that netty uses a large amount of onheap memory in its pools, in > addition to the expected offheap memory, when I added some instrumentation > (using SPARK-24918 and https://github.com/squito/spark-memory). We should > figure out why it's using that memory, and whether it's really necessary. > It might be just this one line: > https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82 > which means that even with a small burst of messages, each arena will grow by > 16MB, which could lead to a 128 MB spike of an almost entirely unused pool. > Switching to requesting a buffer from the default pool would probably fix > this.
[jira] [Created] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools
Imran Rashid created SPARK-24938: Summary: Understand usage of netty's onheap memory use, even with offheap pools Key: SPARK-24938 URL: https://issues.apache.org/jira/browse/SPARK-24938 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0 Reporter: Imran Rashid We've observed that netty uses a large amount of onheap memory in its pools, in addition to the expected offheap memory, when I added some instrumentation (using SPARK-24918 and https://github.com/squito/spark-memory). We should figure out why it's using that memory, and whether it's really necessary. It might be just this one line: https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82 which means that even with a small burst of messages, each arena will grow by 16MB, which could lead to a 128 MB spike of an almost entirely unused pool. Switching to requesting a buffer from the default pool would probably fix this.
[jira] [Assigned] (SPARK-23633) Update Pandas UDFs section in sql-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23633: Assignee: (was: Apache Spark) > Update Pandas UDFs section in sql-programming-guide > > > Key: SPARK-23633 > URL: https://issues.apache.org/jira/browse/SPARK-23633 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Li Jin >Priority: Major > > Let's make sure sql-programming-guide is up-to-date before 2.4 release. > https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#pandas-udfs-aka-vectorized-udfs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23633) Update Pandas UDFs section in sql-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558796#comment-16558796 ] Apache Spark commented on SPARK-23633: -- User 'icexelloss' has created a pull request for this issue: https://github.com/apache/spark/pull/21887 > Update Pandas UDFs section in sql-programming-guide > > > Key: SPARK-23633 > URL: https://issues.apache.org/jira/browse/SPARK-23633 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Li Jin >Priority: Major > > Let's make sure sql-programming-guide is up-to-date before 2.4 release. > https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#pandas-udfs-aka-vectorized-udfs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23633) Update Pandas UDFs section in sql-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23633: Assignee: Apache Spark > Update Pandas UDFs section in sql-programming-guide > > > Key: SPARK-23633 > URL: https://issues.apache.org/jira/browse/SPARK-23633 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Li Jin >Assignee: Apache Spark >Priority: Major > > Let's make sure sql-programming-guide is up-to-date before 2.4 release. > https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#pandas-udfs-aka-vectorized-udfs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24795) Implement barrier execution mode
[ https://issues.apache.org/jira/browse/SPARK-24795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24795. - Resolution: Fixed Assignee: Jiang Xingbo Fix Version/s: 2.4.0 > Implement barrier execution mode > > > Key: SPARK-24795 > URL: https://issues.apache.org/jira/browse/SPARK-24795 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Major > Fix For: 2.4.0 > > > Implement barrier execution mode, as described in SPARK-24582 > Include all the API changes and basic implementation (except for > BarrierTaskContext.barrier()) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14543) SQL/Hive insertInto has unexpected results
[ https://issues.apache.org/jira/browse/SPARK-14543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-14543. --- Resolution: Later This is addressed by SPARK-24251 for DataSourceV2 writers. > SQL/Hive insertInto has unexpected results > -- > > Key: SPARK-14543 > URL: https://issues.apache.org/jira/browse/SPARK-14543 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > > *Updated description* > There should be an option to match input data to output columns by name. The > API allows operations on tables, which hide the column resolution problem. > It's easy to copy from one table to another without listing the columns, and > in the API it is common to work with columns by name rather than by position. > I think the API should add a way to match columns by name, which is closer to > what users expect. I propose adding something like this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} > *Original description* > The Hive write path adds a pre-insertion cast (projection) to reconcile > incoming data columns with the outgoing table schema. Columns are matched by > position and casts are inserted to reconcile the two column schemas. > When columns aren't correctly aligned, this causes unexpected results. I ran > into this by not using a correct {{partitionBy}} call (addressed by > SPARK-14459), which caused an error message that an int could not be cast to > an array. However, if the columns are vaguely compatible, for example string > and float, then no error or warning is produced and data is written to the > wrong columns using unexpected casts (string -> bigint -> float). > A real-world use case that will hit this is when a table definition changes > by adding a column in the middle of a table. 
Spark SQL statements that copied > from that table to a destination table will then map the columns differently > but insert casts that mask the problem. The last column's data will be > dropped without a reliable warning for the user. > This highlights a few problems: > * Too many or too few incoming data columns should cause an AnalysisException > to be thrown > * Only "safe" casts should be inserted automatically, like int -> long, using > UpCast > * Pre-insertion casts currently ignore extra columns by using zip > * The pre-insertion cast logic differs between Hive's MetastoreRelation and > LogicalRelation > Also, I think there should be an option to match input data to output columns > by name. The API allows operations on tables, which hide the column > resolution problem. It's easy to copy from one table to another without > listing the columns, and in the API it is common to work with columns by name > rather than by position. I think the API should add a way to match columns by > name, which is closer to what users expect. I propose adding something like > this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24882) separate responsibilities of the data source v2 read API
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558666#comment-16558666 ] Ryan Blue edited comment on SPARK-24882 at 7/26/18 6:19 PM: [~cloud_fan], I'm adding some suggestions here because comments on the doc are good for discussion, but not really for longer content. I like the separation of the batch, micro-batch, and streaming classes. That works well. I also like the addition of the Metadata class, though I'd use a more specific name. There are a few naming changes I would make to be more specific and to preserve existing names or naming conventions: * Instead of ReaderProvider, I think we should use ReaderFactory because that name corresponds to the write path and accurately describes the class * I think we should continue to use InputPartition instead of InputSplit, even if we introduce a reader factory. (Probably also rename SplitReader to PartitionReader.) * Metadata isn't specific so I think we should use ScanConfig instead * getSplits should be planSplits because "get" implies a quick operation (like returning a field's value) in Java. This is also consistent with the current API. Next, instead of using mix-ins on Metadata / ScanConfig, I think a builder would help clarify the order of operations. If ScanConfig is mutable, then it could be passed to the other methods in different states. I'd rather use a Builder to make it Immutable. That way, implementations know that the Metadata / ScanConfig doesn't change between calls to estimateStatistics and getSplits so results can be cached. To make this work, Spark would provide a Builder interface with default methods that do nothing. To implement pushdown, users just need to implement the methods. This also allows us to add new pushdown methods (like pushLimit) without introducing new interfaces. 
I'd also like to see the classes reorganized a little to reduce the overall number of interfaces: Metadata / ScanConfig contains all of the state that the DataSourceReader used to hold. If the DataSourceReader has no state, then its methods should be provided by a single instance of the source instead. That would change the API to get rid of the Reader level and merge it into ReadSupport. Then ReadSupport would be used to create a ScanConfig, and then BatchReadSupport (or similar) would be used to plan splits and get reader factories. I think this is easier for implementations. {code:lang=java} public interface ReadSupport { ScanConfig.Builder newScanBuilder(); } public interface ReportsStatistics extends ReadSupport { Statistics estimateStatistics(ScanConfig) } public interface BatchReadSupport extends ReadSupport { InputPartition[] planSplits(ScanConfig) ReaderFactory readerFactory() } public interface MicroBatchReadSupport extends ReadSupport { InputPartition[] planSplits(ScanConfig, Offset start, Offset end) Offset initialOffset() MicroBatchReaderFactory readerFactory() } public interface ContinuousReadSupport extends ReadSupport { InputPartition[] planSplits(ScanConfig, Offset start) Offset initialOffset() ContinuousReaderFactory readerFactory() } {code} Note that this change also cleans up the confusion around the use of Reader: the only Reader is a SplitReader that returns rows or row batches. I would keep the same structure that you have for micro-batch, continuous, and batch ReaderFactory and SplitReader. Here's a sketch of the ScanConfig and Builder I mentioned above: {code:lang=java} public interface ScanConfig { StructType schema() Filter[] pushedFilters() Expression[] pushedPredicates() // by default, the Builder doesn't push anything public interface Builder { Builder pushProjection(...) Builder pushFilters(...) default Builder pushPredicates(...) { return this; } Builder pushLimit(...) 
ScanConfig build() } } {code} was (Author: rdblue): [~cloud_fan], I'm adding some suggestions here because comments on the doc are good for discussion, but not really for longer content. I like the separation of the batch, micro-batch, and streaming classes. That works well. I also like the addition of the Metadata class, though I'd use a more specific name. There are a few naming changes I would make to be more specific and to preserve existing names or naming conventions: * Instead of ReaderProvider, I think we should use ReaderFactory because that name corresponds to the write path and accurately describes the class * I think we should continue to use InputPartition instead of InputSplit, even if we introduce a reader factory. (Probably also rename SplitReader to PartitionReader.) * Metadata isn't specific so I think we should use ScanConfig instead * getSplits should be planSplits because get implies a quick operation in Java. This is also consistent with the current API. Next,
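The immutable-Builder contract sketched above can be made concrete with a tiny self-contained example. All names below are hypothetical stand-ins for the proposed interfaces, not actual Spark APIs: the Builder's push methods default to no-ops, so a source only overrides what it can actually push down, and build() returns an immutable config whose state cannot change between estimateStatistics and planSplits.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the proposed ScanConfig/Builder contract.
final class ScanConfigSketch {
    private final List<String> pushedFilters;

    private ScanConfigSketch(List<String> pushedFilters) {
        // Immutable: the accepted pushdowns are fixed at build() time.
        this.pushedFilters = pushedFilters;
    }

    public List<String> pushedFilters() {
        return pushedFilters;
    }

    interface Builder {
        // Default implementation pushes nothing, mirroring the idea that
        // new pushdown methods can be added later without new mix-in
        // interfaces: sources that don't override them are unaffected.
        default Builder pushFilters(List<String> filters) {
            return this;
        }

        ScanConfigSketch build();
    }

    // A source that supports filter pushdown overrides pushFilters.
    static final class FilteringBuilder implements Builder {
        private List<String> accepted = Collections.emptyList();

        @Override
        public Builder pushFilters(List<String> filters) {
            this.accepted = filters;
            return this;
        }

        @Override
        public ScanConfigSketch build() {
            return new ScanConfigSketch(accepted);
        }
    }

    public static void main(String[] args) {
        ScanConfigSketch config = new FilteringBuilder()
            .pushFilters(Arrays.asList("x > 100"))
            .build();
        System.out.println(config.pushedFilters()); // prints [x > 100]
    }
}
```

Because the config is built once and never mutated, an implementation can safely cache any work derived from it across the statistics and split-planning calls.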
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558687#comment-16558687 ] Apache Spark commented on SPARK-21274: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/21886 > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL: > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as the following outer join: > {code} > SELECT a,b,c > FROM tab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register this second query as a temp view named "*t1_except_t2_df*", which > can also be used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as the following anti-join using the t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROM tab1 t1 > WHERE > NOT EXISTS > (SELECT 1 > FROM t1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) > ) > {code} > So the suggestion is just to use the above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL SQL set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
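As background for the rewrites above: EXCEPT ALL and INTERSECT ALL have multiset (bag) semantics, where each row on one side cancels at most one matching occurrence on the other. A minimal, self-contained sketch of that semantics (illustrative Java, not Spark's implementation):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Multiset semantics of EXCEPT ALL / INTERSECT ALL over plain lists.
final class BagOps {
    // EXCEPT ALL: each right-side row cancels one matching left-side row.
    static <T> List<T> exceptAll(List<T> left, List<T> right) {
        Map<T, Integer> rightCounts = new HashMap<>();
        for (T r : right) {
            rightCounts.merge(r, 1, Integer::sum);
        }
        List<T> out = new ArrayList<>();
        for (T l : left) {
            Integer c = rightCounts.get(l);
            if (c != null && c > 0) {
                rightCounts.put(l, c - 1); // cancel one occurrence
            } else {
                out.add(l);
            }
        }
        return out;
    }

    // INTERSECT ALL: keep min(countLeft, countRight) occurrences of each row.
    static <T> List<T> intersectAll(List<T> left, List<T> right) {
        Map<T, Integer> rightCounts = new HashMap<>();
        for (T r : right) {
            rightCounts.merge(r, 1, Integer::sum);
        }
        List<T> out = new ArrayList<>();
        for (T l : left) {
            Integer c = rightCounts.get(l);
            if (c != null && c > 0) {
                rightCounts.put(l, c - 1);
                out.add(l);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // {a, a, b} EXCEPT ALL {a} keeps one of the two a's.
        System.out.println(exceptAll(List.of("a", "a", "b"), List.of("a"))); // prints [a, b]
    }
}
```

This is why the join-based rewrite needs per-row occurrence counting rather than a plain existence check: duplicates on either side must be preserved up to the cancellation count.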
[jira] [Assigned] (SPARK-24926) Ensure numCores is used consistently in all netty configuration (driver and executors)
[ https://issues.apache.org/jira/browse/SPARK-24926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24926: Assignee: Apache Spark > Ensure numCores is used consistently in all netty configuration (driver and > executors) > -- > > Key: SPARK-24926 > URL: https://issues.apache.org/jira/browse/SPARK-24926 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Major > Labels: memory-analysis > > I think there may be some places where we're not passing the right number of > configured cores to netty -- in particular in driver mode, we're not > respecting "spark.driver.cores". This means that netty will be configured > based on the number of physical cores of the device, > instead of whatever resources spark requested from the cluster manager. It > controls both the number of threads netty uses *and* the number of arenas in > its memory pools. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24926) Ensure numCores is used consistently in all netty configuration (driver and executors)
[ https://issues.apache.org/jira/browse/SPARK-24926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24926: Assignee: (was: Apache Spark) > Ensure numCores is used consistently in all netty configuration (driver and > executors) > -- > > Key: SPARK-24926 > URL: https://issues.apache.org/jira/browse/SPARK-24926 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > I think there may be some places where we're not passing the right number of > configured cores to netty -- in particular in driver mode, we're not > respecting "spark.driver.cores". This means that netty will be configured > based on the number of physical cores of the device, > instead of whatever resources spark requested from the cluster manager. It > controls both the number of threads netty uses *and* the number of arenas in > its memory pools. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24926) Ensure numCores is used consistently in all netty configuration (driver and executors)
[ https://issues.apache.org/jira/browse/SPARK-24926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558688#comment-16558688 ] Apache Spark commented on SPARK-24926: -- User 'NiharS' has created a pull request for this issue: https://github.com/apache/spark/pull/21885 > Ensure numCores is used consistently in all netty configuration (driver and > executors) > -- > > Key: SPARK-24926 > URL: https://issues.apache.org/jira/browse/SPARK-24926 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > I think there may be some places where we're not passing the right number of > configured cores to netty -- in particular in driver mode, we're not > respecting "spark.driver.cores". This means that netty will be configured > based on the number of physical cores of the device, > instead of whatever resources spark requested from the cluster manager. It > controls both the number of threads netty uses *and* the number of arenas in > its memory pools. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
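The gist of the issue can be sketched with a hypothetical helper (not Spark code): the core count used to size netty thread pools and memory arenas should come from the requested configuration when present, and only fall back to the machine's physical core count otherwise.

```java
import java.util.Map;

// Illustrative sketch of the intended lookup order for netty sizing:
// configured cores (e.g. "spark.driver.cores") win over physical cores.
final class NettyCoreCount {
    static int effectiveCores(Map<String, String> conf, String key) {
        String v = conf.get(key);
        if (v != null) {
            return Integer.parseInt(v);
        }
        // Falling back to the physical core count of the host is exactly
        // the behavior the issue flags as wrong for driver mode, where the
        // cluster manager may have granted far fewer cores.
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println(effectiveCores(Map.of("spark.driver.cores", "2"), "spark.driver.cores")); // prints 2
    }
}
```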
[jira] [Assigned] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files
[ https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-24927: -- Assignee: Cheng Lian > The hadoop-provided profile doesn't play well with Snappy-compressed Parquet > files > -- > > Key: SPARK-24927 > URL: https://issues.apache.org/jira/browse/SPARK-24927 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.1, 2.3.2 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Major > > Reproduction: > {noformat} > wget > https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz > wget > https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz > tar xzf spark-2.3.1-bin-without-hadoop.tgz > tar xzf hadoop-2.7.3.tar.gz > export SPARK_DIST_CLASSPATH=$(hadoop-2.7.3/bin/hadoop classpath) > ./spark-2.3.1-bin-without-hadoop/bin/spark-shell --master local > ... > scala> > spark.range(1).repartition(1).write.mode("overwrite").parquet("file:///tmp/test.parquet") > {noformat} > Exception: > {noformat} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > at scala.Option.foreach(Option.scala:257) > at > 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194) > ... 69 more > Caused by: org.apache.spark.SparkException: Task failed while writing rows. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.UnsatisfiedLinkError: > org.xerial.snappy.SnappyNative.maxCompressedLength(I)I > at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method) > at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316) > at > org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67) > at > 
org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81) > at > org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92) > at > org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112) > at > org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93) > at > org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150) > at > org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238) > at > org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167) > at
[jira] [Commented] (SPARK-24882) separate responsibilities of the data source v2 read API
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558666#comment-16558666 ] Ryan Blue commented on SPARK-24882: --- [~cloud_fan], I'm adding some suggestions here because comments on the doc are good for discussion, but not really for longer content. I like the separation of the batch, micro-batch, and streaming classes. That works well. I also like the addition of the Metadata class, though I'd use a more specific name. There are a few naming changes I would make to be more specific and to preserve existing names or naming conventions: * Instead of ReaderProvider, I think we should use ReaderFactory because that name corresponds to the write path and accurately describes the class * I think we should continue to use InputPartition instead of InputSplit, even if we introduce a reader factory. (Probably also rename SplitReader to PartitionReader.) * Metadata isn't specific so I think we should use ScanConfig instead Next, instead of using mix-ins on Metadata / ScanConfig, I think a builder would help clarify the order of operations. If ScanConfig is mutable, then it could be passed to the other methods in different states. I'd rather use a Builder to make it Immutable. That way, implementations know that the Metadata / ScanConfig doesn't change between calls to estimateStatistics and getSplits so results can be cached. To make this work, Spark would provide a Builder interface with default methods that do nothing. To implement pushdown, users just need to implement the methods. This also allows us to add new pushdown methods (like pushLimit) without introducing new interfaces. I'd also like to see the classes reorganized a little to reduce the overall number of interfaces: Metadata / ScanConfig contains all of the state that the DataSourceReader used to hold. 
If the DataSourceReader has no state, then its methods should be provided by a single instance of the source instead. That would change the API to get rid of the Reader level and merge it into ReadSupport. Then ReadSupport would be used to create a ScanConfig, and BatchReadSupport (or similar) would be used to plan splits and get reader factories. I think this is easier for implementations.
{code:lang=java}
public interface ReadSupport {
  ScanConfig.Builder newScanBuilder();
}

public interface ReportsStatistics extends ReadSupport {
  Statistics estimateStatistics(ScanConfig config);
}

public interface BatchReadSupport extends ReadSupport {
  InputPartition[] getSplits(ScanConfig config);
  ReaderFactory readerFactory();
}

public interface MicroBatchReadSupport extends ReadSupport {
  InputPartition[] getSplits(ScanConfig config, Offset start, Offset end);
  Offset initialOffset();
  MicroBatchReaderFactory readerFactory();
}

public interface ContinuousReadSupport extends ReadSupport {
  InputPartition[] getSplits(ScanConfig config, Offset start);
  Offset initialOffset();
  ContinuousReaderFactory readerFactory();
}
{code}
Note that this change also cleans up the confusion around the use of Reader: the only Reader is a SplitReader that returns rows or row batches. I would keep the same structure that you have for micro-batch, continuous, and batch ReaderFactory and SplitReader. Here's a sketch of the ScanConfig and Builder I mentioned above:
{code:lang=java}
public interface ScanConfig {
  StructType schema();
  Filter[] pushedFilters();
  Expression[] pushedPredicates();

  // by default, the Builder doesn't push anything
  public interface Builder {
    Builder pushProjection(...);
    Builder pushFilters(...);
    default Builder pushPredicates(...) { return this; }
    Builder pushLimit(...);
    ScanConfig build();
  }
}
{code}
 > separate responsibilities of the data source v2 read API > > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > Data source V2 has been out for a while; see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 read API. For details, please see the attached google doc: > https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) --
[jira] [Commented] (SPARK-23683) FileCommitProtocol.instantiate to require 3-arg constructor for dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-23683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558660#comment-16558660 ] Steve Loughran commented on SPARK-23683: If it's a regression, you could argue for it > FileCommitProtocol.instantiate to require 3-arg constructor for dynamic > partition overwrite > --- > > Key: SPARK-23683 > URL: https://issues.apache.org/jira/browse/SPARK-23683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Major > Fix For: 2.4.0 > > > with SPARK-20236 {{FileCommitProtocol.instantiate()}} looks for a three > argument constructor, passing in the {{dynamicPartitionOverwrite}} parameter. > If there is no such constructor, it falls back to the classic two-arg one. > When {{InsertIntoHadoopFsRelationCommand}} passes down that > {{dynamicPartitionOverwrite}} flag to {{FileCommitProtocol.instantiate()}}, > it _assumes_ that the instantiated protocol supports the specific > requirements of dynamic partition overwrite. It does not notice when this > does not hold, and so the output generated may be incorrect. > Proposed: when dynamicPartitionOverwrite == true, require the protocol > implementation to have a 3-arg constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
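The constructor-selection logic described in the issue can be sketched with plain reflection. The names below are illustrative, not Spark's actual FileCommitProtocol code: try the three-argument constructor carrying the dynamicPartitionOverwrite flag, and only fall back to the two-argument form when the flag is false; when the flag is true, a missing three-argument constructor is an error rather than a silent downgrade.

```java
import java.lang.reflect.Constructor;

// Hypothetical sketch of "require a 3-arg constructor when
// dynamicPartitionOverwrite is requested".
final class CommitProtocolLoader {
    static Object instantiate(Class<?> clazz, String jobId, String path,
                              boolean dynamicPartitionOverwrite) throws Exception {
        try {
            Constructor<?> threeArg =
                clazz.getDeclaredConstructor(String.class, String.class, boolean.class);
            return threeArg.newInstance(jobId, path, dynamicPartitionOverwrite);
        } catch (NoSuchMethodException e) {
            if (dynamicPartitionOverwrite) {
                // The class cannot honor dynamic partition overwrite:
                // failing fast beats silently producing incorrect output.
                throw new IllegalArgumentException(
                    clazz + " lacks the 3-arg constructor required for dynamic partition overwrite");
            }
            Constructor<?> twoArg = clazz.getDeclaredConstructor(String.class, String.class);
            return twoArg.newInstance(jobId, path);
        }
    }

    public static void main(String[] args) throws Exception {
        // java.util.Locale merely stands in for a commit-protocol class
        // that only offers a (String, String) constructor.
        Object o = instantiate(java.util.Locale.class, "en", "US", false);
        System.out.println(o.getClass().getSimpleName()); // prints Locale
    }
}
```

This mirrors the proposed behavior: the fallback path remains available for ordinary writes, but requesting the flag without a constructor that can receive it is surfaced as an explicit failure.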
[jira] [Updated] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files
[ https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-24927: --- Description: Reproduction: {noformat} wget https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz tar xzf spark-2.3.1-bin-without-hadoop.tgz tar xzf hadoop-2.7.3.tar.gz export SPARK_DIST_CLASSPATH=$(hadoop-2.7.3/bin/hadoop classpath) ./spark-2.3.1-bin-without-hadoop/bin/spark-shell --master local ... scala> spark.range(1).repartition(1).write.mode("overwrite").parquet("file:///tmp/test.parquet") {noformat} Exception: {noformat} Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194) ... 69 more Caused by: org.apache.spark.SparkException: Task failed while writing rows. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method) at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316) at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67) at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81) at org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92) at org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112) at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93) at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150) at org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238) at 
org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121) at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167) at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109) at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:396) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFo
[jira] [Updated] (SPARK-24934) Complex type and binary type in in-memory partition pruning does not work due to missing upper/lower bounds cases
[ https://issues.apache.org/jira/browse/SPARK-24934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24934: - Summary: Complex type and binary type in in-memory partition pruning does not work due to missing upper/lower bounds cases (was: Should handle missing upper/lower bounds cases in in-memory partition pruning) > Complex type and binary type in in-memory partition pruning does not work due > to missing upper/lower bounds cases > - > > Key: SPARK-24934 > URL: https://issues.apache.org/jira/browse/SPARK-24934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > For example, if an array column is used (where the lower and upper bounds for its > column batch are {{null}}), it looks like it wrongly filters all data out: > {code} > scala> import org.apache.spark.sql.functions > import org.apache.spark.sql.functions > scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") > df: org.apache.spark.sql.DataFrame = [arrayCol: array<string>] > scala> > df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b")))).show() > ++ > |arrayCol| > ++ > | [a, b]| > ++ > scala> > df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b")))).show() > ++ > |arrayCol| > ++ > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
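The underlying pitfall can be reduced to a few lines (an illustrative sketch, not Spark internals): when a cached column batch has no min/max statistics, as for complex and binary types, pruning must conservatively keep the batch rather than treat the comparison as false.

```java
// Sketch of stats-based batch pruning: null bounds mean "unknown",
// so the only safe answer is "the batch might match, keep it".
// Treating null bounds as a failed comparison filters every batch
// out, which is the bug described above.
final class BatchPruning {
    // Returns true if a batch with the given bounds may contain rows
    // whose column value equals target.
    static boolean mightContain(Integer lower, Integer upper, int target) {
        if (lower == null || upper == null) {
            return true; // unknown bounds: cannot prune safely
        }
        return lower <= target && target <= upper;
    }

    public static void main(String[] args) {
        System.out.println(mightContain(null, null, 5)); // prints true
        System.out.println(mightContain(1, 3, 5));       // prints false
    }
}
```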
[jira] [Assigned] (SPARK-24937) Datasource partition table should load empty partitions
[ https://issues.apache.org/jira/browse/SPARK-24937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24937: Assignee: (was: Apache Spark) > Datasource partition table should load empty partitions > --- > > Key: SPARK-24937 > URL: https://issues.apache.org/jira/browse/SPARK-24937 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:sql} > spark-sql> CREATE TABLE tbl AS SELECT 1; > spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > > USING parquet > > PARTITIONED BY (day, hour); > spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > spark-sql> SHOW PARTITIONS tbl1; > spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > > PARTITIONED BY (day STRING, hour STRING); > 18/07/26 22:49:20 WARN HiveMetaStore: Location: > file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external > table:tbl2 > spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 > 18/07/26 22:49:36 WARN log: Updated size to 0 > spark-sql> SHOW PARTITIONS tbl2; > day=2018-07-25/hour=01 > spark-sql> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24937) Datasource partition table should load empty partitions
[ https://issues.apache.org/jira/browse/SPARK-24937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24937: Assignee: Apache Spark > Datasource partition table should load empty partitions > --- > > Key: SPARK-24937 > URL: https://issues.apache.org/jira/browse/SPARK-24937 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > How to reproduce: > {code:sql} > spark-sql> CREATE TABLE tbl AS SELECT 1; > spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > > USING parquet > > PARTITIONED BY (day, hour); > spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > spark-sql> SHOW PARTITIONS tbl1; > spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > > PARTITIONED BY (day STRING, hour STRING); > 18/07/26 22:49:20 WARN HiveMetaStore: Location: > file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external > table:tbl2 > spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 > 18/07/26 22:49:36 WARN log: Updated size to 0 > spark-sql> SHOW PARTITIONS tbl2; > day=2018-07-25/hour=01 > spark-sql> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24937) Datasource partition table should load empty partitions
[ https://issues.apache.org/jira/browse/SPARK-24937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558438#comment-16558438 ] Apache Spark commented on SPARK-24937: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/21883 > Datasource partition table should load empty partitions > --- > > Key: SPARK-24937 > URL: https://issues.apache.org/jira/browse/SPARK-24937 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:sql} > spark-sql> CREATE TABLE tbl AS SELECT 1; > spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > > USING parquet > > PARTITIONED BY (day, hour); > spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > spark-sql> SHOW PARTITIONS tbl1; > spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > > PARTITIONED BY (day STRING, hour STRING); > 18/07/26 22:49:20 WARN HiveMetaStore: Location: > file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external > table:tbl2 > spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 > 18/07/26 22:49:36 WARN log: Updated size to 0 > spark-sql> SHOW PARTITIONS tbl2; > day=2018-07-25/hour=01 > spark-sql> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24937) Datasource partition table should load empty partitions
[ https://issues.apache.org/jira/browse/SPARK-24937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24937: Description: How to reproduce: {code:sql} spark-sql> CREATE TABLE tbl AS SELECT 1; spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > USING parquet > PARTITIONED BY (day, hour); spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') SELECT * FROM tbl where 1=0; spark-sql> SHOW PARTITIONS tbl1; spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > PARTITIONED BY (day STRING, hour STRING); 18/07/26 22:49:20 WARN HiveMetaStore: Location: file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external table:tbl2 spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') SELECT * FROM tbl where 1=0; 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 18/07/26 22:49:36 WARN log: Updated size to 0 spark-sql> SHOW PARTITIONS tbl2; day=2018-07-25/hour=01 spark-sql> {code} was: How to reproduce: {code:sql} spark-sql> CREATE TABLE tbl AS SELECT 1; 18/07/26 22:48:11 WARN HiveMetaStore: Location: file:/Users/yumwang/tmp/spark/spark-warehouse/tbl specified for non-external table:tbl 18/07/26 22:48:15 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > USING parquet > PARTITIONED BY (day, hour); spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') SELECT * FROM tbl where 1=0; spark-sql> SHOW PARTITIONS tbl1; spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > PARTITIONED BY (day STRING, hour STRING); 18/07/26 22:49:20 WARN HiveMetaStore: Location: file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external table:tbl2 spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') SELECT * FROM tbl where 1=0; 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 18/07/26 22:49:36 WARN log: Updated 
size to 0 spark-sql> SHOW PARTITIONS tbl2; day=2018-07-25/hour=01 spark-sql> {code} > Datasource partition table should load empty partitions > --- > > Key: SPARK-24937 > URL: https://issues.apache.org/jira/browse/SPARK-24937 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:sql} > spark-sql> CREATE TABLE tbl AS SELECT 1; > spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > > USING parquet > > PARTITIONED BY (day, hour); > spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > spark-sql> SHOW PARTITIONS tbl1; > spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > > PARTITIONED BY (day STRING, hour STRING); > 18/07/26 22:49:20 WARN HiveMetaStore: Location: > file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external > table:tbl2 > spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 > 18/07/26 22:49:36 WARN log: Updated size to 0 > spark-sql> SHOW PARTITIONS tbl2; > day=2018-07-25/hour=01 > spark-sql> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24937) Datasource partition table should load empty partitions
[ https://issues.apache.org/jira/browse/SPARK-24937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558386#comment-16558386 ] Yuming Wang commented on SPARK-24937: - I'm working on it. > Datasource partition table should load empty partitions > --- > > Key: SPARK-24937 > URL: https://issues.apache.org/jira/browse/SPARK-24937 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:sql} > spark-sql> CREATE TABLE tbl AS SELECT 1; > 18/07/26 22:48:11 WARN HiveMetaStore: Location: > file:/Users/yumwang/tmp/spark/spark-warehouse/tbl specified for non-external > table:tbl > 18/07/26 22:48:15 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > > USING parquet > > PARTITIONED BY (day, hour); > spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > spark-sql> SHOW PARTITIONS tbl1; > spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > > PARTITIONED BY (day STRING, hour STRING); > 18/07/26 22:49:20 WARN HiveMetaStore: Location: > file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external > table:tbl2 > spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 > 18/07/26 22:49:36 WARN log: Updated size to 0 > spark-sql> SHOW PARTITIONS tbl2; > day=2018-07-25/hour=01 > spark-sql> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24937) Datasource partition table should load empty partitions
Yuming Wang created SPARK-24937: --- Summary: Datasource partition table should load empty partitions Key: SPARK-24937 URL: https://issues.apache.org/jira/browse/SPARK-24937 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Yuming Wang {code:sql} spark-sql> CREATE TABLE tbl AS SELECT 1; 18/07/26 22:48:11 WARN HiveMetaStore: Location: file:/Users/yumwang/tmp/spark/spark-warehouse/tbl specified for non-external table:tbl 18/07/26 22:48:15 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > USING parquet > PARTITIONED BY (day, hour); spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') SELECT * FROM tbl where 1=0; spark-sql> SHOW PARTITIONS tbl1; spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > PARTITIONED BY (day STRING, hour STRING); 18/07/26 22:49:20 WARN HiveMetaStore: Location: file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external table:tbl2 spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') SELECT * FROM tbl where 1=0; 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 18/07/26 22:49:36 WARN log: Updated size to 0 spark-sql> SHOW PARTITIONS tbl2; day=2018-07-25/hour=01 spark-sql> {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24937) Datasource partition table should load empty partitions
[ https://issues.apache.org/jira/browse/SPARK-24937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24937: Description: How to reproduce: {code:sql} spark-sql> CREATE TABLE tbl AS SELECT 1; 18/07/26 22:48:11 WARN HiveMetaStore: Location: file:/Users/yumwang/tmp/spark/spark-warehouse/tbl specified for non-external table:tbl 18/07/26 22:48:15 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > USING parquet > PARTITIONED BY (day, hour); spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') SELECT * FROM tbl where 1=0; spark-sql> SHOW PARTITIONS tbl1; spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > PARTITIONED BY (day STRING, hour STRING); 18/07/26 22:49:20 WARN HiveMetaStore: Location: file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external table:tbl2 spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') SELECT * FROM tbl where 1=0; 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 18/07/26 22:49:36 WARN log: Updated size to 0 spark-sql> SHOW PARTITIONS tbl2; day=2018-07-25/hour=01 spark-sql> {code} was: {code:sql} spark-sql> CREATE TABLE tbl AS SELECT 1; 18/07/26 22:48:11 WARN HiveMetaStore: Location: file:/Users/yumwang/tmp/spark/spark-warehouse/tbl specified for non-external table:tbl 18/07/26 22:48:15 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > USING parquet > PARTITIONED BY (day, hour); spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') SELECT * FROM tbl where 1=0; spark-sql> SHOW PARTITIONS tbl1; spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > PARTITIONED BY (day STRING, hour STRING); 18/07/26 22:49:20 WARN HiveMetaStore: Location: file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external 
table:tbl2 spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') SELECT * FROM tbl where 1=0; 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 18/07/26 22:49:36 WARN log: Updated size to 0 spark-sql> SHOW PARTITIONS tbl2; day=2018-07-25/hour=01 spark-sql> {code} > Datasource partition table should load empty partitions > --- > > Key: SPARK-24937 > URL: https://issues.apache.org/jira/browse/SPARK-24937 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:sql} > spark-sql> CREATE TABLE tbl AS SELECT 1; > 18/07/26 22:48:11 WARN HiveMetaStore: Location: > file:/Users/yumwang/tmp/spark/spark-warehouse/tbl specified for non-external > table:tbl > 18/07/26 22:48:15 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > > USING parquet > > PARTITIONED BY (day, hour); > spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > spark-sql> SHOW PARTITIONS tbl1; > spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > > PARTITIONED BY (day STRING, hour STRING); > 18/07/26 22:49:20 WARN HiveMetaStore: Location: > file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external > table:tbl2 > spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 > 18/07/26 22:49:36 WARN log: Updated size to 0 > spark-sql> SHOW PARTITIONS tbl2; > day=2018-07-25/hour=01 > spark-sql> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24934) Should handle missing upper/lower bounds cases in in-memory partition pruning
[ https://issues.apache.org/jira/browse/SPARK-24934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558364#comment-16558364 ] Hyukjin Kwon commented on SPARK-24934: -- np! BTW, the workaround will be turning off {{spark.sql.inMemoryColumnarStorage.partitionPruning}} although it'd be less performant. > Should handle missing upper/lower bounds cases in in-memory partition pruning > - > > Key: SPARK-24934 > URL: https://issues.apache.org/jira/browse/SPARK-24934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > For example, if array is used (where the lower and upper bounds for its > column batch are {{null}})), it looks wrongly filtering all data out: > {code} > scala> import org.apache.spark.sql.functions > import org.apache.spark.sql.functions > scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") > df: org.apache.spark.sql.DataFrame = [arrayCol: array] > scala> > df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > | [a, b]| > ++ > scala> > df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
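The bug and the workaround discussed above come down to how min/max partition pruning treats cached column batches that carry no statistics. Below is a minimal sketch in plain Python (not Spark's actual code; `batch_may_match` is a hypothetical helper) of the safe behaviour:

```python
# Illustrative sketch: min/max pruning over cached column batches.
# For types such as arrays, no ordering-based bounds are collected,
# so a batch's stats may be missing entirely.

def batch_may_match(stats, value):
    """Return True if a batch might contain `value`.

    `stats` is a (lower, upper) tuple, or None when bounds are unavailable.
    The safe behaviour is to keep the batch when bounds are missing and let
    the filter re-check each row; pruning on missing bounds drops valid data.
    """
    if stats is None:
        return True  # missing bounds => cannot prune, must scan the batch
    lower, upper = stats
    return lower <= value <= upper

# A batch of array-typed rows carries no bounds, so it must be kept:
assert batch_may_match(None, ["a", "b"]) is True
# A numeric batch with bounds (0, 9) can safely prune a lookup for 42:
assert batch_may_match((0, 9), 42) is False
```

The reported bug corresponds to the unsafe variant: treating missing bounds as "no match" prunes every array-typed batch, which is why the cached query above returns no rows.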
[jira] [Created] (SPARK-24936) Better error message when trying a shuffle fetch over 2 GB
Imran Rashid created SPARK-24936: Summary: Better error message when trying a shuffle fetch over 2 GB Key: SPARK-24936 URL: https://issues.apache.org/jira/browse/SPARK-24936 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 2.4.0 Reporter: Imran Rashid After SPARK-24297, Spark will try to fetch shuffle blocks to disk if they're over 2 GB. However, this will fail with an external shuffle service running Spark < 2.2, with an unhelpful error message like: {noformat} 18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 (TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1 , xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message= org.apache.spark.shuffle.FetchFailedException: java.lang.UnsupportedOperationException at org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60) at org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136) ... {noformat} We can't do anything to make the shuffle succeed in this situation, but we should fail with a better error message. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24934) Should handle missing upper/lower bounds cases in in-memory partition pruning
[ https://issues.apache.org/jira/browse/SPARK-24934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558360#comment-16558360 ] David Vogelbacher commented on SPARK-24934: --- Thanks for opening and making the pr [~hyukjin.kwon]! > Should handle missing upper/lower bounds cases in in-memory partition pruning > - > > Key: SPARK-24934 > URL: https://issues.apache.org/jira/browse/SPARK-24934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > For example, if array is used (where the lower and upper bounds for its > column batch are {{null}})), it looks wrongly filtering all data out: > {code} > scala> import org.apache.spark.sql.functions > import org.apache.spark.sql.functions > scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") > df: org.apache.spark.sql.DataFrame = [arrayCol: array] > scala> > df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > | [a, b]| > ++ > scala> > df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24935) Problem with Executing Hive UDF's from Spark 2.2 Onwards
Parth Gandhi created SPARK-24935: Summary: Problem with Executing Hive UDF's from Spark 2.2 Onwards Key: SPARK-24935 URL: https://issues.apache.org/jira/browse/SPARK-24935 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1, 2.2.0 Reporter: Parth Gandhi A user of the sketches library (https://github.com/DataSketches/sketches-hive) reported an issue with the HLL Sketch Hive UDAF that seems to be a bug in Spark or Hive. Their code runs fine in 2.1 but has an issue from 2.2 onwards. For more details on the issue, you can refer to the discussion in the sketches-user list: [https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ] On further debugging, we figured out that from 2.2 onwards, Spark's Hive UDAF support provides partial aggregation and has removed the functionality that supported complete-mode aggregation (refer to https://issues.apache.org/jira/browse/SPARK-19060 and https://issues.apache.org/jira/browse/SPARK-18186). Thus, instead of the expected update method being called, the merge method is called here ([https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56]), which throws the exception described in the forums above. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
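The mode switch described above can be sketched in plain Python (illustrative only; this is not the Hive or Spark API, and `SumEvaluator` is a stand-in for the sketch evaluator). In complete mode the engine feeds raw rows to `update()`; with partial aggregation, the map side calls `update()` and the reduce side receives only partial buffers via `merge()`. An evaluator whose `merge()` assumes a different input shape than `update()` breaks when the engine changes modes:

```python
# Illustrative sketch of complete-mode vs partial aggregation.

class SumEvaluator:
    def __init__(self):
        self.buffer = 0
    def update(self, row):      # consumes a raw input row
        self.buffer += row
    def merge(self, partial):   # consumes another evaluator's partial buffer
        self.buffer += partial

def complete_mode(rows):
    # One evaluator sees every raw row; merge() is never called.
    ev = SumEvaluator()
    for r in rows:
        ev.update(r)
    return ev.buffer

def partial_mode(rows, n_partitions=2):
    # Map side: one evaluator per partition produces a partial buffer.
    chunks = [rows[i::n_partitions] for i in range(n_partitions)]
    partials = []
    for chunk in chunks:
        ev = SumEvaluator()
        for r in chunk:
            ev.update(r)
        partials.append(ev.buffer)
    # Reduce side: only merge() is called, never update().
    final = SumEvaluator()
    for p in partials:
        final.merge(p)
    return final.buffer

assert complete_mode([1, 2, 3, 4]) == partial_mode([1, 2, 3, 4]) == 10
```

For a simple sum the two paths agree, but an evaluator whose `merge()` expects a serialized sketch rather than raw values (as in the UDAF above) throws once the engine starts routing data through `merge()`.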
[jira] [Commented] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558331#comment-16558331 ] Thomas Graves commented on SPARK-24918: --- I think this is a good idea. I thought I had seen a Jira around this before but couldn't find it. It might have been a task-run pre-hook. It's also a good question what we tie into it. I haven't looked at the details of your spark-memory debugging module; what does that class need? I could see people doing all sorts of things, from checking node health to preloading something, etc., so we should definitely think about the possibilities here and what we may or may not want to allow. > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. It's hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so you don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your Spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding Spark). > I think one tricky part here is just deciding the API, and how it's versioned. > Does it just get created when the executor starts, and that's it?
Or does it > get more specific events, like task start, task end, etc.? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, it's still good enough). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
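The "base class with no-op implementations" idea floated in the discussion above can be sketched as follows. This is plain Python with hypothetical names, not Spark's JVM-side API; the point is only the versioning property: because every hook defaults to a no-op, adding a new event later does not break existing plugins.

```python
# Illustrative sketch of an executor plugin base class (hypothetical API).

class ExecutorPlugin:
    """All hooks default to no-ops; subclasses override only what they need."""
    def init(self):                 # called once when the executor starts
        pass
    def task_start(self, task_id):  # per-task event, added without breaking anyone
        pass
    def task_end(self, task_id):
        pass
    def shutdown(self):             # called when the executor exits
        pass

class TaskCounter(ExecutorPlugin):
    """Example plugin: overrides only the hook it cares about."""
    def __init__(self):
        self.started = 0
    def task_start(self, task_id):
        self.started += 1

plugin = TaskCounter()
plugin.init()          # inherited no-op, safe for the executor to call
plugin.task_start(1)
plugin.task_start(2)
plugin.shutdown()      # inherited no-op
assert plugin.started == 2
```

The same effect could be achieved by explicitly versioning the interface instead, at the cost of more ceremony for plugin authors.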
[jira] [Assigned] (SPARK-24934) Should handle missing upper/lower bounds cases in in-memory partition pruning
[ https://issues.apache.org/jira/browse/SPARK-24934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24934: Assignee: Apache Spark > Should handle missing upper/lower bounds cases in in-memory partition pruning > - > > Key: SPARK-24934 > URL: https://issues.apache.org/jira/browse/SPARK-24934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > For example, if array is used (where the lower and upper bounds for its > column batch are {{null}})), it looks wrongly filtering all data out: > {code} > scala> import org.apache.spark.sql.functions > import org.apache.spark.sql.functions > scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") > df: org.apache.spark.sql.DataFrame = [arrayCol: array] > scala> > df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > | [a, b]| > ++ > scala> > df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24934) Should handle missing upper/lower bounds cases in in-memory partition pruning
[ https://issues.apache.org/jira/browse/SPARK-24934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24934: Assignee: (was: Apache Spark) > Should handle missing upper/lower bounds cases in in-memory partition pruning > - > > Key: SPARK-24934 > URL: https://issues.apache.org/jira/browse/SPARK-24934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > For example, if array is used (where the lower and upper bounds for its > column batch are {{null}})), it looks wrongly filtering all data out: > {code} > scala> import org.apache.spark.sql.functions > import org.apache.spark.sql.functions > scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") > df: org.apache.spark.sql.DataFrame = [arrayCol: array] > scala> > df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > | [a, b]| > ++ > scala> > df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24934) Should handle missing upper/lower bounds cases in in-memory partition pruning
[ https://issues.apache.org/jira/browse/SPARK-24934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558295#comment-16558295 ] Apache Spark commented on SPARK-24934: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/21882 > Should handle missing upper/lower bounds cases in in-memory partition pruning > - > > Key: SPARK-24934 > URL: https://issues.apache.org/jira/browse/SPARK-24934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > For example, if array is used (where the lower and upper bounds for its > column batch are {{null}})), it looks wrongly filtering all data out: > {code} > scala> import org.apache.spark.sql.functions > import org.apache.spark.sql.functions > scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") > df: org.apache.spark.sql.DataFrame = [arrayCol: array] > scala> > df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > | [a, b]| > ++ > scala> > df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24934) Should handle missing upper/lower bounds cases in in-memory partition pruning
Hyukjin Kwon created SPARK-24934: Summary: Should handle missing upper/lower bounds cases in in-memory partition pruning Key: SPARK-24934 URL: https://issues.apache.org/jira/browse/SPARK-24934 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Hyukjin Kwon For example, if an array is used (where the lower and upper bounds for its column batch are {{null}}), it looks like it wrongly filters all data out: {code} scala> import org.apache.spark.sql.functions import org.apache.spark.sql.functions scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") df: org.apache.spark.sql.DataFrame = [arrayCol: array<string>] scala> df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), functions.lit("b")))).show() +--------+ |arrayCol| +--------+ | [a, b]| +--------+ scala> df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), functions.lit("b")))).show() +--------+ |arrayCol| +--------+ +--------+ {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12911) Cacheing a dataframe causes array comparisons to fail (in filter / where) after 1.6
[ https://issues.apache.org/jira/browse/SPARK-12911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558291#comment-16558291 ] Hyukjin Kwon commented on SPARK-12911: -- I opened - SPARK-24934 > Cacheing a dataframe causes array comparisons to fail (in filter / where) > after 1.6 > --- > > Key: SPARK-12911 > URL: https://issues.apache.org/jira/browse/SPARK-12911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.6.0 >Reporter: Jesse English >Priority: Major > > When doing a *where* operation on a dataframe and testing for equality on an > array type, after 1.6 no valid comparisons are made if the dataframe has been > cached. If it has not been cached, the results are as expected. > This appears to be related to the underlying unsafe array data types. > {code:title=test.scala|borderStyle=solid} > test("test array comparison") { > val vectors: Vector[Row] = Vector( > Row.fromTuple("id_1" -> Array(0L, 2L)), > Row.fromTuple("id_2" -> Array(0L, 5L)), > Row.fromTuple("id_3" -> Array(0L, 9L)), > Row.fromTuple("id_4" -> Array(1L, 0L)), > Row.fromTuple("id_5" -> Array(1L, 8L)), > Row.fromTuple("id_6" -> Array(2L, 4L)), > Row.fromTuple("id_7" -> Array(5L, 6L)), > Row.fromTuple("id_8" -> Array(6L, 2L)), > Row.fromTuple("id_9" -> Array(7L, 0L)) > ) > val data: RDD[Row] = sc.parallelize(vectors, 3) > val schema = StructType( > StructField("id", StringType, false) :: > StructField("point", DataTypes.createArrayType(LongType, false), > false) :: > Nil > ) > val sqlContext = new SQLContext(sc) > val dataframe = sqlContext.createDataFrame(data, schema) > val targetPoint:Array[Long] = Array(0L,9L) > //Cacheing is the trigger to cause the error (no cacheing causes no error) > dataframe.cache() > //This is the line where it fails > //java.util.NoSuchElementException: next on empty iterator > //However we know that there is a valid match > val targetRow = 
dataframe.where(dataframe("point") === > array(targetPoint.map(value => lit(value)): _*)).first() > assert(targetRow != null) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24928) spark sql cross join running time too long
[ https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558287#comment-16558287 ] Marco Gaido commented on SPARK-24928: - The affected version is pretty old; can you check a newer version? > spark sql cross join running time too long > -- > > Key: SPARK-24928 > URL: https://issues.apache.org/jira/browse/SPARK-24928 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 1.6.2 >Reporter: LIFULONG >Priority: Minor > > Spark SQL running time is too long even though the input left and right tables are > small HDFS text-format data. > The SQL is: select * from t1 cross join t2 > t1 has 49 lines and three columns > t2 has 1 line and one column only > It runs for more than 30 minutes and then fails. > > > Spark's CartesianRDD also has the same problem; example test code is: > val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b") // 1 line, > 1 column > val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b") // 49 > lines, 3 columns > val cartesian = new CartesianRDD(sc, twos, ones) > cartesian.count() > This runs for more than 5 minutes, while CartesianRDD(sc, ones, twos) takes > less than 10 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
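One plausible explanation for the asymmetry reported above is that a nested-loop cartesian product re-scans the inner dataset once per outer element, so with an expensive source such as HDFS, putting the larger dataset on the outer side multiplies the number of inner re-reads. A plain-Python sketch of the effect (illustrative only, not Spark's CartesianRDD implementation):

```python
# Illustrative sketch: a nested-loop cartesian product re-opens the inner
# source once per outer element. Swapping the sides changes how many times
# the (potentially expensive) inner read happens.

def cartesian(outer, read_inner, counter):
    out = []
    for o in outer:
        counter["inner_scans"] += 1  # inner source re-read for each outer row
        for i in read_inner():
            out.append((o, i))
    return out

small = [1]              # 1 row, like the 1-line file
big = list(range(49))    # 49 rows, like the 49-line file

c1 = {"inner_scans": 0}
cartesian(big, lambda: small, c1)   # big dataset on the outer side
c2 = {"inner_scans": 0}
cartesian(small, lambda: big, c2)   # small dataset on the outer side

assert c1["inner_scans"] == 49      # inner source opened 49 times
assert c2["inner_scans"] == 1       # inner source opened once
```

Both orderings produce the same 49 result pairs, but if each inner scan involves reopening an HDFS file, the first ordering pays that cost 49 times.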
[jira] [Created] (SPARK-24933) SinkProgress should report written rows
Vaclav Kosar created SPARK-24933: Summary: SinkProgress should report written rows Key: SPARK-24933 URL: https://issues.apache.org/jira/browse/SPARK-24933 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.3.1 Reporter: Vaclav Kosar SinkProgress should report properties similar to SourceProgress, as long as they are available for a given Sink. The count of written rows is a metric available for all Sinks. Since the relevant progress information is with respect to committed rows, the ideal object to carry this info is WriterCommitMessage. For brevity, the implementation will focus only on Sinks with API V2 and on Micro-Batch mode. Implementation for Continuous mode will be provided at a later date. h4. Before {code} {"description":"org.apache.spark.sql.kafka010.KafkaSourceProvider@3c0bd317"} {code} h4. After {code} {"description":"org.apache.spark.sql.kafka010.KafkaSourceProvider@3c0bd317","numOutputRows":5000} {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
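The proposal above can be sketched roughly as follows. This is plain Python with hypothetical names, not Spark's API; in particular, the `num_rows` field is an assumption for illustration (the actual design would have each write task's commit message carry its committed row count, which the driver then sums into the reported sink progress):

```python
# Illustrative sketch: summing per-task commit messages into sink progress.
from dataclasses import dataclass

@dataclass
class WriterCommitMessage:
    num_rows: int  # rows committed by one write task (hypothetical field)

def sink_progress(description, commit_messages):
    """Build the progress dict exposed for a sink after a micro-batch."""
    return {
        "description": description,
        "numOutputRows": sum(m.num_rows for m in commit_messages),
    }

# Two write tasks commit 2000 and 3000 rows; the sink reports 5000 total,
# matching the "After" example above.
msgs = [WriterCommitMessage(2000), WriterCommitMessage(3000)]
progress = sink_progress("KafkaSourceProvider@3c0bd317", msgs)
assert progress["numOutputRows"] == 5000
```

Carrying the count in the commit message keeps the metric tied to committed rows only, so rows written by failed or speculative tasks are never counted.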
[jira] [Created] (SPARK-24932) Allow update mode for streaming queries with join
Eric Fu created SPARK-24932: --- Summary: Allow update mode for streaming queries with join Key: SPARK-24932 URL: https://issues.apache.org/jira/browse/SPARK-24932 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.3.1, 2.3.0 Reporter: Eric Fu In issue SPARK-19140 we supported update output mode for non-aggregation streaming queries. This should also be applied to streaming joins to keep the semantics consistent. PS. The streaming join feature was added after SPARK-19140.
[jira] [Updated] (SPARK-24931) CoarseGrainedExecutorBackend send wrong 'Reason' when executor exits which leading to job failed.
[ https://issues.apache.org/jira/browse/SPARK-24931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ice bai updated SPARK-24931: Priority: Major (was: Blocker) > CoarseGrainedExecutorBackend send wrong 'Reason' when executor exits which > leading to job failed. > - > > Key: SPARK-24931 > URL: https://issues.apache.org/jira/browse/SPARK-24931 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: ice bai >Priority: Major > > When an executor is lost for some reason (e.g. unable to register with the external > shuffle server), CoarseGrainedExecutorBackend will send a RemoveExecutor event > with an 'ExecutorLossReason'. But this will cause TaskSetManager to run > handleFailedTask with exitCausedByApp=true, which is not correct.
[jira] [Updated] (SPARK-24931) CoarseGrainedExecutorBackend send wrong 'Reason' when executor exits which leading to job failed.
[ https://issues.apache.org/jira/browse/SPARK-24931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ice bai updated SPARK-24931: Summary: CoarseGrainedExecutorBackend send wrong 'Reason' when executor exits which leading to job failed. (was: ExecutorBackend send wrong Reason when executor exits) > CoarseGrainedExecutorBackend send wrong 'Reason' when executor exits which > leading to job failed. > - > > Key: SPARK-24931 > URL: https://issues.apache.org/jira/browse/SPARK-24931 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: ice bai >Priority: Blocker > > When an executor is lost for some reason (e.g. unable to register with the external > shuffle server), CoarseGrainedExecutorBackend will send a RemoveExecutor event > with an 'ExecutorLossReason'. But this will cause TaskSetManager to run > handleFailedTask with exitCausedByApp=true, which is not correct.
[jira] [Created] (SPARK-24931) ExecutorBackend send wrong Reason when executor exits
ice bai created SPARK-24931: --- Summary: ExecutorBackend send wrong Reason when executor exits Key: SPARK-24931 URL: https://issues.apache.org/jira/browse/SPARK-24931 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: ice bai When an executor is lost for some reason (e.g. unable to register with the external shuffle server), CoarseGrainedExecutorBackend will send a RemoveExecutor event with an 'ExecutorLossReason'. But this will cause TaskSetManager to run handleFailedTask with exitCausedByApp=true, which is not correct.
[jira] [Updated] (SPARK-24527) select column alias should support quotation marks
[ https://issues.apache.org/jira/browse/SPARK-24527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ice bai updated SPARK-24527: Description: It fails when a user uses spark-sql or the SQL API to select some columns with a quoted alias, but Hive is ok. Such as: select 'name' as 'nm'; select 'name' as "nm"; was: It will be failed when user use spark-sql or sql API to select come columns with quoted alias, but Hive is well. Such as : select 'name' as 'nm'; select 'name' as "nm"; > select column alias should support quotation marks > -- > > Key: SPARK-24527 > URL: https://issues.apache.org/jira/browse/SPARK-24527 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: ice bai >Priority: Minor > > It fails when a user uses spark-sql or the SQL API to select some columns > with a quoted alias, but Hive is ok. Such as: > select 'name' as 'nm'; > select 'name' as "nm";
[jira] [Updated] (SPARK-24647) Sink Should Return Writen Offsets For ProgressReporting
[ https://issues.apache.org/jira/browse/SPARK-24647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vaclav Kosar updated SPARK-24647: - Description: To be able to track data lineage for Structured Streaming (I intend to implement this for the open-source project Spline), the monitoring needs to be able to track not only where the data was read from but also where the results were written to. To my knowledge, this could best be implemented by monitoring {{StreamingQueryProgress}}. However, written data offsets are currently not available on the {{Sink}} or {{StreamWriter}} interface. Implementing this as proposed would also bring symmetry to the {{StreamingQueryProgress}} fields sources and sink. *Similar Proposals* Made in the following JIRAs. These would not be sufficient for lineage tracking. * https://issues.apache.org/jira/browse/SPARK-18258 * https://issues.apache.org/jira/browse/SPARK-21313 *Current State* * Method {{Sink#addBatch}} returns {{Unit}}. * Object {{WriterCommitMessage}} does not carry any progress information about committed rows. * {{StreamingQueryProgress}} reports {{offsetSeq}} start and end in the {{sourceProgress}} value, but {{sinkProgress}} only calls the {{toString}} method. {code:java} "sources" : [ { "description" : "KafkaSource[Subscribe[test-topic]]", "startOffset" : null, "endOffset" : { "test-topic" : { "0" : 5000 }}, "numInputRows" : 5000, "processedRowsPerSecond" : 645.3278265358803 } ], "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f" } {code} *Proposed State* * Implement support only for v2 sinks, as those are to be used in the future. * {{WriterCommitMessage}} to hold optional min and max offset information of committed rows; e.g. Kafka does this by returning a {{RecordMetadata}} object from the {{send}} method. * {{StreamingQueryProgress}} to incorporate {{sinkProgress}} in a similar fashion to {{sourceProgress}}. 
{code:java} "sources" : [ { "description" : "KafkaSource[Subscribe[test-topic]]", "startOffset" : null, "endOffset" : { "test-topic" : { "0" : 5000 }}, "numInputRows" : 5000, "processedRowsPerSecond" : 645.3278265358803 } ], "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f", "startOffset" : null, "endOffset" : { "sinkTopic": { "0": 333 }} } {code} *Implementation* * PR submitters: Me and [~wajda], as soon as the prerequisite JIRA is merged. * {{Sinks}}: Modify all v2 sinks to conform to a new interface or return dummy values. * {{ProgressReporter}}: Merge offsets from different batches properly, similarly to how it is done for sources. was: To be able to track data lineage for Structured Streaming (I intend to implement this to Open Source Project Spline), the monitoring needs to be able to not only to track where the data was read from but also where results were written to. This could be to my knowledge best implemented using monitoring {{StreamingQueryProgress}}. However currently batch data offsets are not available on {{Sink}} interface. Implementing as proposed would also bring symmetry to {{StreamingQueryProgress}} fields sources and sink. *Similar Proposals* Made in following jiras. These would not be sufficient for lineage tracking. * https://issues.apache.org/jira/browse/SPARK-18258 * https://issues.apache.org/jira/browse/SPARK-21313 *Current State* * Method {{Sink#addBatch}} returns {{Unit}}. * {{StreamingQueryProgress}} reports {{offsetSeq}} start and end using {{sourceProgress}} value but {{sinkProgress}} only calls {{toString}} method. 
{code:java} "sources" : [ { "description" : "KafkaSource[Subscribe[test-topic]]", "startOffset" : null, "endOffset" : { "test-topic" : { "0" : 5000 }}, "numInputRows" : 5000, "processedRowsPerSecond" : 645.3278265358803 } ], "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f" } {code} *Proposed State* * {{Sink#addBatch}} to return {{OffsetSeq}} or {{StreamProgress}} specifying offsets of the written batch, e.g. Kafka does it by returning {{RecordMetadata}} object from {{send}} method. * {{StreamingQueryProgress}} incorporate {{sinkProgress}} in similar fashion as {{sourceProgress}}. {code:java} "sources" : [ { "description" : "KafkaSource[Subscribe[test-topic]]", "startOffset" : null, "endOffset" : { "test-topic" : { "0" : 5000 }}, "numInputRows" : 5000, "processedRowsPerSecond" : 645.3278265358803 } ], "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f", "startOffset" : null, "endOffset" { "sinkTopic": { "0": 333 }} } {code} *Implementation* * PR submitters: Likely will be me and [~wajda] as soon as the discussion ends positively. * {{Sinks}}: Modify all sinks to conform a new interface or return dummy values
[jira] [Updated] (SPARK-24647) Sink Should Return Writen Offsets For ProgressReporting
[ https://issues.apache.org/jira/browse/SPARK-24647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vaclav Kosar updated SPARK-24647: - Summary: Sink Should Return Writen Offsets For ProgressReporting (was: Sink Should Return OffsetSeqs For ProgressReporting) > Sink Should Return Writen Offsets For ProgressReporting > --- > > Key: SPARK-24647 > URL: https://issues.apache.org/jira/browse/SPARK-24647 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Vaclav Kosar >Priority: Major > > To be able to track data lineage for Structured Streaming (I intend to > implement this to Open Source Project Spline), the monitoring needs to be > able to not only to track where the data was read from but also where results > were written to. This could be to my knowledge best implemented using > monitoring {{StreamingQueryProgress}}. However currently batch data offsets > are not available on {{Sink}} interface. Implementing as proposed would also > bring symmetry to {{StreamingQueryProgress}} fields sources and sink. > > *Similar Proposals* > Made in following jiras. These would not be sufficient for lineage tracking. > * https://issues.apache.org/jira/browse/SPARK-18258 > * https://issues.apache.org/jira/browse/SPARK-21313 > > *Current State* > * Method {{Sink#addBatch}} returns {{Unit}}. > * {{StreamingQueryProgress}} reports {{offsetSeq}} start and end using > {{sourceProgress}} value but {{sinkProgress}} only calls {{toString}} method. 
> {code:java} > "sources" : [ { > "description" : "KafkaSource[Subscribe[test-topic]]", > "startOffset" : null, > "endOffset" : { "test-topic" : { "0" : 5000 }}, > "numInputRows" : 5000, > "processedRowsPerSecond" : 645.3278265358803 > } ], > "sink" : { > "description" : > "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f" > } > {code} > > > *Proposed State* > * {{Sink#addBatch}} to return {{OffsetSeq}} or {{StreamProgress}} specifying > offsets of the written batch, e.g. Kafka does it by returning > {{RecordMetadata}} object from {{send}} method. > * {{StreamingQueryProgress}} incorporate {{sinkProgress}} in similar fashion > as {{sourceProgress}}. > > > {code:java} > "sources" : [ { > "description" : "KafkaSource[Subscribe[test-topic]]", > "startOffset" : null, > "endOffset" : { "test-topic" : { "0" : 5000 }}, > "numInputRows" : 5000, > "processedRowsPerSecond" : 645.3278265358803 > } ], > "sink" : { > "description" : > "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f", > "startOffset" : null, > "endOffset" { "sinkTopic": { "0": 333 }} > } > {code} > > *Implementation* > * PR submitters: Likely will be me and [~wajda] as soon as the discussion > ends positively. > * {{Sinks}}: Modify all sinks to conform a new interface or return dummy > values. > * {{ProgressReporter}}: Merge offsets from different batches properly, > similarly to how it is done for sources. >
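The "merge offsets from different batches" step of the proposal above can be sketched in a few lines. This is a hedged illustration, not Spark code: it assumes each WriterCommitMessage carried a hypothetical partition -> (startOffset, endOffset) map (today's WriterCommitMessage carries no offsets), and folds the per-writer ranges into one start/end pair per partition, as a ProgressReporter-style reducer would.

```python
# Hypothetical merge step: combine per-writer offset ranges into the
# single sink range reported in StreamingQueryProgress, like the
# "sinkTopic": {"0": 333} example above.

def merge_commit_offsets(messages):
    """Per partition, take the min of starts and the max of ends."""
    merged = {}
    for msg in messages:
        for partition, (start, end) in msg.items():
            if partition in merged:
                lo, hi = merged[partition]
                merged[partition] = (min(lo, start), max(hi, end))
            else:
                merged[partition] = (start, end)
    return merged

# Three writer tasks commit adjacent ranges of partition "0".
messages = [{"0": (0, 100)}, {"0": (100, 250)}, {"0": (250, 333)}]
print(merge_commit_offsets(messages))  # {'0': (0, 333)}
```

The same fold works when different writers touch different partitions, which is why a per-partition map is a natural shape for the commit message.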
[jira] [Commented] (SPARK-24897) DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for stage fetchFailed
[ https://issues.apache.org/jira/browse/SPARK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558144#comment-16558144 ] liupengcheng commented on SPARK-24897: -- Already fixed in 2.x. > DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for > stage fetchFailed > -- > > Key: SPARK-24897 > URL: https://issues.apache.org/jira/browse/SPARK-24897 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.6.1 >Reporter: liupengcheng >Priority: Major > > In Spark 2.1, when a stage fetch-fails, DAGScheduler will retry both this stage > and its parent stage. However, when the parent stage is resubmitted and > starts running, the map statuses can > still be invalidated by the stage's outstanding tasks due to fetch failure. > The stage's outstanding tasks might unregister the map statuses with a new epoch, > thus causing > the parent stage repeated MetadataFetchFailed errors and finally failing the job. > > > {code:java} > 2018-07-23,01:52:33,012 WARN org.apache.spark.scheduler.TaskSetManager: Lost > task 174.0 in stage 71.0 (TID 154127, , executor 96): > FetchFailed(BlockManagerId(4945, , 22409), shuffleId=24, mapId=667, > reduceId=174, message= org.apache.spark.shuffle.FetchFailedException: Failed > to connect to /:22409 > 2018-07-23,01:52:33,013 INFO org.apache.spark.scheduler.DAGScheduler: Marking > ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) as failed due to a > fetch failure from ShuffleMapStage 69 ($plus$plus at > DeviceLocateMain.scala:236) > 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: > ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) failed in 246.856 s > 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: > Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) > and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch > failure > 2018-07-23,01:52:36,004 WARN 
org.apache.spark.scheduler.TaskSetManager: Lost > task 120.0 in stage 71.0 (TID 154073, , executor 286): > FetchFailed(BlockManagerId(4208, , 22409), shuffleId=24, mapId=241, > reduceId=120, message= org.apache.spark.shuffle.FetchFailedException: Failed > to connect to /:22409 > 2018-07-23,01:52:36,005 INFO org.apache.spark.scheduler.DAGScheduler: > Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) > and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch > failure > 2018-07-23,01:52:36,017 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece0 stored as bytes in memory (estimated size 4.0 MB, free > 26.7 MB) > 2018-07-23,01:52:36,025 INFO org.apache.spark.storage.BlockManagerInfo: > Removed broadcast_59_piece1 on :52349 in memory (size: 4.0 MB, free: > 3.0 GB) > 2018-07-23,01:52:36,029 INFO org.apache.spark.storage.BlockManagerInfo: > Removed broadcast_61_piece6 on :52349 in memory (size: 4.0 MB, free: > 3.0 GB) > 2018-07-23,01:52:36,079 INFO org.apache.spark.deploy.yarn.YarnAllocator: > Canceling requests for 0 executor containers > 2018-07-23,01:52:36,079 WARN org.apache.spark.deploy.yarn.YarnAllocator: > Expected to find pending requests, but found none. 
> 2018-07-23,01:52:36,094 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_63_piece0 in memory on :56780 (size: 4.0 MB, free: 3.7 > GB) > 2018-07-23,01:52:36,095 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece1 stored as bytes in memory (estimated size 4.0 MB, free > 30.7 MB) > 2018-07-23,01:52:36,107 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece1 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece2 stored as bytes in memory (estimated size 4.0 MB, free > 34.7 MB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece2 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece3 stored as bytes in memory (estimated size 4.0 MB, free > 38.7 MB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece3 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece4 stored as bytes in memory (estimated size 3.8 MB, free > 42.5 MB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece4 in memory on :56780 (size: 3.8 MB, free: 3.7 GB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.MapOutputTracker: Broadcast > mapstatu
[jira] [Updated] (SPARK-24897) DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for stage fetchFailed
[ https://issues.apache.org/jira/browse/SPARK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liupengcheng updated SPARK-24897: - Affects Version/s: (was: 2.3.1) (was: 2.1.0) 1.6.1 > DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for > stage fetchFailed > -- > > Key: SPARK-24897 > URL: https://issues.apache.org/jira/browse/SPARK-24897 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.6.1 >Reporter: liupengcheng >Priority: Major > > In Spark 2.1, when a stage fetch-fails, DAGScheduler will retry both this stage > and its parent stage. However, when the parent stage is resubmitted and > starts running, the map statuses can > still be invalidated by the stage's outstanding tasks due to fetch failure. > The stage's outstanding tasks might unregister the map statuses with a new epoch, > thus causing > the parent stage repeated MetadataFetchFailed errors and finally failing the job. > > > {code:java} > 2018-07-23,01:52:33,012 WARN org.apache.spark.scheduler.TaskSetManager: Lost > task 174.0 in stage 71.0 (TID 154127, , executor 96): > FetchFailed(BlockManagerId(4945, , 22409), shuffleId=24, mapId=667, > reduceId=174, message= org.apache.spark.shuffle.FetchFailedException: Failed > to connect to /:22409 > 2018-07-23,01:52:33,013 INFO org.apache.spark.scheduler.DAGScheduler: Marking > ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) as failed due to a > fetch failure from ShuffleMapStage 69 ($plus$plus at > DeviceLocateMain.scala:236) > 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: > ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) failed in 246.856 s > 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: > Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) > and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch > failure > 2018-07-23,01:52:36,004 WARN org.apache.spark.scheduler.TaskSetManager: Lost > 
task 120.0 in stage 71.0 (TID 154073, , executor 286): > FetchFailed(BlockManagerId(4208, , 22409), shuffleId=24, mapId=241, > reduceId=120, message= org.apache.spark.shuffle.FetchFailedException: Failed > to connect to /:22409 > 2018-07-23,01:52:36,005 INFO org.apache.spark.scheduler.DAGScheduler: > Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) > and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch > failure > 2018-07-23,01:52:36,017 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece0 stored as bytes in memory (estimated size 4.0 MB, free > 26.7 MB) > 2018-07-23,01:52:36,025 INFO org.apache.spark.storage.BlockManagerInfo: > Removed broadcast_59_piece1 on :52349 in memory (size: 4.0 MB, free: > 3.0 GB) > 2018-07-23,01:52:36,029 INFO org.apache.spark.storage.BlockManagerInfo: > Removed broadcast_61_piece6 on :52349 in memory (size: 4.0 MB, free: > 3.0 GB) > 2018-07-23,01:52:36,079 INFO org.apache.spark.deploy.yarn.YarnAllocator: > Canceling requests for 0 executor containers > 2018-07-23,01:52:36,079 WARN org.apache.spark.deploy.yarn.YarnAllocator: > Expected to find pending requests, but found none. 
> 2018-07-23,01:52:36,094 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_63_piece0 in memory on :56780 (size: 4.0 MB, free: 3.7 > GB) > 2018-07-23,01:52:36,095 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece1 stored as bytes in memory (estimated size 4.0 MB, free > 30.7 MB) > 2018-07-23,01:52:36,107 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece1 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece2 stored as bytes in memory (estimated size 4.0 MB, free > 34.7 MB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece2 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece3 stored as bytes in memory (estimated size 4.0 MB, free > 38.7 MB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece3 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece4 stored as bytes in memory (estimated size 3.8 MB, free > 42.5 MB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece4 in memory on :56780 (size: 3.8 MB, free: 3.7 GB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.MapOutput
[jira] [Resolved] (SPARK-24897) DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for stage fetchFailed
[ https://issues.apache.org/jira/browse/SPARK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liupengcheng resolved SPARK-24897. -- Resolution: Invalid > DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for > stage fetchFailed > -- > > Key: SPARK-24897 > URL: https://issues.apache.org/jira/browse/SPARK-24897 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.1.0, 2.3.1 >Reporter: liupengcheng >Priority: Major > > In Spark 2.1, when a stage fetch-fails, DAGScheduler will retry both this stage > and its parent stage. However, when the parent stage is resubmitted and > starts running, the map statuses can > still be invalidated by the stage's outstanding tasks due to fetch failure. > The stage's outstanding tasks might unregister the map statuses with a new epoch, > thus causing > the parent stage repeated MetadataFetchFailed errors and finally failing the job. > > > {code:java} > 2018-07-23,01:52:33,012 WARN org.apache.spark.scheduler.TaskSetManager: Lost > task 174.0 in stage 71.0 (TID 154127, , executor 96): > FetchFailed(BlockManagerId(4945, , 22409), shuffleId=24, mapId=667, > reduceId=174, message= org.apache.spark.shuffle.FetchFailedException: Failed > to connect to /:22409 > 2018-07-23,01:52:33,013 INFO org.apache.spark.scheduler.DAGScheduler: Marking > ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) as failed due to a > fetch failure from ShuffleMapStage 69 ($plus$plus at > DeviceLocateMain.scala:236) > 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: > ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) failed in 246.856 s > 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: > Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) > and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch > failure > 2018-07-23,01:52:36,004 WARN org.apache.spark.scheduler.TaskSetManager: Lost > task 120.0 in stage 71.0 
(TID 154073, , executor 286): > FetchFailed(BlockManagerId(4208, , 22409), shuffleId=24, mapId=241, > reduceId=120, message= org.apache.spark.shuffle.FetchFailedException: Failed > to connect to /:22409 > 2018-07-23,01:52:36,005 INFO org.apache.spark.scheduler.DAGScheduler: > Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) > and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch > failure > 2018-07-23,01:52:36,017 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece0 stored as bytes in memory (estimated size 4.0 MB, free > 26.7 MB) > 2018-07-23,01:52:36,025 INFO org.apache.spark.storage.BlockManagerInfo: > Removed broadcast_59_piece1 on :52349 in memory (size: 4.0 MB, free: > 3.0 GB) > 2018-07-23,01:52:36,029 INFO org.apache.spark.storage.BlockManagerInfo: > Removed broadcast_61_piece6 on :52349 in memory (size: 4.0 MB, free: > 3.0 GB) > 2018-07-23,01:52:36,079 INFO org.apache.spark.deploy.yarn.YarnAllocator: > Canceling requests for 0 executor containers > 2018-07-23,01:52:36,079 WARN org.apache.spark.deploy.yarn.YarnAllocator: > Expected to find pending requests, but found none. 
> 2018-07-23,01:52:36,094 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_63_piece0 in memory on :56780 (size: 4.0 MB, free: 3.7 > GB) > 2018-07-23,01:52:36,095 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece1 stored as bytes in memory (estimated size 4.0 MB, free > 30.7 MB) > 2018-07-23,01:52:36,107 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece1 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece2 stored as bytes in memory (estimated size 4.0 MB, free > 34.7 MB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece2 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece3 stored as bytes in memory (estimated size 4.0 MB, free > 38.7 MB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece3 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece4 stored as bytes in memory (estimated size 3.8 MB, free > 42.5 MB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece4 in memory on :56780 (size: 3.8 MB, free: 3.7 GB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.MapOutputTracker: Broadcast > mapstatuses size = 384, actual size = 20784475 > 2018
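The behavior the reporter asks for, not repeatedly unregistering map output or bumping the epoch for duplicate fetch failures of the same stage attempt, can be sketched as epoch gating. This is an illustrative toy, not Spark's DAGScheduler code; all class and method names here are hypothetical.

```python
# Sketch of epoch gating: a fetch failure only invalidates map output and
# bumps the epoch if it was observed under the current epoch, so stale
# duplicates from an old stage attempt cannot wipe freshly recomputed
# statuses of the resubmitted parent stage.

class MapOutputRegistry:
    def __init__(self):
        self.epoch = 0
        self.statuses = {}  # (shuffle_id, map_id) -> epoch at registration

    def register(self, shuffle_id, map_id):
        self.statuses[(shuffle_id, map_id)] = self.epoch

    def on_fetch_failure(self, shuffle_id, map_id, failure_epoch):
        if failure_epoch < self.epoch:
            return False  # stale report from an earlier attempt: ignore
        self.statuses.pop((shuffle_id, map_id), None)
        self.epoch += 1   # invalidate cached statuses exactly once
        return True

reg = MapOutputRegistry()
reg.register(24, 667)                                   # shuffleId=24, mapId=667 as in the log
first = reg.on_fetch_failure(24, 667, failure_epoch=0)  # first failure: handled
reg.register(24, 667)                                   # parent recomputes under the new epoch
dup = reg.on_fetch_failure(24, 667, failure_epoch=0)    # duplicate from the old attempt: ignored
print(first, dup, (24, 667) in reg.statuses)
```

Without the `failure_epoch < self.epoch` guard, the second (stale) failure would unregister the freshly recomputed output again, reproducing the repeated MetadataFetchFailed loop described above.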
[jira] [Updated] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
[ https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaochen Ouyang updated SPARK-24930: Description: # The root user creates a test.txt file containing a record '123' in the /root/ directory # Switch to the mr user and execute spark-shell --master local {code:java} scala> spark.version res2: String = 2.2.1 scala> spark.sql("create table t1(id int) partitioned by(area string)"); 2018-07-26 17:20:37,523 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: Location: hdfs://nameservice/spark/t1 specified for non-external table:t1 res4: org.apache.spark.sql.DataFrame = [] scala> spark.sql("load data local inpath '/root/test.txt' into table t1 partition(area ='025')") org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: /root/test.txt; at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.Dataset.(Dataset.scala:183) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) ... 48 elided scala> {code} In fact, the input path exists, but the mr user does not have permission to access the directory `/root/`, so the message thrown by `AnalysisException` can confuse the user. 
was: # root user create a test.txt file contains a record '123' in /root/ directory # switch mr user to execute spark-shell --master local {code:java} scala> spark.version res2: String = 2.2.1 scala> spark.sql("create table t1(id int) partitioned by(area string)"); 2018-07-26 17:20:37,523 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: Location: hdfs://nameservice/spark/t1 specified for non-external table:t1 res4: org.apache.spark.sql.DataFrame = [] scala> spark.sql("load data local inpath '/root/test.txt' into table t1 partition(area ='025')") org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: /root/test.txt; at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.Dataset.(Dataset.scala:183) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) ... 48 elided scala> {code} In fact, the input path exists, but the mr user does not have permission to access the directory `/root/` ,so the message throwed by `AnalysisException` can misleading user fix problem. 
> Exception information is not accurate when using `LOAD DATA LOCAL INPATH` > -- > > Key: SPARK-24930 > URL: https://issues.apache.org/jira/browse/SPARK-24930 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Xiaochen Ouyang >Priority: Major > > # root user create a test.txt file contains a record '123' in /root/ > directory > # switch mr user to execute spark-shell --master local > {code:java} > scala> spark.version > res2: String = 2.2.1 > scala> spark.sql("create table t1(id int) partitioned by(area string)"); > 2018-07-26 17:20:37,523 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: > Location: hdfs://nameservice/spark/t1 specified for non-external table:t1 > res4: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("load data local inpath '/root/test.txt' into table t1 > partition(area ='025')") > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > /root/test.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) > ... 48 elided > scala> > {code} > In fact, the input path exists, but the mr user does not have permission to > access the directory `/root/` ,so the messag
[jira] [Updated] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
[ https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaochen Ouyang updated SPARK-24930: Description: # root user create a test.txt file contains a record '123' in /root/ directory # switch mr user to execute spark-shell --master local {code:java} scala> spark.version res2: String = 2.2.1 scala> spark.sql("create table t1(id int) partitioned by(area string)"); 2018-07-26 17:20:37,523 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: Location: hdfs://nameservice/spark/t1 specified for non-external table:t1 res4: org.apache.spark.sql.DataFrame = [] scala> spark.sql("load data local inpath '/root/test.txt' into table t1 partition(area ='025')") org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: /root/test.txt; at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.Dataset.(Dataset.scala:183) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) ... 48 elided scala> {code} In fact, the input path exists, but the mr user does not have permission to access the directory `/root/` ,so the message throwed by `AnalysisException` can misleading user fix problem. 
was: # root user create a test.txt file contains a record '123' in /root/ directory # switch mr user to execute spark-shell --master local {code:java} scala> spark.version res2: String = 2.2.1 scala> spark.sql("load data local inpath '/root/test.txt' into table t1 partition(area ='025')") org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: /root/test.txt; at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.Dataset.(Dataset.scala:183) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) ... 48 elided scala> {code} In fact, the input path exists, but the mr user does not have permission to access the directory `/root/` ,so the message throwed by `AnalysisException` can misleading user fix problem. 
> Exception information is not accurate when using `LOAD DATA LOCAL INPATH` > -- > > Key: SPARK-24930 > URL: https://issues.apache.org/jira/browse/SPARK-24930 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Xiaochen Ouyang >Priority: Major > > # root user create a test.txt file contains a record '123' in /root/ > directory > # switch mr user to execute spark-shell --master local > {code:java} > scala> spark.version > res2: String = 2.2.1 > scala> spark.sql("create table t1(id int) partitioned by(area string)"); > 2018-07-26 17:20:37,523 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: > Location: hdfs://nameservice/spark/t1 specified for non-external table:t1 > res4: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("load data local inpath '/root/test.txt' into table t1 > partition(area ='025')") > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > /root/test.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) > ... 48 elided > scala> > {code} > In fact, the input path exists, but the mr user does not have permission to > access the directory `/root/` ,so the message throwed by `AnalysisException` > can misleading user fix problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.a
[jira] [Assigned] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
[ https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24930: Assignee: (was: Apache Spark) > Exception information is not accurate when using `LOAD DATA LOCAL INPATH` > -- > > Key: SPARK-24930 > URL: https://issues.apache.org/jira/browse/SPARK-24930 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Xiaochen Ouyang >Priority: Major > > # root user create a test.txt file contains a record '123' in /root/ > directory > # switch mr user to execute spark-shell --master local > {code:java} > scala> spark.version > res2: String = 2.2.1 > scala> spark.sql("load data local inpath '/root/test.txt' into table t1 > partition(area ='025')") > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > /root/test.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) > ... 48 elided > scala> > {code} > In fact, the input path exists, but the mr user does not have permission to > access the directory `/root/` ,so the message throwed by `AnalysisException` > can misleading user fix problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
[ https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558090#comment-16558090 ] Apache Spark commented on SPARK-24930: -- User 'ouyangxiaochen' has created a pull request for this issue: https://github.com/apache/spark/pull/21881 > Exception information is not accurate when using `LOAD DATA LOCAL INPATH` > -- > > Key: SPARK-24930 > URL: https://issues.apache.org/jira/browse/SPARK-24930 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Xiaochen Ouyang >Priority: Major > > # root user create a test.txt file contains a record '123' in /root/ > directory > # switch mr user to execute spark-shell --master local > {code:java} > scala> spark.version > res2: String = 2.2.1 > scala> spark.sql("load data local inpath '/root/test.txt' into table t1 > partition(area ='025')") > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > /root/test.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) > ... 48 elided > scala> > {code} > In fact, the input path exists, but the mr user does not have permission to > access the directory `/root/` ,so the message throwed by `AnalysisException` > can misleading user fix problem. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
[ https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24930: Assignee: Apache Spark > Exception information is not accurate when using `LOAD DATA LOCAL INPATH` > -- > > Key: SPARK-24930 > URL: https://issues.apache.org/jira/browse/SPARK-24930 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Xiaochen Ouyang >Assignee: Apache Spark >Priority: Major > > # root user create a test.txt file contains a record '123' in /root/ > directory > # switch mr user to execute spark-shell --master local > {code:java} > scala> spark.version > res2: String = 2.2.1 > scala> spark.sql("load data local inpath '/root/test.txt' into table t1 > partition(area ='025')") > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > /root/test.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) > ... 48 elided > scala> > {code} > In fact, the input path exists, but the mr user does not have permission to > access the directory `/root/` ,so the message throwed by `AnalysisException` > can misleading user fix problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12911) Cacheing a dataframe causes array comparisons to fail (in filter / where) after 1.6
[ https://issues.apache.org/jira/browse/SPARK-12911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558087#comment-16558087 ] Hyukjin Kwon commented on SPARK-12911: -- Looks indeed similar. Mind if I ask to open another JIRA for it separately? Symptom looks similar but difficult to judge if they are actually same or not. > Cacheing a dataframe causes array comparisons to fail (in filter / where) > after 1.6 > --- > > Key: SPARK-12911 > URL: https://issues.apache.org/jira/browse/SPARK-12911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.6.0 >Reporter: Jesse English >Priority: Major > > When doing a *where* operation on a dataframe and testing for equality on an > array type, after 1.6 no valid comparisons are made if the dataframe has been > cached. If it has not been cached, the results are as expected. > This appears to be related to the underlying unsafe array data types. 
> {code:title=test.scala|borderStyle=solid} > test("test array comparison") { > val vectors: Vector[Row] = Vector( > Row.fromTuple("id_1" -> Array(0L, 2L)), > Row.fromTuple("id_2" -> Array(0L, 5L)), > Row.fromTuple("id_3" -> Array(0L, 9L)), > Row.fromTuple("id_4" -> Array(1L, 0L)), > Row.fromTuple("id_5" -> Array(1L, 8L)), > Row.fromTuple("id_6" -> Array(2L, 4L)), > Row.fromTuple("id_7" -> Array(5L, 6L)), > Row.fromTuple("id_8" -> Array(6L, 2L)), > Row.fromTuple("id_9" -> Array(7L, 0L)) > ) > val data: RDD[Row] = sc.parallelize(vectors, 3) > val schema = StructType( > StructField("id", StringType, false) :: > StructField("point", DataTypes.createArrayType(LongType, false), > false) :: > Nil > ) > val sqlContext = new SQLContext(sc) > val dataframe = sqlContext.createDataFrame(data, schema) > val targetPoint:Array[Long] = Array(0L,9L) > //Cacheing is the trigger to cause the error (no cacheing causes no error) > dataframe.cache() > //This is the line where it fails > //java.util.NoSuchElementException: next on empty iterator > //However we know that there is a valid match > val targetRow = dataframe.where(dataframe("point") === > array(targetPoint.map(value => lit(value)): _*)).first() > assert(targetRow != null) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
Xiaochen Ouyang created SPARK-24930: --- Summary: Exception information is not accurate when using `LOAD DATA LOCAL INPATH` Key: SPARK-24930 URL: https://issues.apache.org/jira/browse/SPARK-24930 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1, 2.3.0, 2.2.2, 2.2.1 Reporter: Xiaochen Ouyang # The root user creates a test.txt file containing the record '123' in the /root/ directory # Switch to the mr user and run spark-shell --master local {code:java} scala> spark.version res2: String = 2.2.1 scala> spark.sql("load data local inpath '/root/test.txt' into table t1 partition(area ='025')") org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: /root/test.txt; at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.Dataset.(Dataset.scala:183) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) ... 48 elided scala> {code} In fact, the input path exists, but the mr user does not have permission to access the directory `/root/`, so the message thrown by `AnalysisException` can mislead the user when diagnosing the problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
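The improvement boils down to distinguishing "path does not exist" from "path exists but the caller cannot reach it". The real fix would live in Scala's `LoadDataCommand`; the sketch below is an illustrative Python analogue, and `describe_load_failure` is a hypothetical helper name, not Spark API:

```python
import os

def describe_load_failure(path):
    """Return a precise error message for a local LOAD DATA path, or
    None if the path is usable.

    Checks the parent directory's permissions first, because a plain
    existence test reports "does not exist" for paths hidden behind an
    inaccessible directory -- the ambiguity reported in SPARK-24930.
    """
    parent = os.path.dirname(path) or "."
    if not os.access(parent, os.X_OK):
        return f"LOAD DATA input path is not accessible: permission denied on directory {parent}"
    if not os.path.exists(path):
        return f"LOAD DATA input path does not exist: {path}"
    if not os.access(path, os.R_OK):
        return f"LOAD DATA input path exists but is not readable: {path}"
    return None
```

With a check like this, the mr user in the scenario above would see a permission-denied message for `/root/` instead of the misleading "does not exist".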
[jira] [Updated] (SPARK-24928) spark sql cross join running time too long
[ https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LIFULONG updated SPARK-24928: - Description: spark sql running time is too long while input left table and right table is small hdfs text format data, the sql is: select * from t1 cross join t2 the line of t1 is 49, three column the line of t2 is 1, one column only running more than 30mins and then failed spark CartesianRDD also has the same problem, example test code is: val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b") //1 line 1 column val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b") //49 line 3 column val cartesian = new CartesianRDD(sc, twos, ones) cartesian.count() running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use less than 10 seconds was: spark sql running time is too long while input left table and right table is small text format data, the sql is: select * from t1 cross join t2 the line of t1 is 49, three column the line of t2 is 1, one column only running more than 30mins and then failed spark CartesianRDD also has the same problem, example test code is: val ones = sc.textFile("file:///Users/moses/4paradigm/data/cartesian_data/t1b") //1 line 1 column val twos = sc.textFile("file:///Users/moses/4paradigm/data/cartesian_data/t2b") //49 line 3 column val cartesian = new CartesianRDD(sc, twos, ones) cartesian.count() running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use less than 10 seconds > spark sql cross join running time too long > -- > > Key: SPARK-24928 > URL: https://issues.apache.org/jira/browse/SPARK-24928 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 1.6.2 >Reporter: LIFULONG >Priority: Minor > > spark sql running time is too long while input left table and right table is > small hdfs text format data, > the sql is: select * from t1 cross join t2 > the line of t1 is 49, three column > the line of t2 is 1, one 
column only > running more than 30mins and then failed > > > spark CartesianRDD also has the same problem, example test code is: > val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b") //1 line > 1 column > val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b") //49 > line 3 column > val cartesian = new CartesianRDD(sc, twos, ones) > cartesian.count() > running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use > less than 10 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
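The asymmetry reported here follows from how `CartesianRDD` computes a partition: it iterates the first parent and requests a fresh iterator over the second parent for every outer row, so whichever RDD is placed second may be re-read from storage many times. (The report is internally inconsistent about which file is larger, but the mechanism is the same either way.) A small Python simulation of that access pattern, using hypothetical classes rather than Spark API:

```python
class CountingSource:
    """Stands in for an RDD partition whose backing data is re-scanned
    every time an iterator is requested (no caching)."""
    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def iterator(self):
        self.reads += 1  # one full scan of the source per request
        return iter(self.rows)

def cartesian(outer, inner):
    # Mirrors CartesianRDD's nested iteration: the inner source is
    # re-scanned once for every row of the outer source.
    return [(x, y) for x in outer.iterator() for y in inner.iterator()]
```

With a 49-row outer source and a 1-row inner source, the inner side is scanned 49 times; swapping the operands scans each side exactly once. When the re-scanned side is a full HDFS read, that difference plausibly explains minutes versus seconds.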
[jira] [Commented] (SPARK-24929) Merge script swallow KeyboardInterrupt
[ https://issues.apache.org/jira/browse/SPARK-24929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558065#comment-16558065 ] Apache Spark commented on SPARK-24929: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/21880 > Merge script swallow KeyboardInterrupt > -- > > Key: SPARK-24929 > URL: https://issues.apache.org/jira/browse/SPARK-24929 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Trivial > > If I want to get out of the loop to assign JIRA's user by command+c > (KeyboardInterrupt), I am unable to get out as below: > {code} > Error assigning JIRA, try again (or leave blank and fix manually) > JIRA is unassigned, choose assignee > [0] todd.chen (Reporter) > Enter number of user, or userid, to assign to (blank to leave > unassigned):Traceback (most recent call last): > File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee > "Enter number of user, or userid, to assign to (blank to leave > unassigned):") > KeyboardInterrupt > Error assigning JIRA, try again (or leave blank and fix manually) > JIRA is unassigned, choose assignee > [0] todd.chen (Reporter) > Enter number of user, or userid, to assign to (blank to leave > unassigned):Traceback (most recent call last): > File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee > "Enter number of user, or userid, to assign to (blank to leave > unassigned):") > KeyboardInterrupt > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24929) Merge script swallow KeyboardInterrupt
[ https://issues.apache.org/jira/browse/SPARK-24929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24929: Assignee: (was: Apache Spark) > Merge script swallow KeyboardInterrupt > -- > > Key: SPARK-24929 > URL: https://issues.apache.org/jira/browse/SPARK-24929 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Trivial > > If I want to get out of the loop to assign JIRA's user by command+c > (KeyboardInterrupt), I am unable to get out as below: > {code} > Error assigning JIRA, try again (or leave blank and fix manually) > JIRA is unassigned, choose assignee > [0] todd.chen (Reporter) > Enter number of user, or userid, to assign to (blank to leave > unassigned):Traceback (most recent call last): > File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee > "Enter number of user, or userid, to assign to (blank to leave > unassigned):") > KeyboardInterrupt > Error assigning JIRA, try again (or leave blank and fix manually) > JIRA is unassigned, choose assignee > [0] todd.chen (Reporter) > Enter number of user, or userid, to assign to (blank to leave > unassigned):Traceback (most recent call last): > File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee > "Enter number of user, or userid, to assign to (blank to leave > unassigned):") > KeyboardInterrupt > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24929) Merge script swallow KeyboardInterrupt
[ https://issues.apache.org/jira/browse/SPARK-24929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24929: Assignee: Apache Spark > Merge script swallow KeyboardInterrupt > -- > > Key: SPARK-24929 > URL: https://issues.apache.org/jira/browse/SPARK-24929 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Trivial > > If I want to get out of the loop to assign JIRA's user by command+c > (KeyboardInterrupt), I am unable to get out as below: > {code} > Error assigning JIRA, try again (or leave blank and fix manually) > JIRA is unassigned, choose assignee > [0] todd.chen (Reporter) > Enter number of user, or userid, to assign to (blank to leave > unassigned):Traceback (most recent call last): > File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee > "Enter number of user, or userid, to assign to (blank to leave > unassigned):") > KeyboardInterrupt > Error assigning JIRA, try again (or leave blank and fix manually) > JIRA is unassigned, choose assignee > [0] todd.chen (Reporter) > Enter number of user, or userid, to assign to (blank to leave > unassigned):Traceback (most recent call last): > File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee > "Enter number of user, or userid, to assign to (blank to leave > unassigned):") > KeyboardInterrupt > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24928) spark sql cross join running time too long
[ https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LIFULONG updated SPARK-24928: - Priority: Minor (was: Major) > spark sql cross join running time too long > -- > > Key: SPARK-24928 > URL: https://issues.apache.org/jira/browse/SPARK-24928 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 1.6.2 >Reporter: LIFULONG >Priority: Minor > > spark sql running time is too long while input left table and right table is > small text format data, > the sql is: select * from t1 cross join t2 > the line of t1 is 49, three column > the line of t2 is 1, one column only > running more than 30mins and then failed > > > spark CartesianRDD also has the same problem, example test code is: > val ones = > sc.textFile("file:///Users/moses/4paradigm/data/cartesian_data/t1b") //1 > line 1 column > val twos = > sc.textFile("file:///Users/moses/4paradigm/data/cartesian_data/t2b") > //49 line 3 column > val cartesian = new CartesianRDD(sc, twos, ones) > cartesian.count() > running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use > less than 10 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24929) Merge script swallow KeyboardInterrupt
Hyukjin Kwon created SPARK-24929: Summary: Merge script swallow KeyboardInterrupt Key: SPARK-24929 URL: https://issues.apache.org/jira/browse/SPARK-24929 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 2.4.0 Reporter: Hyukjin Kwon If I want to get out of the loop to assign JIRA's user by command+c (KeyboardInterrupt), I am unable to get out as below: {code} Error assigning JIRA, try again (or leave blank and fix manually) JIRA is unassigned, choose assignee [0] todd.chen (Reporter) Enter number of user, or userid, to assign to (blank to leave unassigned):Traceback (most recent call last): File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee "Enter number of user, or userid, to assign to (blank to leave unassigned):") KeyboardInterrupt Error assigning JIRA, try again (or leave blank and fix manually) JIRA is unassigned, choose assignee [0] todd.chen (Reporter) Enter number of user, or userid, to assign to (blank to leave unassigned):Traceback (most recent call last): File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee "Enter number of user, or userid, to assign to (blank to leave unassigned):") KeyboardInterrupt {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
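The symptom above is the classic broad-except pitfall: a bare `except:` in the retry loop catches `BaseException` subclasses including `KeyboardInterrupt`, so Ctrl+C just restarts the prompt. A minimal sketch of the fix, where the function name and `prompt_fn` parameter are illustrative rather than the actual `merge_spark_pr.py` code:

```python
def prompt_with_retry(prompt_fn):
    """Retry on bad input, but let Ctrl+C propagate so the user can exit."""
    while True:
        try:
            return prompt_fn()
        except KeyboardInterrupt:
            raise  # re-raise before the broad handler can swallow it
        except Exception:
            print("Error assigning JIRA, try again (or leave blank and fix manually)")
```

Narrowing the handler to `except Exception` (or explicitly re-raising `KeyboardInterrupt` first, as shown) is enough; an interrupt now escapes the loop instead of triggering another "try again".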
[jira] [Resolved] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24924. -- Resolution: Fixed Fixed in https://github.com/apache/spark/pull/21878 > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims at the following. > # Like the `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to the built-in Avro data source. > # Remove the incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24924: - Fix Version/s: 2.4.0 > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-24924: Assignee: Dongjoon Hyun > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
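The mapping itself is a simple alias table consulted during data source resolution (in Spark this is the backward-compatibility map used by `DataSource.lookupDataSource`). An illustrative Python sketch; the dict and function names here are hypothetical:

```python
# Alias table in the spirit of Spark's backward-compatibility map
# (entries shortened; illustrative only).
DATA_SOURCE_ALIASES = {
    "com.databricks.spark.csv": "csv",
    "com.databricks.spark.avro": "avro",  # the new mapping from SPARK-24924
}

def resolve_data_source(name):
    """Map a legacy package name to the built-in short name, if any."""
    return DATA_SOURCE_ALIASES.get(name, name)
```

Unmapped names pass through unchanged, so existing third-party sources keep working while old `com.databricks.spark.avro` jobs transparently pick up the built-in implementation.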
[jira] [Created] (SPARK-24928) spark sql cross join running time too long
LIFULONG created SPARK-24928: Summary: spark sql cross join running time too long Key: SPARK-24928 URL: https://issues.apache.org/jira/browse/SPARK-24928 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 1.6.2 Reporter: LIFULONG spark sql running time is too long while input left table and right table is small text format data, the sql is: select * from t1 cross join t2 the line of t1 is 49, three column the line of t2 is 1, one column only running more than 30mins and then failed spark CartesianRDD also has the same problem, example test code is: val ones = sc.textFile("file:///Users/moses/4paradigm/data/cartesian_data/t1b") //1 line 1 column val twos = sc.textFile("file:///Users/moses/4paradigm/data/cartesian_data/t2b") //49 line 3 column val cartesian = new CartesianRDD(sc, twos, ones) cartesian.count() running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use less than 10 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21436) Take advantage of known partioner for distinct on RDDs
[ https://issues.apache.org/jira/browse/SPARK-21436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558013#comment-16558013 ] zhengruifeng edited comment on SPARK-21436 at 7/26/18 7:38 AM: --- [~holdenk] It looks like {{distinct}} already utilizes the known partitioner. {{distinct}} calls {{combineByKeyWithClassTag}} internally, and will avoid the shuffle if the RDD's partitioner is equal to a hash partitioner with the same number of partitions. Or do you mean that we need to expose an API like {{distinct(partitioner: Partitioner)}} for other kinds of partitioners like {{RangePartitioner}}? was (Author: podongfeng): [~holdenk] It looks like {{distinct}} already utilizes the known partitioner. {{distinct}} calls {{combineByKeyWithClassTag}} internally, and will avoid the shuffle if the RDD's partitioner is equal to a hash partitioner with the same number of partitions. Or do you mean that we need to expose an API like {{distinct(partitioner: Partitioner)}} for other kinds of partitioners like {{RangePartitioner}}? > Take advantage of known partioner for distinct on RDDs > -- > > Key: SPARK-21436 > URL: https://issues.apache.org/jira/browse/SPARK-21436 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: holdenk >Priority: Minor > > If we have a known partitioner we should be able to avoid the shuffle. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21436) Take advantage of known partioner for distinct on RDDs
[ https://issues.apache.org/jira/browse/SPARK-21436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558013#comment-16558013 ] zhengruifeng edited comment on SPARK-21436 at 7/26/18 7:37 AM: --- [~holdenk] It looks like {{distinct}} already utilizes the known partitioner. {{distinct}} calls {{combineByKeyWithClassTag}} internally, and will avoid the shuffle if the RDD's partitioner is equal to a hash partitioner with the same number of partitions. Or do you mean that we need to expose an API like {{distinct(partitioner: Partitioner)}} for other kinds of partitioners like {{RangePartitioner}}? was (Author: podongfeng): [~holdenk] It looks like {{distinct}} already utilizes the known partitioner. {{distinct}} calls {{combineByKeyWithClassTag}} internally, and will avoid the shuffle if the RDD's partitioner is equal to a hash partitioner with the same number of partitions. Or do you mean that we need to expose an API like {{distinct(partitioner: Partitioner)}} for other kinds of partitioners like {{RangePartitioner}}? > Take advantage of known partioner for distinct on RDDs > -- > > Key: SPARK-21436 > URL: https://issues.apache.org/jira/browse/SPARK-21436 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: holdenk >Priority: Minor > > If we have a known partitioner we should be able to avoid the shuffle. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21436) Take advantage of known partioner for distinct on RDDs
[ https://issues.apache.org/jira/browse/SPARK-21436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558013#comment-16558013 ] zhengruifeng commented on SPARK-21436: -- [~holdenk] It looks like that \{distinct} already utilized the known partitioner. \{distinct} calls {{color:#ffc66d}{color:#33}combineByKeyWithClassTag} internally, and will avoid the shuffle if rdd's partitioner is equal to a hash partitioner with same #partitions. {color}{color} {color:#ffc66d}{color:#33}Or do you mean that we need to expose some API like {distinct(partitioner: {color}{color}Partitioner)} for other kinds of partitioners like \{RangePartitioner}? > Take advantage of known partioner for distinct on RDDs > -- > > Key: SPARK-21436 > URL: https://issues.apache.org/jira/browse/SPARK-21436 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: holdenk >Priority: Minor > > If we have a known partitioner we should be able to avoid the shuffle. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23683) FileCommitProtocol.instantiate to require 3-arg constructor for dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-23683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558009#comment-16558009 ] Felix Cheung commented on SPARK-23683: -- Should this be ported back to branch-2.3? We ran into this problem, and the original change (SPARK-20236) appears to be in the 2.3.0 release. > FileCommitProtocol.instantiate to require 3-arg constructor for dynamic > partition overwrite > --- > > Key: SPARK-23683 > URL: https://issues.apache.org/jira/browse/SPARK-23683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Major > Fix For: 2.4.0 > > > With SPARK-20236, {{FileCommitProtocol.instantiate()}} looks for a three-argument > constructor, passing in the {{dynamicPartitionOverwrite}} parameter. > If there is no such constructor, it falls back to the classic two-arg one. > When {{InsertIntoHadoopFsRelationCommand}} passes that > {{dynamicPartitionOverwrite}} flag down to {{FileCommitProtocol.instantiate()}}, > it _assumes_ that the instantiated protocol supports the specific > requirements of dynamic partition overwrite. It does not notice when this > does not hold, so the output generated may be incorrect. > Proposed: when {{dynamicPartitionOverwrite == true}}, require the protocol > implementation to have a 3-arg constructor.
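The lookup-and-fallback pattern described in the issue, together with the proposed fail-fast fix, can be sketched outside Spark. This is a minimal Python analogue, not the actual Scala implementation; the class and function names are hypothetical stand-ins for a committer with a 2-arg vs. 3-arg constructor.

```python
import inspect

class TwoArgProtocol:
    """Stand-in for a legacy committer with only the classic 2-arg constructor."""
    def __init__(self, job_id, path):
        self.dynamic = False

class ThreeArgProtocol:
    """Stand-in for a committer that accepts the dynamic-overwrite flag."""
    def __init__(self, job_id, path, dynamic_partition_overwrite):
        self.dynamic = dynamic_partition_overwrite

def instantiate(cls, job_id, path, dynamic_partition_overwrite=False):
    # Count constructor parameters: self + 3 means the 3-arg form exists.
    has_three_arg = len(inspect.signature(cls.__init__).parameters) >= 4
    if dynamic_partition_overwrite and not has_three_arg:
        # The proposed fix: refuse to silently fall back when the flag is
        # required, instead of producing incorrect output.
        raise TypeError(f"{cls.__name__} lacks a 3-arg constructor")
    if has_three_arg:
        return cls(job_id, path, dynamic_partition_overwrite)
    return cls(job_id, path)
```

With this shape, a 2-arg protocol still works for ordinary writes, but requesting dynamic partition overwrite against it fails at instantiation time rather than at output-commit time.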
[jira] [Resolved] (SPARK-24878) Fix reverse function for array type of primitive type containing null.
[ https://issues.apache.org/jira/browse/SPARK-24878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24878. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21830 [https://github.com/apache/spark/pull/21830] > Fix reverse function for array type of primitive type containing null. > -- > > Key: SPARK-24878 > URL: https://issues.apache.org/jira/browse/SPARK-24878 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.4.0 > > > If we use the {{reverse}} function on an array of a primitive type containing > {{null}}, and the child array is {{UnsafeArrayData}}, the function returns a > wrong result because {{UnsafeArrayData}} doesn't define the behavior of > re-assignment; in particular, we can't set a valid value after setting {{null}}.
[jira] [Commented] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files
[ https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16557985#comment-16557985 ] Apache Spark commented on SPARK-24927: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/21879 > The hadoop-provided profile doesn't play well with Snappy-compressed Parquet > files > -- > > Key: SPARK-24927 > URL: https://issues.apache.org/jira/browse/SPARK-24927 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.1, 2.3.2 >Reporter: Cheng Lian >Priority: Major > > Reproduction: > {noformat} > wget > https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz > wget > https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz > tar xzf spark-2.3.1-bin-without-hadoop.tgz > tar xzf hadoop-2.7.3.tar.gz > export SPARK_DIST_CLASSPATH=$(hadoop-2.7.3/bin/hadoop classpath) > ./spark-2.3.1-bin-without-hadoop/bin/spark-shell --master local > ... 
> scala> > spark.range(1).repartition(1).write.mode("overwrite").parquet("file:///tmp/test.parquet") > {noformat} > Exception: > {noformat} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194) > ... 69 more > Caused by: org.apache.spark.SparkException: Task failed while writing rows. 
> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.UnsatisfiedLinkError: > org.xerial.snappy.SnappyNative.maxCompressedLength(I)I > at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method) > at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316) > at > org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67) > at > org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81) > at > org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92) > at > org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112) > at > org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93) > at > org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150) > at > org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238) > at > org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121) > at > org.apache.parquet.ha
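The {{UnsatisfiedLinkError}} on {{SnappyNative.maxCompressedLength}} can indicate that an incompatible snappy-java jar picked up from the Hadoop classpath shadows the one Parquet expects. A hedged diagnostic sketch — {{find_snappy_jars}} is a hypothetical helper, not Spark or Hadoop tooling — for spotting which jars are competing on a Unix-style classpath such as {{SPARK_DIST_CLASSPATH}}:

```python
import os

def find_snappy_jars(classpath):
    """Return snappy-java jar file names in classpath order (the first one wins)."""
    hits = []
    for entry in classpath.split(":"):  # ':' is the Unix classpath separator
        name = os.path.basename(entry)
        if name.startswith("snappy-java") and name.endswith(".jar"):
            hits.append(name)
    return hits
```

If two versions show up, the earlier entry is the one the JVM loads; wildcard entries like {{/path/lib/*}} would need to be expanded before checking.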
[jira] [Assigned] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files
[ https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24927: Assignee: Apache Spark
[jira] [Assigned] (SPARK-24878) Fix reverse function for array type of primitive type containing null.
[ https://issues.apache.org/jira/browse/SPARK-24878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24878: --- Assignee: Takuya Ueshin
[jira] [Assigned] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files
[ https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24927: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files
[ https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16557603#comment-16557603 ] Cheng Lian commented on SPARK-24927: Downgraded from blocker to major, since it's not a regression. Just realized that this issue has existed since at least 1.6.