[jira] [Commented] (SPARK-24288) Enable preventing predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559339#comment-16559339 ]

Xiao Li commented on SPARK-24288:
---------------------------------

[~hyukjin.kwon] I updated the JIRA description.

> Enable preventing predicate pushdown
> ------------------------------------
>
>                 Key: SPARK-24288
>                 URL: https://issues.apache.org/jira/browse/SPARK-24288
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Tomasz Gawęda
>            Assignee: Maryann Xue
>            Priority: Major
>             Fix For: 2.4.0
>
>         Attachments: SPARK-24288.simple.patch
>
> Issue discussed on the mailing list:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html
>
> While working with the JDBC data source I saw that many "or" clauses with
> non-equality operators cause huge performance degradation of the SQL query
> sent to the database (DB2). For example:
> {code}
> val df = spark.read.format("jdbc").(other options to parallelize load).load()
> df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > 100)").show()
> // in the real application the pushed predicates were many lines long, with many ANDs and ORs
> {code}
> If I use cache() before the where, there is no predicate pushdown of this
> "where" clause. However, in a production system caching many sources is a
> waste of memory (especially if the pipeline is long and I must cache many
> times). There are also a few more workarounds, but it would be great if Spark
> supported letting the user prevent predicate pushdown.
>
> For example: df.withAnalysisBarrier().where(...) ?
>
> Note that this should not be a global configuration option. If I read two
> DataFrames, df1 and df2, I would like to specify that df1 should not have
> some predicates pushed down (though some may be), while df2 should have all
> predicates pushed down, even if the target query joins df1 and df2. As far as I
> understand the Spark optimizer, if we use functions like `withAnalysisBarrier`
> and put AnalysisBarrier explicitly in the logical plan, then predicates won't
> be pushed down into that particular DataFrame while pushdown will still be
> possible on the second one.
>
> Update: The solution is to add a JDBC option "pushDownPredicate" (default
> true) to allow/disallow predicate push-down in the JDBC data source.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
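The effect of the `pushDownPredicate` option can be modeled outside Spark. Below is a toy Python sketch (the function and variable names are hypothetical, not Spark's JDBC internals): with push-down enabled the predicate is evaluated "at the source", and with it disabled the source returns every row and the engine filters afterwards, which is exactly the trade the reporter wants for complex OR-heavy predicates.

```python
# Toy model of the pushDownPredicate gate (hypothetical names, not Spark code):
# when push-down is disabled, the "database" returns all rows and the engine
# applies the predicate itself, so no complex WHERE clause reaches the source.

rows = [{"x": 50}, {"x": 150}, {"x": 300}]

def scan(predicate, push_down_predicate=True):
    """Simulate a JDBC scan; return (rows fetched from the source, final rows)."""
    if push_down_predicate:
        fetched = [r for r in rows if predicate(r)]   # predicate runs "in the database"
        return fetched, fetched
    fetched = list(rows)                              # full scan; predicate runs in the engine
    return fetched, [r for r in fetched if predicate(r)]

pred = lambda r: r["x"] > 100
fetched_pd, result_pd = scan(pred, push_down_predicate=True)
fetched_np, result_np = scan(pred, push_down_predicate=False)
assert result_pd == result_np              # same answer either way
assert len(fetched_np) > len(fetched_pd)   # without push-down the source does more I/O
```

Either way the query result is identical; only where the filtering work happens (and how ugly the generated SQL gets) changes, which is why a per-DataFrame option rather than a global flag was requested.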
[jira] [Updated] (SPARK-24288) Enable preventing predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-24288:
----------------------------
    Description: the description quoted in the comment above, with this
paragraph appended:

Update: The solution is to add a JDBC option "pushDownPredicate" (default
true) to allow/disallow predicate push-down in the JDBC data source.
[jira] [Resolved] (SPARK-24288) Enable preventing predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-24288.
-----------------------------
       Resolution: Fixed
         Assignee: Maryann Xue
    Fix Version/s: 2.4.0
[jira] [Resolved] (SPARK-24865) Remove AnalysisBarrier
[ https://issues.apache.org/jira/browse/SPARK-24865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-24865.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

Issue resolved by pull request 21822
https://github.com/apache/spark/pull/21822

> Remove AnalysisBarrier
> ----------------------
>
>                 Key: SPARK-24865
>                 URL: https://issues.apache.org/jira/browse/SPARK-24865
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>            Priority: Major
>             Fix For: 2.4.0
>
> AnalysisBarrier was introduced in SPARK-20392 to improve analysis speed
> (don't re-analyze nodes that have already been analyzed).
>
> Before AnalysisBarrier, we already had some infrastructure in place, with
> analysis-specific functions (resolveOperators and resolveExpressions). These
> functions do not recursively traverse down subplans that are already analyzed
> (tracked with a mutable boolean flag, _analyzed). The issue with the old
> system was that developers started using transformDown, which does a top-down
> traversal of the plan tree, because there was no top-down resolution
> function, and as a result analyzer performance became pretty bad.
>
> In order to fix the issue in SPARK-20392, AnalysisBarrier was introduced as a
> special node, and for this special node transform/transformUp/transformDown
> don't traverse down. However, the introduction of this special node caused a
> lot more trouble than it solved. This implicit node breaks assumptions and
> code in a few places, and it's hard to know when an analysis barrier would
> exist and when it wouldn't. Just a simple search for AnalysisBarrier in PR
> discussions demonstrates it is a source of bugs and additional complexity.
>
> Instead, I think a much simpler fix to the original issue is to introduce
> resolveOperatorsDown, and change all places that call transformDown in the
> analyzer to use that. We can also ban accidental uses of the various
> transform* methods by using a linter (which can lint only specific packages),
> or in test mode inspect the stack trace and fail explicitly if transform* is
> called in the analyzer.
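The traversal difference the proposal relies on can be sketched in a few lines. This is a hypothetical Python reduction, not Catalyst code: `transform_down` visits every node of the plan tree in pre-order, while `resolve_operators_down` prunes any subtree already flagged as analyzed, which is what restores the old performance characteristics without an explicit barrier node.

```python
# Sketch (hypothetical, heavily simplified) of transformDown vs. the proposed
# resolveOperatorsDown: both are pre-order traversals, but the resolve* variant
# does not descend into subtrees already marked analyzed.

class Node:
    def __init__(self, name, children=(), analyzed=False):
        self.name, self.children, self.analyzed = name, list(children), analyzed

def transform_down(node, visit, log):
    log.append(node.name)      # always visits, even re-analyzing finished subtrees
    visit(node)
    for c in node.children:
        transform_down(c, visit, log)

def resolve_operators_down(node, visit, log):
    if node.analyzed:          # skip the whole subtree: resolved in an earlier pass
        return
    log.append(node.name)
    visit(node)
    for c in node.children:
        resolve_operators_down(c, visit, log)

plan = Node("Project", [Node("Join", [Node("Scan1", analyzed=True), Node("Scan2")])])
a, b = [], []
transform_down(plan, lambda n: None, a)
resolve_operators_down(plan, lambda n: None, b)
assert a == ["Project", "Join", "Scan1", "Scan2"]
assert b == ["Project", "Join", "Scan2"]   # analyzed subtree skipped
```

The skip lives in the traversal function rather than in a special plan node, so no rule in the analyzer can be surprised by an extra node appearing in the tree.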
[jira] [Assigned] (SPARK-24865) Remove AnalysisBarrier
[ https://issues.apache.org/jira/browse/SPARK-24865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-24865:
-----------------------------------
    Assignee: Reynold Xin
[jira] [Commented] (SPARK-11083) insert overwrite table failed when beeline reconnect
[ https://issues.apache.org/jira/browse/SPARK-11083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559297#comment-16559297 ]

readme_kylin commented on SPARK-11083:
--------------------------------------

Is anyone working on this issue? Spark 2.1.0 Thrift Server:
{code}
java.lang.reflect.InvocationTargetException
  at sun.reflect.GeneratedMethodAccessor122.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:716)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:672)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:672)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:672)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
  at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:230)
  at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:229)
  at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:272)
  at org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:671)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:741)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:739)
  at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:739)
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95)
  at org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:739)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:323)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:170)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:347)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
  at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)
  at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:220)
  at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:163)
  at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:160)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
  at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:173)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs
{code}

> insert overwrite table failed when beeline reconnect
> ----------------------------------------------------
>
>                 Key: SPARK-11083
>                 URL: https://issues.apache.org/jira/browse/SPARK-11083
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>        Environment: Spark: master branch
>                     Hadoop: 2.7.1
>                     JDK: 1.8.0_60
>            Reporter: Weizhong
>            Assignee: Davies Liu
>            Priority: Major
>
> 1. Start Thriftserver
> 2. Use
[jira] [Resolved] (SPARK-24929) Merge script swallow KeyboardInterrupt
[ https://issues.apache.org/jira/browse/SPARK-24929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-24929.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

Issue resolved by pull request 21880
https://github.com/apache/spark/pull/21880

> Merge script swallow KeyboardInterrupt
> --------------------------------------
>
>                 Key: SPARK-24929
>                 URL: https://issues.apache.org/jira/browse/SPARK-24929
>             Project: Spark
>          Issue Type: Improvement
>          Components: Project Infra
>    Affects Versions: 2.4.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Trivial
>             Fix For: 2.4.0
>
> If I try to get out of the JIRA-assignee loop with Command+C
> (KeyboardInterrupt), I am unable to, as below:
> {code}
> Error assigning JIRA, try again (or leave blank and fix manually)
> JIRA is unassigned, choose assignee
> [0] todd.chen (Reporter)
> Enter number of user, or userid, to assign to (blank to leave unassigned):Traceback (most recent call last):
>   File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee
>     "Enter number of user, or userid, to assign to (blank to leave unassigned):")
> KeyboardInterrupt
> Error assigning JIRA, try again (or leave blank and fix manually)
> JIRA is unassigned, choose assignee
> [0] todd.chen (Reporter)
> Enter number of user, or userid, to assign to (blank to leave unassigned):Traceback (most recent call last):
>   File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee
>     "Enter number of user, or userid, to assign to (blank to leave unassigned):")
> KeyboardInterrupt
> {code}
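One plausible shape of this bug (a hypothetical reduction, not the actual `merge_spark_pr.py` code) is a retry loop whose exception handler is too broad: a bare `except:` catches `KeyboardInterrupt` along with the errors it was meant to retry, so Ctrl+C can never break out. Narrowing the handler to `Exception` leaves `BaseException` subclasses such as `KeyboardInterrupt` free to propagate.

```python
# Hypothetical reduction of the swallowed-interrupt pattern. The retry loop is
# meant to survive bad input, but a bare except also eats KeyboardInterrupt.

def retry_swallowing(action):
    for _ in range(3):
        try:
            return action()
        except:                 # swallows KeyboardInterrupt too -- the bug
            pass

def retry_fixed(action):
    for _ in range(3):
        try:
            return action()
        except Exception:       # KeyboardInterrupt is not an Exception subclass
            pass

def raises_interrupt():
    raise KeyboardInterrupt

assert retry_swallowing(raises_interrupt) is None   # interrupt silently eaten
interrupted = False
try:
    retry_fixed(raises_interrupt)
except KeyboardInterrupt:
    interrupted = True
assert interrupted                                  # Ctrl+C now escapes the loop
```

The same effect can be had by keeping a broad handler but re-raising `KeyboardInterrupt` explicitly; either way the interrupt must not be treated as a retryable error.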
[jira] [Assigned] (SPARK-24929) Merge script swallow KeyboardInterrupt
[ https://issues.apache.org/jira/browse/SPARK-24929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-24929:
------------------------------------
    Assignee: Hyukjin Kwon
[jira] [Resolved] (SPARK-24829) In Spark Thrift Server, CAST AS FLOAT inconsistent with spark-shell or spark-sql
[ https://issues.apache.org/jira/browse/SPARK-24829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-24829.
----------------------------------
       Resolution: Fixed
         Assignee: zuotingbing
    Fix Version/s: 2.4.0

Fixed in https://github.com/apache/spark/pull/21789

> In Spark Thrift Server, CAST AS FLOAT inconsistent with spark-shell or
> spark-sql
> --------------------------------------------------------------------------
>
>                 Key: SPARK-24829
>                 URL: https://issues.apache.org/jira/browse/SPARK-24829
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: zuotingbing
>            Assignee: zuotingbing
>            Priority: Major
>             Fix For: 2.4.0
>
>         Attachments: 2018-07-18_110944.png, 2018-07-18_11.png
>
> SELECT CAST('4.56' AS FLOAT)
> The result is 4.55942779541; it should be 4.56.
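The long decimal tail comes from 32-bit IEEE-754 representation: SQL `FLOAT` is a single-precision type and 4.56 has no exact binary form, so the nearest float32 sits slightly below 4.56; when a client widens that value back to a double and prints it, the tail appears. A small Python round-trip through 32 bits (a sketch of the representation issue, not of the Thrift Server fix itself) shows this:

```python
# Round-trip a value through a 32-bit IEEE-754 float to see what SQL FLOAT
# actually stores. 4.56 is not exactly representable in 32 bits, so the stored
# value differs from 4.56 in roughly the 8th significant digit.

import struct

def as_float32(x):
    """Pack a Python float (64-bit) into 4 bytes and unpack it again."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

v = as_float32(4.56)
assert v != 4.56              # not exactly representable in 32 bits
assert abs(v - 4.56) < 1e-6   # but very close; the inconsistency is in rendering
```

spark-shell and spark-sql round the value for display while the Thrift Server path did not, which is why the same cast looked different between clients.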
[jira] [Created] (SPARK-24942) Improve cluster resource management with jobs containing barrier stage
Jiang Xingbo created SPARK-24942:
------------------------------------

             Summary: Improve cluster resource management with jobs containing barrier stage
                 Key: SPARK-24942
                 URL: https://issues.apache.org/jira/browse/SPARK-24942
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: Jiang Xingbo

https://github.com/apache/spark/pull/21758#discussion_r205652317

We shall improve cluster resource management to address the following issues:
- With dynamic resource allocation enabled, it may happen that we acquire some
  executors (but not enough to launch all the tasks in a barrier stage) and
  later release them when the executor idle time expires, and then acquire
  again.
- There can be deadlock between two concurrent applications. Each application
  may acquire some resources, but not enough to launch all the tasks in a
  barrier stage. After hitting the idle timeout and releasing them, they may
  acquire resources again, but just continually trade resources between each
  other.
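Both problems above stem from partially acquiring resources for a stage whose tasks must all start together. The scheduling invariant can be stated in one line; this is a hypothetical sketch of the rule, not Spark's scheduler code:

```python
# Barrier semantics: all tasks of the stage start together or none do.
# Launching with fewer slots than tasks is what creates the idle-timeout churn
# and the cross-application resource trading described above. (Hypothetical
# reduction, not Spark scheduler code.)

def can_launch_barrier_stage(free_slots, num_tasks):
    """Gate a barrier stage on having a slot for every one of its tasks."""
    return free_slots >= num_tasks

assert can_launch_barrier_stage(free_slots=4, num_tasks=4)
assert not can_launch_barrier_stage(free_slots=3, num_tasks=4)
```

A real fix additionally has to decide when to hold partially acquired executors versus release them, which is the hard part the ticket is about.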
[jira] [Created] (SPARK-24941) Add RDDBarrier.coalesce() function
Jiang Xingbo created SPARK-24941:
------------------------------------

             Summary: Add RDDBarrier.coalesce() function
                 Key: SPARK-24941
                 URL: https://issues.apache.org/jira/browse/SPARK-24941
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: Jiang Xingbo

https://github.com/apache/spark/pull/21758#discussion_r204917245

The number of partitions from the input data can be unexpectedly large, e.g. if you do
{code}
sc.textFile(...).barrier().mapPartitions()
{code}
the number of input partitions is based on the HDFS input splits. We shall
provide a way in RDDBarrier to enable users to specify the number of tasks in
a barrier stage. Maybe something like RDDBarrier.coalesce(numPartitions: Int).
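What such a `coalesce(numPartitions)` has to do can be sketched outside Spark: deterministically fold an arbitrary number of input partitions into exactly `n` groups so the barrier stage has a predictable task count. The round-robin grouping below is a hypothetical illustration; Spark's actual coalescing also considers data locality.

```python
# Minimal sketch (hypothetical, outside Spark) of folding many input partitions
# into exactly num_partitions buckets so a barrier stage has a fixed size.

def coalesce(partitions, num_partitions):
    """Group `partitions` into `num_partitions` buckets, round-robin."""
    buckets = [[] for _ in range(num_partitions)]
    for i, p in enumerate(partitions):
        buckets[i % num_partitions].extend(p)
    return buckets

parts = [[1], [2], [3], [4], [5]]      # e.g. one partition per HDFS input split
out = coalesce(parts, 2)
assert len(out) == 2                   # barrier stage now has exactly 2 tasks
assert sorted(sum(out, [])) == [1, 2, 3, 4, 5]   # no records lost
```

The key property for barrier mode is the fixed `len(out)`: the scheduler can then reason about exactly how many slots the stage needs before launching it.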
[jira] [Resolved] (SPARK-24801) Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can waste a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-24801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Imran Rashid resolved SPARK-24801.
----------------------------------
       Resolution: Fixed
         Assignee: Misha Dmitriev
    Fix Version/s: 2.4.0

Fixed in commit
https://github.com/apache/spark/commit/094aa597155dfcbf41a2490c9e462415e3824901
from https://github.com/apache/spark/pull/21811

> Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can
> waste a lot of memory
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-24801
>                 URL: https://issues.apache.org/jira/browse/SPARK-24801
>             Project: Spark
>          Issue Type: Improvement
>          Components: YARN
>    Affects Versions: 2.3.0
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>            Priority: Major
>              Labels: memory-analysis
>             Fix For: 2.4.0
>
> I recently analyzed another YARN NM heap dump with jxray
> ([www.jxray.com|http://www.jxray.com]) and found that 81% of memory is
> wasted by empty (all-zero) byte[] arrays. Most of these arrays are
> referenced by {{org.apache.spark.network.util.ByteArrayWritableChannel.data}},
> and these in turn come from
> {{spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel}}. Here is
> the full reference chain that leads to the problematic arrays:
> {code:java}
> 2,597,946K (64.1%): byte[]: 40583 / 100% of empty 2,597,946K (64.1%)
> ↖org.apache.spark.network.util.ByteArrayWritableChannel.data
> ↖org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel
> ↖io.netty.channel.ChannelOutboundBuffer$Entry.msg
> ↖io.netty.channel.ChannelOutboundBuffer$Entry.{next}
> ↖io.netty.channel.ChannelOutboundBuffer.flushedEntry
> ↖io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer
> ↖io.netty.channel.socket.nio.NioSocketChannel.unsafe
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance{code}
>
> Checking the code of {{SaslEncryption$EncryptedMessage}}, I see that
> byteChannel is always initialized eagerly in the constructor:
> {code:java}
> this.byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize);{code}
> So I think to address the problem of empty byte[] arrays flooding the memory,
> we should initialize {{byteChannel}} lazily, upon first use. As far as I can
> see, it's used in only one method, {{private void nextChunk()}}.
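The proposed lazy initialization is a standard pattern. Below is a hypothetical Python reduction of the Java change (field and class names borrowed from the issue for illustration only): the buffer is allocated on first access instead of in the constructor, so messages that are never chunked never pay for a `maxOutboundBlockSize`-sized zeroed array.

```python
# Lazy-initialization sketch (hypothetical Python reduction of the Java fix):
# allocate the byte channel on first use rather than eagerly in the
# constructor, so idle messages hold no large zeroed buffer.

class EncryptedMessage:
    def __init__(self, max_outbound_block_size):
        self._size = max_outbound_block_size
        self._byte_channel = None        # was: eagerly bytearray(max_outbound_block_size)

    @property
    def byte_channel(self):
        if self._byte_channel is None:   # allocated only when a chunk is produced
            self._byte_channel = bytearray(self._size)
        return self._byte_channel

msg = EncryptedMessage(32 * 1024)
assert msg._byte_channel is None         # no allocation until first use
assert len(msg.byte_channel) == 32 * 1024
assert msg._byte_channel is not None     # allocated exactly on first access
```

Since the issue notes the buffer is used in only one method (`nextChunk()`), moving the allocation to that single call site is low-risk.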
[jira] [Assigned] (SPARK-24932) Allow update mode for streaming queries with join
[ https://issues.apache.org/jira/browse/SPARK-24932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24932:
------------------------------------
    Assignee: (was: Apache Spark)

> Allow update mode for streaming queries with join
> -------------------------------------------------
>
>                 Key: SPARK-24932
>                 URL: https://issues.apache.org/jira/browse/SPARK-24932
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.3.0, 2.3.1
>            Reporter: Eric Fu
>            Priority: Major
>
> In issue SPARK-19140 we supported update output mode for non-aggregation
> streaming queries. This should also be applied to streaming join to keep
> semantics consistent.
> PS. The streaming join feature was added after SPARK-19140.
[jira] [Commented] (SPARK-24932) Allow update mode for streaming queries with join
[ https://issues.apache.org/jira/browse/SPARK-24932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559181#comment-16559181 ]

Apache Spark commented on SPARK-24932:
--------------------------------------

User 'fuyufjh' has created a pull request for this issue:
https://github.com/apache/spark/pull/21890
[jira] [Updated] (SPARK-24932) Allow update mode for streaming queries with join
[ https://issues.apache.org/jira/browse/SPARK-24932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Fu updated SPARK-24932:
----------------------------
    Description: the description quoted above, with the following appended:

When using update _output_ mode the join will work exactly as in _append_
mode. However, this will, for example, allow the user to run an
aggregation-after-join query in update mode in order to get a more real-time
result output.
[jira] [Assigned] (SPARK-24932) Allow update mode for streaming queries with join
[ https://issues.apache.org/jira/browse/SPARK-24932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24932: Assignee: Apache Spark > Allow update mode for streaming queries with join > - > > Key: SPARK-24932 > URL: https://issues.apache.org/jira/browse/SPARK-24932 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1 >Reporter: Eric Fu >Assignee: Apache Spark >Priority: Major > > In issue SPARK-19140 we supported update output mode for non-aggregation > streaming queries. This should also be applied to streaming join to keep > semantic consistent. > PS. Streaming join feature is added after SPARK-19140. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4502) Spark SQL reads unnecessary nested fields from Parquet
[ https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559083#comment-16559083 ] Apache Spark commented on SPARK-4502: - User 'ajacques' has created a pull request for this issue: https://github.com/apache/spark/pull/21889 > Spark SQL reads unnecessary nested fields from Parquet > -- > > Key: SPARK-4502 > URL: https://issues.apache.org/jira/browse/SPARK-4502 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Liwen Sun >Priority: Critical > > When reading a field of a nested column from Parquet, Spark SQL reads and > assembles all the fields of that nested column. This is unnecessary, as > Parquet supports fine-grained field reads out of a nested column. This may > degrade performance significantly when a nested column has many fields. > For example, I loaded JSON tweets data into Spark SQL and ran the following > query: > {{SELECT User.contributors_enabled from Tweets;}} > User is a nested structure that has 38 primitive fields (for the Tweets schema, > see: https://dev.twitter.com/overview/api/tweets); here is the log message: > {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed > 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 > cell/ms}} > For comparison, I also ran: > {{SELECT User FROM Tweets;}} > And here is the log message: > {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed > 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}} > So both queries load 38 columns from Parquet, while the first query only > needs 1 column. I also measured the bytes read within Parquet. In these two > cases, the same number of bytes (99365194 bytes) were read.
[jira] [Updated] (SPARK-24940) Coalesce Hint for SQL Queries
[ https://issues.apache.org/jira/browse/SPARK-24940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-24940: --- Summary: Coalesce Hint for SQL Queries (was: Coalesce Hint for SQL) > Coalesce Hint for SQL Queries > - > > Key: SPARK-24940 > URL: https://issues.apache.org/jira/browse/SPARK-24940 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: John Zhuge >Priority: Major > > Many Spark SQL users in my company have asked for a way to control the number > of output files in Spark SQL. The users prefer not to use function > repartition\(n\) or coalesce(n, shuffle) that require them to write and > deploy Scala/Java/Python code. > > There are use cases to either reduce or increase the number. > > The DataFrame API has repartition/coalesce for a long time. However, we do > not have an equivalent functionality in SQL queries. We propose adding the > following Hive-style Coalesce hint to Spark SQL. > {noformat} > /*+ COALESCE(n, shuffle) */ > /*+ REPARTITION(n) */ > {noformat} > REPARTITION\(n\) is equal to COALESCE(n, shuffle=true). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24940) Coalesce Hint for SQL
[ https://issues.apache.org/jira/browse/SPARK-24940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-24940: --- Description: Many Spark SQL users in my company have asked for a way to control the number of output files in Spark SQL. The users prefer not to use function repartition\(n\) or coalesce(n, shuffle) that require them to write and deploy Scala/Java/Python code. There are use cases to either reduce or increase the number. The DataFrame API has repartition/coalesce for a long time. However, we do not have an equivalent functionality in SQL queries. We propose adding the following Hive-style Coalesce hint to Spark SQL. {noformat} /*+ COALESCE(n, shuffle) */ /*+ REPARTITION(n) */ {noformat} REPARTITION\(n\) is equal to COALESCE(n, shuffle=true). was: Many Spark SQL users in my company have asked for a way to control the number of output files in Spark SQL. The users prefer not to use function repartition(n) or coalesce(n, shuffle) that require them to write and deploy Scala/Java/Python code. There are use cases to either reduce or increase the number. The DataFrame API has repartition/coalesce for a long time. However, we do not have an equivalent functionality in SQL queries. We propose adding the following Hive-style Coalesce hint to Spark SQL. {noformat} /*+ COALESCE(n, shuffle) */ /*+ REPARTITION(n) */ {noformat} REPARTITION(n) is equal to COALESCE(n, shuffle=true). > Coalesce Hint for SQL > - > > Key: SPARK-24940 > URL: https://issues.apache.org/jira/browse/SPARK-24940 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: John Zhuge >Priority: Major > > Many Spark SQL users in my company have asked for a way to control the number > of output files in Spark SQL. The users prefer not to use function > repartition\(n\) or coalesce(n, shuffle) that require them to write and > deploy Scala/Java/Python code. > > There are use cases to either reduce or increase the number. 
> > The DataFrame API has had repartition/coalesce for a long time. However, we do > not have equivalent functionality in SQL queries. We propose adding the > following Hive-style Coalesce hint to Spark SQL. > {noformat} > /*+ COALESCE(n, shuffle) */ > /*+ REPARTITION(n) */ > {noformat} > REPARTITION\(n\) is equivalent to COALESCE(n, shuffle=true).
[jira] [Created] (SPARK-24940) Coalesce Hint for SQL
John Zhuge created SPARK-24940: -- Summary: Coalesce Hint for SQL Key: SPARK-24940 URL: https://issues.apache.org/jira/browse/SPARK-24940 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.1 Reporter: John Zhuge Many Spark SQL users in my company have asked for a way to control the number of output files in Spark SQL. The users prefer not to use the functions repartition(n) or coalesce(n, shuffle), which require them to write and deploy Scala/Java/Python code. There are use cases to either reduce or increase the number. The DataFrame API has had repartition/coalesce for a long time. However, we do not have equivalent functionality in SQL queries. We propose adding the following Hive-style Coalesce hint to Spark SQL. {noformat} /*+ COALESCE(n, shuffle) */ /*+ REPARTITION(n) */ {noformat} REPARTITION(n) is equivalent to COALESCE(n, shuffle=true).
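To make the proposal concrete, here is one way the hint might appear in a query. This is purely illustrative: the hint syntax is the proposal above, not released behavior, and the table names are hypothetical.

```sql
-- Hypothetical usage of the proposed hint: produce roughly 10 output files
-- without calling DataFrame.coalesce(10) from Scala/Java/Python code.
INSERT OVERWRITE TABLE dst
SELECT /*+ COALESCE(10) */ * FROM src;
```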
[jira] [Resolved] (SPARK-24919) Scala linter rule for sparkContext.hadoopConfiguration
[ https://issues.apache.org/jira/browse/SPARK-24919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24919. - Resolution: Fixed Assignee: Gengliang Wang Fix Version/s: 2.4.0 > Scala linter rule for sparkContext.hadoopConfiguration > -- > > Key: SPARK-24919 > URL: https://issues.apache.org/jira/browse/SPARK-24919 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > > In most cases, we should use spark.sessionState.newHadoopConf() instead of > sparkContext.hadoopConfiguration, so that the hadoop configurations specified > in Spark session > configuration will come into effect. > Add a rule matching spark.sparkContext.hadoopConfiguration or > spark.sqlContext.sparkContext.hadoopConfiguration to prevent the usage. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
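The rule described above amounts to flagging two specific call chains. A minimal sketch of such a check in plain Java (the real rule is a Scalastyle regex in Spark's build configuration; the class and method names here are illustrative):

```java
import java.util.regex.Pattern;

// Illustrative checker in the spirit of the linter rule described above:
// flag direct uses of sparkContext.hadoopConfiguration, which bypass
// Hadoop settings made in the Spark session configuration.
public class HadoopConfRule {
    private static final Pattern BANNED = Pattern.compile(
        "spark(\\.sqlContext)?\\.sparkContext\\.hadoopConfiguration");

    // Returns true when the line should be rejected by the linter.
    public static boolean violates(String line) {
        return BANNED.matcher(line).find();
    }
}
```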
[jira] [Commented] (SPARK-24253) DataSourceV2: Add DeleteSupport for delete and overwrite operations
[ https://issues.apache.org/jira/browse/SPARK-24253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558971#comment-16558971 ] Apache Spark commented on SPARK-24253: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/21888 > DataSourceV2: Add DeleteSupport for delete and overwrite operations > --- > > Key: SPARK-24253 > URL: https://issues.apache.org/jira/browse/SPARK-24253 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > Implementing delete and overwrite logical plans requires an API to delete > data from a data source. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-24937) Datasource partition table should load empty partitions
[ https://issues.apache.org/jira/browse/SPARK-24937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24937: Comment: was deleted (was: I'm working on.) > Datasource partition table should load empty partitions > --- > > Key: SPARK-24937 > URL: https://issues.apache.org/jira/browse/SPARK-24937 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:sql} > spark-sql> CREATE TABLE tbl AS SELECT 1; > spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > > USING parquet > > PARTITIONED BY (day, hour); > spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > spark-sql> SHOW PARTITIONS tbl1; > spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > > PARTITIONED BY (day STRING, hour STRING); > 18/07/26 22:49:20 WARN HiveMetaStore: Location: > file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external > table:tbl2 > spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 > 18/07/26 22:49:36 WARN log: Updated size to 0 > spark-sql> SHOW PARTITIONS tbl2; > day=2018-07-25/hour=01 > spark-sql> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24939) Support YARN Shared Cache in Spark
Jonathan Bender created SPARK-24939: --- Summary: Support YARN Shared Cache in Spark Key: SPARK-24939 URL: https://issues.apache.org/jira/browse/SPARK-24939 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 2.3.1 Reporter: Jonathan Bender https://issues.apache.org/jira/browse/YARN-1492 introduced support for the YARN Shared Cache, which, when configured, allows clients to cache submitted application resources (jars, archives) in HDFS and avoid having to re-upload them for successive jobs. MapReduce YARN applications support this feature; it would be great to add support for it in Spark as well.
[jira] [Commented] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558829#comment-16558829 ] Imran Rashid commented on SPARK-24918: -- The only thing I *really* needed was just to be able to instantiate some arbitrary class when the executor starts up. My instrumentation code could do the rest via reflection from there. But I might want more eventually, e.g. with task start & end events, I could imagine setting something up to periodically take stack traces only if there is a task running in stage X, or running for longer than Y seconds, etc. > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. It's hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so you don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your Spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding Spark). > I think one tricky part here is just deciding the API, and how it's versioned. > Does it just get created when the executor starts, and that's it? Or does it > get more specific events, like task start, task end, etc.? Would we ever add > more events?
It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but it should still be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver, as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, it's still good enough).
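The design space described in this ticket — a class instantiated at executor startup, possibly gaining task events later, with no-op defaults so compatibility survives new events — could look roughly like the following. This is a purely hypothetical sketch; the actual API was still undecided at this point in the discussion.

```java
// Hypothetical plugin interface matching the discussion above. Default
// no-op methods mean adding events later would not break implementations.
public interface ExecutorPlugin {
    // Called once when the executor starts; enough for the reflection-based
    // instrumentation use case described in the comment.
    default void init() {}
    // Possible richer events floated in the discussion:
    default void onTaskStart() {}
    default void onTaskEnd() {}
    // Called when the executor shuts down.
    default void shutdown() {}
}
```

An instrumentation plugin would then override only the hooks it needs and ignore the rest.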
[jira] [Updated] (SPARK-24801) Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can waste a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-24801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-24801: - Labels: memory-analysis (was: ) > Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can > waste a lot of memory > --- > > Key: SPARK-24801 > URL: https://issues.apache.org/jira/browse/SPARK-24801 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Misha Dmitriev >Priority: Major > Labels: memory-analysis > > I recently analyzed another Yarn NM heap dump with jxray > ([www.jxray.com),|http://www.jxray.com),/] and found that 81% of memory is > wasted by empty (all zeroes) byte[] arrays. Most of these arrays are > referenced by > {{org.apache.spark.network.util.ByteArrayWritableChannel.data}}, and these in > turn come from > {{spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel}}. Here is > the full reference chain that leads to the problematic arrays: > {code:java} > 2,597,946K (64.1%): byte[]: 40583 / 100% of empty 2,597,946K (64.1%) > ↖org.apache.spark.network.util.ByteArrayWritableChannel.data > ↖org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel > ↖io.netty.channel.ChannelOutboundBuffer$Entry.msg > ↖io.netty.channel.ChannelOutboundBuffer$Entry.{next} > ↖io.netty.channel.ChannelOutboundBuffer.flushedEntry > ↖io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer > ↖io.netty.channel.socket.nio.NioSocketChannel.unsafe > ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel > ↖{java.util.concurrent.ConcurrentHashMap}.values > ↖org.apache.spark.network.server.OneForOneStreamManager.streams > ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager > ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler > ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance{code} > > Checking the code of 
{{SaslEncryption$EncryptedMessage}}, I see that > byteChannel is always initialized eagerly in the constructor: > {code:java} > this.byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize);{code} > So I think to address the problem of empty byte[] arrays flooding the memory, > we should initialize {{byteChannel}} lazily, upon the first use. As far as I > can see, it's used only in one method, {{private void nextChunk()}}. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
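The lazy-initialization fix suggested above can be sketched in plain Java. The class and method names here are illustrative stand-ins, not Spark's actual {{SaslEncryption}} code:

```java
// Illustrative sketch: defer the byte[] allocation until the buffer is
// first needed, so messages that never reach nextChunk() cost no memory.
public class LazyByteChannel {
    private final int maxOutboundBlockSize;
    private byte[] data; // stays null until first use

    public LazyByteChannel(int maxOutboundBlockSize) {
        this.maxOutboundBlockSize = maxOutboundBlockSize;
    }

    public boolean isAllocated() {
        return data != null;
    }

    // The single call site (analogous to nextChunk()) triggers allocation.
    public byte[] buffer() {
        if (data == null) {
            data = new byte[maxOutboundBlockSize];
        }
        return data;
    }
}
```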
[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools
[ https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558798#comment-16558798 ] Imran Rashid commented on SPARK-24938: -- This should be an easy change to make; it's just a question of running a test. I keep meaning to do it, but have too many other things in flight, so anybody is welcome to do it. You could use SPARK-24918 and https://github.com/squito/spark-memory to check the memory usage before and after. > Understand usage of netty's onheap memory use, even with offheap pools > -- > > Key: SPARK-24938 > URL: https://issues.apache.org/jira/browse/SPARK-24938 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > We've observed that netty uses a large amount of onheap memory in its pools, in > addition to the expected offheap memory, when I added some instrumentation > (using SPARK-24918 and https://github.com/squito/spark-memory). We should > figure out why it's using that memory, and whether it's really necessary. > It might be just this one line: > https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82 > which means that even with a small burst of messages, each arena will grow by > 16MB, which could lead to a 128 MB spike of an almost entirely unused pool. > Switching to requesting a buffer from the default pool would probably fix > this.
[jira] [Created] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools
Imran Rashid created SPARK-24938: Summary: Understand usage of netty's onheap memory use, even with offheap pools Key: SPARK-24938 URL: https://issues.apache.org/jira/browse/SPARK-24938 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0 Reporter: Imran Rashid We've observed that netty uses a large amount of onheap memory in its pools, in addition to the expected offheap memory, when I added some instrumentation (using SPARK-24918 and https://github.com/squito/spark-memory). We should figure out why it's using that memory, and whether it's really necessary. It might be just this one line: https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82 which means that even with a small burst of messages, each arena will grow by 16MB, which could lead to a 128 MB spike of an almost entirely unused pool. Switching to requesting a buffer from the default pool would probably fix this.
[jira] [Assigned] (SPARK-23633) Update Pandas UDFs section in sql-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23633: Assignee: (was: Apache Spark) > Update Pandas UDFs section in sql-programming-guide > > > Key: SPARK-23633 > URL: https://issues.apache.org/jira/browse/SPARK-23633 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Li Jin >Priority: Major > > Let's make sure sql-programming-guide is up-to-date before 2.4 release. > https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#pandas-udfs-aka-vectorized-udfs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23633) Update Pandas UDFs section in sql-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558796#comment-16558796 ] Apache Spark commented on SPARK-23633: -- User 'icexelloss' has created a pull request for this issue: https://github.com/apache/spark/pull/21887 > Update Pandas UDFs section in sql-programming-guide > > > Key: SPARK-23633 > URL: https://issues.apache.org/jira/browse/SPARK-23633 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Li Jin >Priority: Major > > Let's make sure sql-programming-guide is up-to-date before 2.4 release. > https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#pandas-udfs-aka-vectorized-udfs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23633) Update Pandas UDFs section in sql-programming-guide
[ https://issues.apache.org/jira/browse/SPARK-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23633: Assignee: Apache Spark > Update Pandas UDFs section in sql-programming-guide > > > Key: SPARK-23633 > URL: https://issues.apache.org/jira/browse/SPARK-23633 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Li Jin >Assignee: Apache Spark >Priority: Major > > Let's make sure sql-programming-guide is up-to-date before 2.4 release. > https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#pandas-udfs-aka-vectorized-udfs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24795) Implement barrier execution mode
[ https://issues.apache.org/jira/browse/SPARK-24795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24795. - Resolution: Fixed Assignee: Jiang Xingbo Fix Version/s: 2.4.0 > Implement barrier execution mode > > > Key: SPARK-24795 > URL: https://issues.apache.org/jira/browse/SPARK-24795 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Major > Fix For: 2.4.0 > > > Implement barrier execution mode, as described in SPARK-24582 > Include all the API changes and basic implementation (except for > BarrierTaskContext.barrier()) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14543) SQL/Hive insertInto has unexpected results
[ https://issues.apache.org/jira/browse/SPARK-14543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-14543. --- Resolution: Later This is addressed by SPARK-24251 for DataSourceV2 writers. > SQL/Hive insertInto has unexpected results > -- > > Key: SPARK-14543 > URL: https://issues.apache.org/jira/browse/SPARK-14543 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > > *Updated description* > There should be an option to match input data to output columns by name. The > API allows operations on tables, which hide the column resolution problem. > It's easy to copy from one table to another without listing the columns, and > in the API it is common to work with columns by name rather than by position. > I think the API should add a way to match columns by name, which is closer to > what users expect. I propose adding something like this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} > *Original description* > The Hive write path adds a pre-insertion cast (projection) to reconcile > incoming data columns with the outgoing table schema. Columns are matched by > position and casts are inserted to reconcile the two column schemas. > When columns aren't correctly aligned, this causes unexpected results. I ran > into this by not using a correct {{partitionBy}} call (addressed by > SPARK-14459), which caused an error message that an int could not be cast to > an array. However, if the columns are vaguely compatible, for example string > and float, then no error or warning is produced and data is written to the > wrong columns using unexpected casts (string -> bigint -> float). > A real-world use case that will hit this is when a table definition changes > by adding a column in the middle of a table. 
Spark SQL statements that copied > from that table to a destination table will then map the columns differently > but insert casts that mask the problem. The last column's data will be > dropped without a reliable warning for the user. > This highlights a few problems: > * Too many or too few incoming data columns should cause an AnalysisException > to be thrown > * Only "safe" casts should be inserted automatically, like int -> long, using > UpCast > * Pre-insertion casts currently ignore extra columns by using zip > * The pre-insertion cast logic differs between Hive's MetastoreRelation and > LogicalRelation > Also, I think there should be an option to match input data to output columns > by name. The API allows operations on tables, which hide the column > resolution problem. It's easy to copy from one table to another without > listing the columns, and in the API it is common to work with columns by name > rather than by position. I think the API should add a way to match columns by > name, which is closer to what users expect. I propose adding something like > this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24882) separate responsibilities of the data source v2 read API
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558666#comment-16558666 ] Ryan Blue edited comment on SPARK-24882 at 7/26/18 6:19 PM: [~cloud_fan], I'm adding some suggestions here because comments on the doc are good for discussion, but not really for longer content. I like the separation of the batch, micro-batch, and streaming classes. That works well. I also like the addition of the Metadata class, though I'd use a more specific name. There are a few naming changes I would make to be more specific and to preserve existing names or naming conventions: * Instead of ReaderProvider, I think we should use ReaderFactory because that name corresponds to the write path and accurately describes the class * I think we should continue to use InputPartition instead of InputSplit, even if we introduce a reader factory. (Probably also rename SplitReader to PartitionReader.) * Metadata isn't specific so I think we should use ScanConfig instead * getSplits should be planSplits because "get" implies a quick operation (like returning a field's value) in Java. This is also consistent with the current API. Next, instead of using mix-ins on Metadata / ScanConfig, I think a builder would help clarify the order of operations. If ScanConfig is mutable, then it could be passed to the other methods in different states. I'd rather use a Builder to make it Immutable. That way, implementations know that the Metadata / ScanConfig doesn't change between calls to estimateStatistics and getSplits so results can be cached. To make this work, Spark would provide a Builder interface with default methods that do nothing. To implement pushdown, users just need to implement the methods. This also allows us to add new pushdown methods (like pushLimit) without introducing new interfaces. 
I'd also like to see the classes reorganized a little to reduce the overall number of interfaces: Metadata / ScanConfig contains all of the state that the DataSourceReader used to hold. If the DataSourceReader has no state, then its methods should be provided by a single instance of the source instead. That would change the API to get rid of the Reader level and merge it into ReadSupport. Then ReadSupport would be used to create a ScanConfig, and then BatchReadSupport (or similar) would be used to plan splits and get reader factories. I think this is easier for implementations. {code:lang=java} public interface ReadSupport { ScanConfig.Builder newScanBuilder(); } public interface ReportsStatistics extends ReadSupport { Statistics estimateStatistics(ScanConfig) } public interface BatchReadSupport extends ReadSupport { InputPartition[] planSplits(ScanConfig) ReaderFactory readerFactory() } public interface MicroBatchReadSupport extends ReadSupport { InputPartition[] planSplits(ScanConfig, Offset start, Offset end) Offset initialOffset() MicroBatchReaderFactory readerFactory() } public interface ContinuousReadSupport extends ReadSupport { InputPartition[] planSplits(ScanConfig, Offset start) Offset initialOffset() ContinuousReaderFactory readerFactory() } {code} Note that this change also cleans up the confusion around the use of Reader: the only Reader is a SplitReader that returns rows or row batches. I would keep the same structure that you have for micro-batch, continuous, and batch ReaderFactory and SplitReader. Here's a sketch of the ScanConfig and Builder I mentioned above: {code:lang=java} public interface ScanConfig { StructType schema() Filter[] pushedFilters() Expression[] pushedPredicates() // by default, the Builder doesn't push anything public interface Builder { Builder pushProjection(...) Builder pushFilters(...) default Builder pushPredicates(...) { return this; } Builder pushLimit(...) 
ScanConfig build() } } {code} was (Author: rdblue): [~cloud_fan], I'm adding some suggestions here because comments on the doc are good for discussion, but not really for longer content. I like the separation of the batch, micro-batch, and streaming classes. That works well. I also like the addition of the Metadata class, though I'd use a more specific name. There are a few naming changes I would make to be more specific and to preserve existing names or naming conventions: * Instead of ReaderProvider, I think we should use ReaderFactory because that name corresponds to the write path and accurately describes the class * I think we should continue to use InputPartition instead of InputSplit, even if we introduce a reader factory. (Probably also rename SplitReader to PartitionReader.) * Metadata isn't specific so I think we should use ScanConfig instead * getSplits should be planSplits because get implies a quick operation in Java. This is also consistent with the current API. Next,
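The immutable-Builder contract sketched above can be made concrete with a tiny self-contained example. All names below are hypothetical stand-ins for the proposed interfaces, not actual Spark APIs: the Builder's push methods default to no-ops, so a source only overrides what it can actually push down, and build() returns an immutable config whose state cannot change between estimateStatistics and planSplits.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the proposed ScanConfig/Builder contract.
final class ScanConfigSketch {
    private final List<String> pushedFilters;

    private ScanConfigSketch(List<String> pushedFilters) {
        // Immutable: the accepted pushdowns are fixed at build() time.
        this.pushedFilters = pushedFilters;
    }

    public List<String> pushedFilters() {
        return pushedFilters;
    }

    interface Builder {
        // Default implementation pushes nothing, mirroring the idea that
        // new pushdown methods can be added later without new mix-in
        // interfaces: sources that don't override them are unaffected.
        default Builder pushFilters(List<String> filters) {
            return this;
        }

        ScanConfigSketch build();
    }

    // A source that supports filter pushdown overrides pushFilters.
    static final class FilteringBuilder implements Builder {
        private List<String> accepted = Collections.emptyList();

        @Override
        public Builder pushFilters(List<String> filters) {
            this.accepted = filters;
            return this;
        }

        @Override
        public ScanConfigSketch build() {
            return new ScanConfigSketch(accepted);
        }
    }

    public static void main(String[] args) {
        ScanConfigSketch config = new FilteringBuilder()
            .pushFilters(Arrays.asList("x > 100"))
            .build();
        System.out.println(config.pushedFilters()); // prints [x > 100]
    }
}
```

Because the config is built once and never mutated, an implementation can safely cache any work derived from it across the statistics and split-planning calls.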
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558687#comment-16558687 ] Apache Spark commented on SPARK-21274: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/21886 > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL: > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as the following outer join: > {code} > SELECT a,b,c > FROM tab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register this second query as a temp view named "*t1_except_t2_df*", which > can also be used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as the following anti-join using the t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROM tab1 t1 > WHERE > NOT EXISTS > (SELECT 1 > FROM t1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) > ) > {code} > So the suggestion is just to use the above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL SQL set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
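As background for the rewrites above: EXCEPT ALL and INTERSECT ALL have multiset (bag) semantics, where each row on one side cancels at most one matching occurrence on the other. A minimal, self-contained sketch of that semantics (illustrative Java, not Spark's implementation):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Multiset semantics of EXCEPT ALL / INTERSECT ALL over plain lists.
final class BagOps {
    // EXCEPT ALL: each right-side row cancels one matching left-side row.
    static <T> List<T> exceptAll(List<T> left, List<T> right) {
        Map<T, Integer> rightCounts = new HashMap<>();
        for (T r : right) {
            rightCounts.merge(r, 1, Integer::sum);
        }
        List<T> out = new ArrayList<>();
        for (T l : left) {
            Integer c = rightCounts.get(l);
            if (c != null && c > 0) {
                rightCounts.put(l, c - 1); // cancel one occurrence
            } else {
                out.add(l);
            }
        }
        return out;
    }

    // INTERSECT ALL: keep min(countLeft, countRight) occurrences of each row.
    static <T> List<T> intersectAll(List<T> left, List<T> right) {
        Map<T, Integer> rightCounts = new HashMap<>();
        for (T r : right) {
            rightCounts.merge(r, 1, Integer::sum);
        }
        List<T> out = new ArrayList<>();
        for (T l : left) {
            Integer c = rightCounts.get(l);
            if (c != null && c > 0) {
                rightCounts.put(l, c - 1);
                out.add(l);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // {a, a, b} EXCEPT ALL {a} keeps one of the two a's.
        System.out.println(exceptAll(List.of("a", "a", "b"), List.of("a"))); // prints [a, b]
    }
}
```

This is why the join-based rewrite needs per-row occurrence counting rather than a plain existence check: duplicates on either side must be preserved up to the cancellation count.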
[jira] [Assigned] (SPARK-24926) Ensure numCores is used consistently in all netty configuration (driver and executors)
[ https://issues.apache.org/jira/browse/SPARK-24926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24926: Assignee: Apache Spark > Ensure numCores is used consistently in all netty configuration (driver and > executors) > -- > > Key: SPARK-24926 > URL: https://issues.apache.org/jira/browse/SPARK-24926 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Major > Labels: memory-analysis > > I think there may be some places where we're not passing the right number of > configured cores to netty -- in particular in driver mode, we're not > respecting "spark.driver.cores". This means that netty will be configured > based on the number of physical cores of the device, > instead of whatever resources spark requested from the cluster manager. It > controls both the number of threads netty uses *and* the number of arenas in > its memory pools. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24926) Ensure numCores is used consistently in all netty configuration (driver and executors)
[ https://issues.apache.org/jira/browse/SPARK-24926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24926: Assignee: (was: Apache Spark) > Ensure numCores is used consistently in all netty configuration (driver and > executors) > -- > > Key: SPARK-24926 > URL: https://issues.apache.org/jira/browse/SPARK-24926 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > I think there may be some places where we're not passing the right number of > configured cores to netty -- in particular in driver mode, we're not > respecting "spark.driver.cores". This means that netty will be configured > based on the number of physical cores of the device, > instead of whatever resources spark requested from the cluster manager. It > controls both the number of threads netty uses *and* the number of arenas in > its memory pools. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24926) Ensure numCores is used consistently in all netty configuration (driver and executors)
[ https://issues.apache.org/jira/browse/SPARK-24926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558688#comment-16558688 ] Apache Spark commented on SPARK-24926: -- User 'NiharS' has created a pull request for this issue: https://github.com/apache/spark/pull/21885 > Ensure numCores is used consistently in all netty configuration (driver and > executors) > -- > > Key: SPARK-24926 > URL: https://issues.apache.org/jira/browse/SPARK-24926 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > I think there may be some places where we're not passing the right number of > configured cores to netty -- in particular in driver mode, we're not > respecting "spark.driver.cores". This means that netty will be configured > based on the number of physical cores of the device, > instead of whatever resources spark requested from the cluster manager. It > controls both the number of threads netty uses *and* the number of arenas in > its memory pools. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
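The gist of the issue can be sketched with a hypothetical helper (not Spark code): the core count used to size netty thread pools and memory arenas should come from the requested configuration when present, and only fall back to the machine's physical core count otherwise.

```java
import java.util.Map;

// Illustrative sketch of the intended lookup order for netty sizing:
// configured cores (e.g. "spark.driver.cores") win over physical cores.
final class NettyCoreCount {
    static int effectiveCores(Map<String, String> conf, String key) {
        String v = conf.get(key);
        if (v != null) {
            return Integer.parseInt(v);
        }
        // Falling back to the physical core count of the host is exactly
        // the behavior the issue flags as wrong for driver mode, where the
        // cluster manager may have granted far fewer cores.
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println(effectiveCores(Map.of("spark.driver.cores", "2"), "spark.driver.cores")); // prints 2
    }
}
```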
[jira] [Assigned] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files
[ https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-24927: -- Assignee: Cheng Lian > The hadoop-provided profile doesn't play well with Snappy-compressed Parquet > files > -- > > Key: SPARK-24927 > URL: https://issues.apache.org/jira/browse/SPARK-24927 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.1, 2.3.2 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Major > > Reproduction: > {noformat} > wget > https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz > wget > https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz > tar xzf spark-2.3.1-bin-without-hadoop.tgz > tar xzf hadoop-2.7.3.tar.gz > export SPARK_DIST_CLASSPATH=$(hadoop-2.7.3/bin/hadoop classpath) > ./spark-2.3.1-bin-without-hadoop/bin/spark-shell --master local > ... > scala> > spark.range(1).repartition(1).write.mode("overwrite").parquet("file:///tmp/test.parquet") > {noformat} > Exception: > {noformat} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > at scala.Option.foreach(Option.scala:257) > at > 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194) > ... 69 more > Caused by: org.apache.spark.SparkException: Task failed while writing rows. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.UnsatisfiedLinkError: > org.xerial.snappy.SnappyNative.maxCompressedLength(I)I > at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method) > at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316) > at > org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67) > at > 
org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81) > at > org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92) > at > org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112) > at > org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93) > at > org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150) > at > org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238) > at > org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167) > at
[jira] [Commented] (SPARK-24882) separate responsibilities of the data source v2 read API
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558666#comment-16558666 ] Ryan Blue commented on SPARK-24882: --- [~cloud_fan], I'm adding some suggestions here because comments on the doc are good for discussion, but not really for longer content. I like the separation of the batch, micro-batch, and streaming classes. That works well. I also like the addition of the Metadata class, though I'd use a more specific name. There are a few naming changes I would make to be more specific and to preserve existing names or naming conventions: * Instead of ReaderProvider, I think we should use ReaderFactory because that name corresponds to the write path and accurately describes the class * I think we should continue to use InputPartition instead of InputSplit, even if we introduce a reader factory. (Probably also rename SplitReader to PartitionReader.) * Metadata isn't specific so I think we should use ScanConfig instead Next, instead of using mix-ins on Metadata / ScanConfig, I think a builder would help clarify the order of operations. If ScanConfig is mutable, then it could be passed to the other methods in different states. I'd rather use a Builder to make it Immutable. That way, implementations know that the Metadata / ScanConfig doesn't change between calls to estimateStatistics and getSplits so results can be cached. To make this work, Spark would provide a Builder interface with default methods that do nothing. To implement pushdown, users just need to implement the methods. This also allows us to add new pushdown methods (like pushLimit) without introducing new interfaces. I'd also like to see the classes reorganized a little to reduce the overall number of interfaces: Metadata / ScanConfig contains all of the state that the DataSourceReader used to hold. 
If the DataSourceReader has no state, then its methods should be provided by a single instance of the source instead. That would change the API to get rid of the Reader level and merge it into ReadSupport. Then ReadSupport would be used to create a ScanConfig, and BatchReadSupport (or similar) would be used to plan splits and get reader factories. I think this is easier for implementations.
{code:lang=java}
public interface ReadSupport {
  ScanConfig.Builder newScanBuilder();
}

public interface ReportsStatistics extends ReadSupport {
  Statistics estimateStatistics(ScanConfig config);
}

public interface BatchReadSupport extends ReadSupport {
  InputPartition[] getSplits(ScanConfig config);
  ReaderFactory readerFactory();
}

public interface MicroBatchReadSupport extends ReadSupport {
  InputPartition[] getSplits(ScanConfig config, Offset start, Offset end);
  Offset initialOffset();
  MicroBatchReaderFactory readerFactory();
}

public interface ContinuousReadSupport extends ReadSupport {
  InputPartition[] getSplits(ScanConfig config, Offset start);
  Offset initialOffset();
  ContinuousReaderFactory readerFactory();
}
{code}
Note that this change also cleans up the confusion around the use of Reader: the only Reader is a SplitReader that returns rows or row batches. I would keep the same structure that you have for micro-batch, continuous, and batch ReaderFactory and SplitReader. Here's a sketch of the ScanConfig and Builder I mentioned above:
{code:lang=java}
public interface ScanConfig {
  StructType schema();
  Filter[] pushedFilters();
  Expression[] pushedPredicates();

  // by default, the Builder doesn't push anything
  public interface Builder {
    Builder pushProjection(...);
    Builder pushFilters(...);
    default Builder pushPredicates(...) { return this; }
    Builder pushLimit(...);
    ScanConfig build();
  }
}
{code}
 > separate responsibilities of the data source v2 read API > > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > Data source V2 has been out for a while; see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 read API. For details, please see the attached google doc: > https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) --
[jira] [Commented] (SPARK-23683) FileCommitProtocol.instantiate to require 3-arg constructor for dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-23683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558660#comment-16558660 ] Steve Loughran commented on SPARK-23683: If it's a regression, you could argue for it > FileCommitProtocol.instantiate to require 3-arg constructor for dynamic > partition overwrite > --- > > Key: SPARK-23683 > URL: https://issues.apache.org/jira/browse/SPARK-23683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Major > Fix For: 2.4.0 > > > with SPARK-20236 {{FileCommitProtocol.instantiate()}} looks for a three > argument constructor, passing in the {{dynamicPartitionOverwrite}} parameter. > If there is no such constructor, it falls back to the classic two-arg one. > When {{InsertIntoHadoopFsRelationCommand}} passes down that > {{dynamicPartitionOverwrite}} flag to {{FileCommitProtocol.instantiate()}}, > it _assumes_ that the instantiated protocol supports the specific > requirements of dynamic partition overwrite. It does not notice when this > does not hold, and so the output generated may be incorrect. > Proposed: when dynamicPartitionOverwrite == true, require the protocol > implementation to have a 3-arg constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
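The constructor-selection logic described in the issue can be sketched with plain reflection. The names below are illustrative, not Spark's actual FileCommitProtocol code: try the three-argument constructor carrying the dynamicPartitionOverwrite flag, and only fall back to the two-argument form when the flag is false; when the flag is true, a missing three-argument constructor is an error rather than a silent downgrade.

```java
import java.lang.reflect.Constructor;

// Hypothetical sketch of "require a 3-arg constructor when
// dynamicPartitionOverwrite is requested".
final class CommitProtocolLoader {
    static Object instantiate(Class<?> clazz, String jobId, String path,
                              boolean dynamicPartitionOverwrite) throws Exception {
        try {
            Constructor<?> threeArg =
                clazz.getDeclaredConstructor(String.class, String.class, boolean.class);
            return threeArg.newInstance(jobId, path, dynamicPartitionOverwrite);
        } catch (NoSuchMethodException e) {
            if (dynamicPartitionOverwrite) {
                // The class cannot honor dynamic partition overwrite:
                // failing fast beats silently producing incorrect output.
                throw new IllegalArgumentException(
                    clazz + " lacks the 3-arg constructor required for dynamic partition overwrite");
            }
            Constructor<?> twoArg = clazz.getDeclaredConstructor(String.class, String.class);
            return twoArg.newInstance(jobId, path);
        }
    }

    public static void main(String[] args) throws Exception {
        // java.util.Locale merely stands in for a commit-protocol class
        // that only offers a (String, String) constructor.
        Object o = instantiate(java.util.Locale.class, "en", "US", false);
        System.out.println(o.getClass().getSimpleName()); // prints Locale
    }
}
```

This mirrors the proposed behavior: the fallback path remains available for ordinary writes, but requesting the flag without a constructor that can receive it is surfaced as an explicit failure.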
[jira] [Updated] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files
[ https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-24927: --- Description: Reproduction: {noformat} wget https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz tar xzf spark-2.3.1-bin-without-hadoop.tgz tar xzf hadoop-2.7.3.tar.gz export SPARK_DIST_CLASSPATH=$(hadoop-2.7.3/bin/hadoop classpath) ./spark-2.3.1-bin-without-hadoop/bin/spark-shell --master local ... scala> spark.range(1).repartition(1).write.mode("overwrite").parquet("file:///tmp/test.parquet") {noformat} Exception: {noformat} Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194) ... 69 more Caused by: org.apache.spark.SparkException: Task failed while writing rows. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method) at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316) at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67) at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81) at org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92) at org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112) at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93) at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150) at org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238) at 
org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121) at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167) at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109) at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:396) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFo
[jira] [Updated] (SPARK-24934) Complex type and binary type in in-memory partition pruning does not work due to missing upper/lower bounds cases
[ https://issues.apache.org/jira/browse/SPARK-24934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24934: - Summary: Complex type and binary type in in-memory partition pruning does not work due to missing upper/lower bounds cases (was: Should handle missing upper/lower bounds cases in in-memory partition pruning) > Complex type and binary type in in-memory partition pruning does not work due > to missing upper/lower bounds cases > - > > Key: SPARK-24934 > URL: https://issues.apache.org/jira/browse/SPARK-24934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > For example, if an array column is used (where the lower and upper bounds for its > column batch are {{null}}), it looks like it wrongly filters all data out: > {code} > scala> import org.apache.spark.sql.functions > import org.apache.spark.sql.functions > scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") > df: org.apache.spark.sql.DataFrame = [arrayCol: array<string>] > scala> > df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b")))).show() > ++ > |arrayCol| > ++ > | [a, b]| > ++ > scala> > df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b")))).show() > ++ > |arrayCol| > ++ > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
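The underlying pitfall can be reduced to a few lines (an illustrative sketch, not Spark internals): when a cached column batch has no min/max statistics, as for complex and binary types, pruning must conservatively keep the batch rather than treat the comparison as false.

```java
// Sketch of stats-based batch pruning: null bounds mean "unknown",
// so the only safe answer is "the batch might match, keep it".
// Treating null bounds as a failed comparison filters every batch
// out, which is the bug described above.
final class BatchPruning {
    // Returns true if a batch with the given bounds may contain rows
    // whose column value equals target.
    static boolean mightContain(Integer lower, Integer upper, int target) {
        if (lower == null || upper == null) {
            return true; // unknown bounds: cannot prune safely
        }
        return lower <= target && target <= upper;
    }

    public static void main(String[] args) {
        System.out.println(mightContain(null, null, 5)); // prints true
        System.out.println(mightContain(1, 3, 5));       // prints false
    }
}
```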
[jira] [Assigned] (SPARK-24937) Datasource partition table should load empty partitions
[ https://issues.apache.org/jira/browse/SPARK-24937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24937: Assignee: (was: Apache Spark) > Datasource partition table should load empty partitions > --- > > Key: SPARK-24937 > URL: https://issues.apache.org/jira/browse/SPARK-24937 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:sql} > spark-sql> CREATE TABLE tbl AS SELECT 1; > spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > > USING parquet > > PARTITIONED BY (day, hour); > spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > spark-sql> SHOW PARTITIONS tbl1; > spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > > PARTITIONED BY (day STRING, hour STRING); > 18/07/26 22:49:20 WARN HiveMetaStore: Location: > file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external > table:tbl2 > spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 > 18/07/26 22:49:36 WARN log: Updated size to 0 > spark-sql> SHOW PARTITIONS tbl2; > day=2018-07-25/hour=01 > spark-sql> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24937) Datasource partition table should load empty partitions
[ https://issues.apache.org/jira/browse/SPARK-24937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24937: Assignee: Apache Spark > Datasource partition table should load empty partitions > --- > > Key: SPARK-24937 > URL: https://issues.apache.org/jira/browse/SPARK-24937 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > How to reproduce: > {code:sql} > spark-sql> CREATE TABLE tbl AS SELECT 1; > spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > > USING parquet > > PARTITIONED BY (day, hour); > spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > spark-sql> SHOW PARTITIONS tbl1; > spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > > PARTITIONED BY (day STRING, hour STRING); > 18/07/26 22:49:20 WARN HiveMetaStore: Location: > file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external > table:tbl2 > spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 > 18/07/26 22:49:36 WARN log: Updated size to 0 > spark-sql> SHOW PARTITIONS tbl2; > day=2018-07-25/hour=01 > spark-sql> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24937) Datasource partition table should load empty partitions
[ https://issues.apache.org/jira/browse/SPARK-24937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558438#comment-16558438 ] Apache Spark commented on SPARK-24937: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/21883 > Datasource partition table should load empty partitions > --- > > Key: SPARK-24937 > URL: https://issues.apache.org/jira/browse/SPARK-24937 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:sql} > spark-sql> CREATE TABLE tbl AS SELECT 1; > spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > > USING parquet > > PARTITIONED BY (day, hour); > spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > spark-sql> SHOW PARTITIONS tbl1; > spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > > PARTITIONED BY (day STRING, hour STRING); > 18/07/26 22:49:20 WARN HiveMetaStore: Location: > file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external > table:tbl2 > spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 > 18/07/26 22:49:36 WARN log: Updated size to 0 > spark-sql> SHOW PARTITIONS tbl2; > day=2018-07-25/hour=01 > spark-sql> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24937) Datasource partition table should load empty partitions
[ https://issues.apache.org/jira/browse/SPARK-24937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24937: Description: How to reproduce: {code:sql} spark-sql> CREATE TABLE tbl AS SELECT 1; spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > USING parquet > PARTITIONED BY (day, hour); spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') SELECT * FROM tbl where 1=0; spark-sql> SHOW PARTITIONS tbl1; spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > PARTITIONED BY (day STRING, hour STRING); 18/07/26 22:49:20 WARN HiveMetaStore: Location: file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external table:tbl2 spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') SELECT * FROM tbl where 1=0; 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 18/07/26 22:49:36 WARN log: Updated size to 0 spark-sql> SHOW PARTITIONS tbl2; day=2018-07-25/hour=01 spark-sql> {code} was: How to reproduce: {code:sql} spark-sql> CREATE TABLE tbl AS SELECT 1; 18/07/26 22:48:11 WARN HiveMetaStore: Location: file:/Users/yumwang/tmp/spark/spark-warehouse/tbl specified for non-external table:tbl 18/07/26 22:48:15 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > USING parquet > PARTITIONED BY (day, hour); spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') SELECT * FROM tbl where 1=0; spark-sql> SHOW PARTITIONS tbl1; spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > PARTITIONED BY (day STRING, hour STRING); 18/07/26 22:49:20 WARN HiveMetaStore: Location: file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external table:tbl2 spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') SELECT * FROM tbl where 1=0; 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 18/07/26 22:49:36 WARN log: Updated 
size to 0 spark-sql> SHOW PARTITIONS tbl2; day=2018-07-25/hour=01 spark-sql> {code} > Datasource partition table should load empty partitions > --- > > Key: SPARK-24937 > URL: https://issues.apache.org/jira/browse/SPARK-24937 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:sql} > spark-sql> CREATE TABLE tbl AS SELECT 1; > spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > > USING parquet > > PARTITIONED BY (day, hour); > spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > spark-sql> SHOW PARTITIONS tbl1; > spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > > PARTITIONED BY (day STRING, hour STRING); > 18/07/26 22:49:20 WARN HiveMetaStore: Location: > file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external > table:tbl2 > spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 > 18/07/26 22:49:36 WARN log: Updated size to 0 > spark-sql> SHOW PARTITIONS tbl2; > day=2018-07-25/hour=01 > spark-sql> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24937) Datasource partition table should load empty partitions
[ https://issues.apache.org/jira/browse/SPARK-24937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558386#comment-16558386 ] Yuming Wang commented on SPARK-24937: - I'm working on it. > Datasource partition table should load empty partitions > --- > > Key: SPARK-24937 > URL: https://issues.apache.org/jira/browse/SPARK-24937 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:sql} > spark-sql> CREATE TABLE tbl AS SELECT 1; > 18/07/26 22:48:11 WARN HiveMetaStore: Location: > file:/Users/yumwang/tmp/spark/spark-warehouse/tbl specified for non-external > table:tbl > 18/07/26 22:48:15 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > > USING parquet > > PARTITIONED BY (day, hour); > spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > spark-sql> SHOW PARTITIONS tbl1; > spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > > PARTITIONED BY (day STRING, hour STRING); > 18/07/26 22:49:20 WARN HiveMetaStore: Location: > file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external > table:tbl2 > spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 > 18/07/26 22:49:36 WARN log: Updated size to 0 > spark-sql> SHOW PARTITIONS tbl2; > day=2018-07-25/hour=01 > spark-sql> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24937) Datasource partition table should load empty partitions
Yuming Wang created SPARK-24937: --- Summary: Datasource partition table should load empty partitions Key: SPARK-24937 URL: https://issues.apache.org/jira/browse/SPARK-24937 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Yuming Wang {code:sql} spark-sql> CREATE TABLE tbl AS SELECT 1; 18/07/26 22:48:11 WARN HiveMetaStore: Location: file:/Users/yumwang/tmp/spark/spark-warehouse/tbl specified for non-external table:tbl 18/07/26 22:48:15 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > USING parquet > PARTITIONED BY (day, hour); spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') SELECT * FROM tbl where 1=0; spark-sql> SHOW PARTITIONS tbl1; spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > PARTITIONED BY (day STRING, hour STRING); 18/07/26 22:49:20 WARN HiveMetaStore: Location: file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external table:tbl2 spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') SELECT * FROM tbl where 1=0; 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 18/07/26 22:49:36 WARN log: Updated size to 0 spark-sql> SHOW PARTITIONS tbl2; day=2018-07-25/hour=01 spark-sql> {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24937) Datasource partition table should load empty partitions
[ https://issues.apache.org/jira/browse/SPARK-24937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24937: Description: How to reproduce: {code:sql} spark-sql> CREATE TABLE tbl AS SELECT 1; 18/07/26 22:48:11 WARN HiveMetaStore: Location: file:/Users/yumwang/tmp/spark/spark-warehouse/tbl specified for non-external table:tbl 18/07/26 22:48:15 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > USING parquet > PARTITIONED BY (day, hour); spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') SELECT * FROM tbl where 1=0; spark-sql> SHOW PARTITIONS tbl1; spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > PARTITIONED BY (day STRING, hour STRING); 18/07/26 22:49:20 WARN HiveMetaStore: Location: file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external table:tbl2 spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') SELECT * FROM tbl where 1=0; 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 18/07/26 22:49:36 WARN log: Updated size to 0 spark-sql> SHOW PARTITIONS tbl2; day=2018-07-25/hour=01 spark-sql> {code} was: {code:sql} spark-sql> CREATE TABLE tbl AS SELECT 1; 18/07/26 22:48:11 WARN HiveMetaStore: Location: file:/Users/yumwang/tmp/spark/spark-warehouse/tbl specified for non-external table:tbl 18/07/26 22:48:15 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > USING parquet > PARTITIONED BY (day, hour); spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') SELECT * FROM tbl where 1=0; spark-sql> SHOW PARTITIONS tbl1; spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > PARTITIONED BY (day STRING, hour STRING); 18/07/26 22:49:20 WARN HiveMetaStore: Location: file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external 
table:tbl2 spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') SELECT * FROM tbl where 1=0; 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 18/07/26 22:49:36 WARN log: Updated size to 0 spark-sql> SHOW PARTITIONS tbl2; day=2018-07-25/hour=01 spark-sql> {code} > Datasource partition table should load empty partitions > --- > > Key: SPARK-24937 > URL: https://issues.apache.org/jira/browse/SPARK-24937 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:sql} > spark-sql> CREATE TABLE tbl AS SELECT 1; > 18/07/26 22:48:11 WARN HiveMetaStore: Location: > file:/Users/yumwang/tmp/spark/spark-warehouse/tbl specified for non-external > table:tbl > 18/07/26 22:48:15 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > spark-sql> CREATE TABLE tbl1 (c1 BIGINT, day STRING, hour STRING) > > USING parquet > > PARTITIONED BY (day, hour); > spark-sql> INSERT INTO TABLE tbl1 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > spark-sql> SHOW PARTITIONS tbl1; > spark-sql> CREATE TABLE tbl2 (c1 BIGINT) > > PARTITIONED BY (day STRING, hour STRING); > 18/07/26 22:49:20 WARN HiveMetaStore: Location: > file:/Users/yumwang/tmp/spark/spark-warehouse/tbl2 specified for non-external > table:tbl2 > spark-sql> INSERT INTO TABLE tbl2 PARTITION (day = '2018-07-25', hour='01') > SELECT * FROM tbl where 1=0; > 18/07/26 22:49:36 WARN log: Updating partition stats fast for: tbl2 > 18/07/26 22:49:36 WARN log: Updated size to 0 > spark-sql> SHOW PARTITIONS tbl2; > day=2018-07-25/hour=01 > spark-sql> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24934) Should handle missing upper/lower bounds cases in in-memory partition pruning
[ https://issues.apache.org/jira/browse/SPARK-24934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558364#comment-16558364 ] Hyukjin Kwon commented on SPARK-24934: -- np! BTW, the workaround will be turning off {{spark.sql.inMemoryColumnarStorage.partitionPruning}} although it'd be less performant. > Should handle missing upper/lower bounds cases in in-memory partition pruning > - > > Key: SPARK-24934 > URL: https://issues.apache.org/jira/browse/SPARK-24934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > For example, if array is used (where the lower and upper bounds for its > column batch are {{null}})), it looks wrongly filtering all data out: > {code} > scala> import org.apache.spark.sql.functions > import org.apache.spark.sql.functions > scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") > df: org.apache.spark.sql.DataFrame = [arrayCol: array] > scala> > df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > | [a, b]| > ++ > scala> > df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
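The bug and the workaround discussed above come down to how min/max partition pruning treats cached column batches that carry no statistics. Below is a minimal sketch in plain Python (not Spark's actual code; `batch_may_match` is a hypothetical helper) of the safe behaviour:

```python
# Illustrative sketch: min/max pruning over cached column batches.
# For types such as arrays, no ordering-based bounds are collected,
# so a batch's stats may be missing entirely.

def batch_may_match(stats, value):
    """Return True if a batch might contain `value`.

    `stats` is a (lower, upper) tuple, or None when bounds are unavailable.
    The safe behaviour is to keep the batch when bounds are missing and let
    the filter re-check each row; pruning on missing bounds drops valid data.
    """
    if stats is None:
        return True  # missing bounds => cannot prune, must scan the batch
    lower, upper = stats
    return lower <= value <= upper

# A batch of array-typed rows carries no bounds, so it must be kept:
assert batch_may_match(None, ["a", "b"]) is True
# A numeric batch with bounds (0, 9) can safely prune a lookup for 42:
assert batch_may_match((0, 9), 42) is False
```

The reported bug corresponds to the unsafe variant: treating missing bounds as "no match" prunes every array-typed batch, which is why the cached query above returns no rows.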
[jira] [Created] (SPARK-24936) Better error message when trying a shuffle fetch over 2 GB
Imran Rashid created SPARK-24936: Summary: Better error message when trying a shuffle fetch over 2 GB Key: SPARK-24936 URL: https://issues.apache.org/jira/browse/SPARK-24936 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 2.4.0 Reporter: Imran Rashid After SPARK-24297, Spark will try to fetch shuffle blocks to disk if they're over 2 GB. However, this will fail with an external shuffle service running Spark < 2.2, with an unhelpful error message like: {noformat} 18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 (TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1 , xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message= org.apache.spark.shuffle.FetchFailedException: java.lang.UnsupportedOperationException at org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60) at org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136) ... {noformat} We can't do anything to make the shuffle succeed in this situation, but we should fail with a better error message. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24934) Should handle missing upper/lower bounds cases in in-memory partition pruning
[ https://issues.apache.org/jira/browse/SPARK-24934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558360#comment-16558360 ] David Vogelbacher commented on SPARK-24934: --- Thanks for opening and making the pr [~hyukjin.kwon]! > Should handle missing upper/lower bounds cases in in-memory partition pruning > - > > Key: SPARK-24934 > URL: https://issues.apache.org/jira/browse/SPARK-24934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > For example, if array is used (where the lower and upper bounds for its > column batch are {{null}})), it looks wrongly filtering all data out: > {code} > scala> import org.apache.spark.sql.functions > import org.apache.spark.sql.functions > scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") > df: org.apache.spark.sql.DataFrame = [arrayCol: array] > scala> > df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > | [a, b]| > ++ > scala> > df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24935) Problem with Executing Hive UDF's from Spark 2.2 Onwards
Parth Gandhi created SPARK-24935: Summary: Problem with Executing Hive UDF's from Spark 2.2 Onwards Key: SPARK-24935 URL: https://issues.apache.org/jira/browse/SPARK-24935 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1, 2.2.0 Reporter: Parth Gandhi A user of the sketches library (https://github.com/DataSketches/sketches-hive) reported an issue with the HLL Sketch Hive UDAF that seems to be a bug in Spark or Hive. Their code runs fine in 2.1 but has an issue from 2.2 onwards. For more details on the issue, you can refer to the discussion in the sketches-user list: [https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ] On further debugging, we figured out that from 2.2 onwards, Spark's Hive UDAF support provides partial aggregation and has removed the functionality that supported complete-mode aggregation (refer to https://issues.apache.org/jira/browse/SPARK-19060 and https://issues.apache.org/jira/browse/SPARK-18186). Thus, instead of the expected update method being called, the merge method is called here ([https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56]), which throws the exception described in the forums above. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
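The mode switch described above can be sketched in plain Python (illustrative only; this is not the Hive or Spark API, and `SumEvaluator` is a stand-in for the sketch evaluator). In complete mode the engine feeds raw rows to `update()`; with partial aggregation, the map side calls `update()` and the reduce side receives only partial buffers via `merge()`. An evaluator whose `merge()` assumes a different input shape than `update()` breaks when the engine changes modes:

```python
# Illustrative sketch of complete-mode vs partial aggregation.

class SumEvaluator:
    def __init__(self):
        self.buffer = 0
    def update(self, row):      # consumes a raw input row
        self.buffer += row
    def merge(self, partial):   # consumes another evaluator's partial buffer
        self.buffer += partial

def complete_mode(rows):
    # One evaluator sees every raw row; merge() is never called.
    ev = SumEvaluator()
    for r in rows:
        ev.update(r)
    return ev.buffer

def partial_mode(rows, n_partitions=2):
    # Map side: one evaluator per partition produces a partial buffer.
    chunks = [rows[i::n_partitions] for i in range(n_partitions)]
    partials = []
    for chunk in chunks:
        ev = SumEvaluator()
        for r in chunk:
            ev.update(r)
        partials.append(ev.buffer)
    # Reduce side: only merge() is called, never update().
    final = SumEvaluator()
    for p in partials:
        final.merge(p)
    return final.buffer

assert complete_mode([1, 2, 3, 4]) == partial_mode([1, 2, 3, 4]) == 10
```

For a simple sum the two paths agree, but an evaluator whose `merge()` expects a serialized sketch rather than raw values (as in the UDAF above) throws once the engine starts routing data through `merge()`.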
[jira] [Commented] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558331#comment-16558331 ] Thomas Graves commented on SPARK-24918: --- I think this is a good idea. I thought I had seen a Jira around this before but couldn't find it. It might have been a task-run pre-hook. It's also a good question what we tie into it. I haven't looked at the details of your spark-memory debugging module; what does that class need? I could see people doing all sorts of things, from checking node health to preloading something, etc., so we should definitely think about the possibilities here and what we may or may not want to allow. > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. It's hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so you don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your Spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding Spark). > I think one tricky part here is just deciding the API, and how it's versioned. > Does it just get created when the executor starts, and that's it?
Or does it > get more specific events, like task start, task end, etc.? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, it's still good enough). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
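The "base class with no-op implementations" idea floated in the discussion above can be sketched as follows. This is plain Python with hypothetical names, not Spark's JVM-side API; the point is only the versioning property: because every hook defaults to a no-op, adding a new event later does not break existing plugins.

```python
# Illustrative sketch of an executor plugin base class (hypothetical API).

class ExecutorPlugin:
    """All hooks default to no-ops; subclasses override only what they need."""
    def init(self):                 # called once when the executor starts
        pass
    def task_start(self, task_id):  # per-task event, added without breaking anyone
        pass
    def task_end(self, task_id):
        pass
    def shutdown(self):             # called when the executor exits
        pass

class TaskCounter(ExecutorPlugin):
    """Example plugin: overrides only the hook it cares about."""
    def __init__(self):
        self.started = 0
    def task_start(self, task_id):
        self.started += 1

plugin = TaskCounter()
plugin.init()          # inherited no-op, safe for the executor to call
plugin.task_start(1)
plugin.task_start(2)
plugin.shutdown()      # inherited no-op
assert plugin.started == 2
```

The same effect could be achieved by explicitly versioning the interface instead, at the cost of more ceremony for plugin authors.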
[jira] [Assigned] (SPARK-24934) Should handle missing upper/lower bounds cases in in-memory partition pruning
[ https://issues.apache.org/jira/browse/SPARK-24934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24934: Assignee: Apache Spark > Should handle missing upper/lower bounds cases in in-memory partition pruning > - > > Key: SPARK-24934 > URL: https://issues.apache.org/jira/browse/SPARK-24934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > For example, if array is used (where the lower and upper bounds for its > column batch are {{null}})), it looks wrongly filtering all data out: > {code} > scala> import org.apache.spark.sql.functions > import org.apache.spark.sql.functions > scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") > df: org.apache.spark.sql.DataFrame = [arrayCol: array] > scala> > df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > | [a, b]| > ++ > scala> > df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24934) Should handle missing upper/lower bounds cases in in-memory partition pruning
[ https://issues.apache.org/jira/browse/SPARK-24934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24934: Assignee: (was: Apache Spark) > Should handle missing upper/lower bounds cases in in-memory partition pruning > - > > Key: SPARK-24934 > URL: https://issues.apache.org/jira/browse/SPARK-24934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > For example, if array is used (where the lower and upper bounds for its > column batch are {{null}})), it looks wrongly filtering all data out: > {code} > scala> import org.apache.spark.sql.functions > import org.apache.spark.sql.functions > scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") > df: org.apache.spark.sql.DataFrame = [arrayCol: array] > scala> > df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > | [a, b]| > ++ > scala> > df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24934) Should handle missing upper/lower bounds cases in in-memory partition pruning
[ https://issues.apache.org/jira/browse/SPARK-24934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558295#comment-16558295 ] Apache Spark commented on SPARK-24934: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/21882 > Should handle missing upper/lower bounds cases in in-memory partition pruning > - > > Key: SPARK-24934 > URL: https://issues.apache.org/jira/browse/SPARK-24934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > For example, if array is used (where the lower and upper bounds for its > column batch are {{null}})), it looks wrongly filtering all data out: > {code} > scala> import org.apache.spark.sql.functions > import org.apache.spark.sql.functions > scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") > df: org.apache.spark.sql.DataFrame = [arrayCol: array] > scala> > df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > | [a, b]| > ++ > scala> > df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), > functions.lit("b".show() > ++ > |arrayCol| > ++ > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24934) Should handle missing upper/lower bounds cases in in-memory partition pruning
Hyukjin Kwon created SPARK-24934: Summary: Should handle missing upper/lower bounds cases in in-memory partition pruning Key: SPARK-24934 URL: https://issues.apache.org/jira/browse/SPARK-24934 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Hyukjin Kwon For example, if an array is used (where the lower and upper bounds for its column batch are {{null}}), it looks like it wrongly filters all data out: {code} scala> import org.apache.spark.sql.functions import org.apache.spark.sql.functions scala> val df = Seq(Array("a", "b"), Array("c", "d")).toDF("arrayCol") df: org.apache.spark.sql.DataFrame = [arrayCol: array<string>] scala> df.filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), functions.lit("b")))).show() +--------+ |arrayCol| +--------+ | [a, b]| +--------+ scala> df.cache().filter(df.col("arrayCol").eqNullSafe(functions.array(functions.lit("a"), functions.lit("b")))).show() +--------+ |arrayCol| +--------+ +--------+ {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12911) Cacheing a dataframe causes array comparisons to fail (in filter / where) after 1.6
[ https://issues.apache.org/jira/browse/SPARK-12911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558291#comment-16558291 ] Hyukjin Kwon commented on SPARK-12911: -- I opened - SPARK-24934 > Cacheing a dataframe causes array comparisons to fail (in filter / where) > after 1.6 > --- > > Key: SPARK-12911 > URL: https://issues.apache.org/jira/browse/SPARK-12911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.6.0 >Reporter: Jesse English >Priority: Major > > When doing a *where* operation on a dataframe and testing for equality on an > array type, after 1.6 no valid comparisons are made if the dataframe has been > cached. If it has not been cached, the results are as expected. > This appears to be related to the underlying unsafe array data types. > {code:title=test.scala|borderStyle=solid} > test("test array comparison") { > val vectors: Vector[Row] = Vector( > Row.fromTuple("id_1" -> Array(0L, 2L)), > Row.fromTuple("id_2" -> Array(0L, 5L)), > Row.fromTuple("id_3" -> Array(0L, 9L)), > Row.fromTuple("id_4" -> Array(1L, 0L)), > Row.fromTuple("id_5" -> Array(1L, 8L)), > Row.fromTuple("id_6" -> Array(2L, 4L)), > Row.fromTuple("id_7" -> Array(5L, 6L)), > Row.fromTuple("id_8" -> Array(6L, 2L)), > Row.fromTuple("id_9" -> Array(7L, 0L)) > ) > val data: RDD[Row] = sc.parallelize(vectors, 3) > val schema = StructType( > StructField("id", StringType, false) :: > StructField("point", DataTypes.createArrayType(LongType, false), > false) :: > Nil > ) > val sqlContext = new SQLContext(sc) > val dataframe = sqlContext.createDataFrame(data, schema) > val targetPoint:Array[Long] = Array(0L,9L) > //Cacheing is the trigger to cause the error (no cacheing causes no error) > dataframe.cache() > //This is the line where it fails > //java.util.NoSuchElementException: next on empty iterator > //However we know that there is a valid match > val targetRow = 
dataframe.where(dataframe("point") === > array(targetPoint.map(value => lit(value)): _*)).first() > assert(targetRow != null) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24928) spark sql cross join running time too long
[ https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558287#comment-16558287 ] Marco Gaido commented on SPARK-24928: - The affected version is pretty old; can you check a newer version? > spark sql cross join running time too long > -- > > Key: SPARK-24928 > URL: https://issues.apache.org/jira/browse/SPARK-24928 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 1.6.2 >Reporter: LIFULONG >Priority: Minor > > Spark SQL running time is too long even though the input left and right tables are > small HDFS text-format data. > The SQL is: select * from t1 cross join t2 > t1 has 49 lines and three columns > t2 has 1 line and one column only > It runs for more than 30 minutes and then fails. > > > Spark's CartesianRDD also has the same problem; example test code is: > val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b") // 1 line, > 1 column > val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b") // 49 > lines, 3 columns > val cartesian = new CartesianRDD(sc, twos, ones) > cartesian.count() > This runs for more than 5 minutes, while CartesianRDD(sc, ones, twos) takes > less than 10 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
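One plausible explanation for the asymmetry reported above is that a nested-loop cartesian product re-scans the inner dataset once per outer element, so with an expensive source such as HDFS, putting the larger dataset on the outer side multiplies the number of inner re-reads. A plain-Python sketch of the effect (illustrative only, not Spark's CartesianRDD implementation):

```python
# Illustrative sketch: a nested-loop cartesian product re-opens the inner
# source once per outer element. Swapping the sides changes how many times
# the (potentially expensive) inner read happens.

def cartesian(outer, read_inner, counter):
    out = []
    for o in outer:
        counter["inner_scans"] += 1  # inner source re-read for each outer row
        for i in read_inner():
            out.append((o, i))
    return out

small = [1]              # 1 row, like the 1-line file
big = list(range(49))    # 49 rows, like the 49-line file

c1 = {"inner_scans": 0}
cartesian(big, lambda: small, c1)   # big dataset on the outer side
c2 = {"inner_scans": 0}
cartesian(small, lambda: big, c2)   # small dataset on the outer side

assert c1["inner_scans"] == 49      # inner source opened 49 times
assert c2["inner_scans"] == 1       # inner source opened once
```

Both orderings produce the same 49 result pairs, but if each inner scan involves reopening an HDFS file, the first ordering pays that cost 49 times.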
[jira] [Created] (SPARK-24933) SinkProgress should report written rows
Vaclav Kosar created SPARK-24933: Summary: SinkProgress should report written rows Key: SPARK-24933 URL: https://issues.apache.org/jira/browse/SPARK-24933 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.3.1 Reporter: Vaclav Kosar SinkProgress should report properties similar to SourceProgress, as long as they are available for a given Sink. The count of written rows is a metric available for all Sinks. Since the relevant progress information is with respect to committed rows, the ideal object to carry this info is WriterCommitMessage. For brevity, the implementation will focus only on Sinks with API V2 and on Micro-Batch mode. Implementation for Continuous mode will be provided at a later date. h4. Before {code} {"description":"org.apache.spark.sql.kafka010.KafkaSourceProvider@3c0bd317"} {code} h4. After {code} {"description":"org.apache.spark.sql.kafka010.KafkaSourceProvider@3c0bd317","numOutputRows":5000} {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
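The proposal above can be sketched roughly as follows. This is plain Python with hypothetical names, not Spark's API; in particular, the `num_rows` field is an assumption for illustration (the actual design would have each write task's commit message carry its committed row count, which the driver then sums into the reported sink progress):

```python
# Illustrative sketch: summing per-task commit messages into sink progress.
from dataclasses import dataclass

@dataclass
class WriterCommitMessage:
    num_rows: int  # rows committed by one write task (hypothetical field)

def sink_progress(description, commit_messages):
    """Build the progress dict exposed for a sink after a micro-batch."""
    return {
        "description": description,
        "numOutputRows": sum(m.num_rows for m in commit_messages),
    }

# Two write tasks commit 2000 and 3000 rows; the sink reports 5000 total,
# matching the "After" example above.
msgs = [WriterCommitMessage(2000), WriterCommitMessage(3000)]
progress = sink_progress("KafkaSourceProvider@3c0bd317", msgs)
assert progress["numOutputRows"] == 5000
```

Carrying the count in the commit message keeps the metric tied to committed rows only, so rows written by failed or speculative tasks are never counted.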
[jira] [Created] (SPARK-24932) Allow update mode for streaming queries with join
Eric Fu created SPARK-24932: --- Summary: Allow update mode for streaming queries with join Key: SPARK-24932 URL: https://issues.apache.org/jira/browse/SPARK-24932 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.3.1, 2.3.0 Reporter: Eric Fu In issue SPARK-19140 we supported update output mode for non-aggregation streaming queries. This should also be applied to streaming joins to keep the semantics consistent. PS. The streaming join feature was added after SPARK-19140.
[jira] [Updated] (SPARK-24931) CoarseGrainedExecutorBackend send wrong 'Reason' when executor exits which leading to job failed.
[ https://issues.apache.org/jira/browse/SPARK-24931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ice bai updated SPARK-24931: Priority: Major (was: Blocker) > CoarseGrainedExecutorBackend send wrong 'Reason' when executor exits which > leading to job failed. > - > > Key: SPARK-24931 > URL: https://issues.apache.org/jira/browse/SPARK-24931 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: ice bai >Priority: Major > > When an executor is lost for some reason (e.g. unable to register with the external > shuffle server), CoarseGrainedExecutorBackend will send a RemoveExecutor event > with an 'ExecutorLossReason'. But this will cause TaskSetManager to run > handleFailedTask with exitCausedByApp=true, which is not correct.
[jira] [Updated] (SPARK-24931) CoarseGrainedExecutorBackend send wrong 'Reason' when executor exits which leading to job failed.
[ https://issues.apache.org/jira/browse/SPARK-24931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ice bai updated SPARK-24931: Summary: CoarseGrainedExecutorBackend send wrong 'Reason' when executor exits which leading to job failed. (was: ExecutorBackend send wrong Reason when executor exits) > CoarseGrainedExecutorBackend send wrong 'Reason' when executor exits which > leading to job failed. > - > > Key: SPARK-24931 > URL: https://issues.apache.org/jira/browse/SPARK-24931 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: ice bai >Priority: Blocker > > When an executor is lost for some reason (e.g. unable to register with the external > shuffle server), CoarseGrainedExecutorBackend will send a RemoveExecutor event > with an 'ExecutorLossReason'. But this will cause TaskSetManager to run > handleFailedTask with exitCausedByApp=true, which is not correct.
[jira] [Created] (SPARK-24931) ExecutorBackend send wrong Reason when executor exits
ice bai created SPARK-24931: --- Summary: ExecutorBackend send wrong Reason when executor exits Key: SPARK-24931 URL: https://issues.apache.org/jira/browse/SPARK-24931 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: ice bai When an executor is lost for some reason (e.g. unable to register with the external shuffle server), CoarseGrainedExecutorBackend will send a RemoveExecutor event with an 'ExecutorLossReason'. But this will cause TaskSetManager to run handleFailedTask with exitCausedByApp=true, which is not correct.
[jira] [Updated] (SPARK-24527) select column alias should support quotation marks
[ https://issues.apache.org/jira/browse/SPARK-24527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ice bai updated SPARK-24527: Description: It fails when a user uses spark-sql or the SQL API to select some columns with a quoted alias, but Hive is ok. Such as: select 'name' as 'nm'; select 'name' as "nm"; was: It will be failed when user use spark-sql or sql API to select come columns with quoted alias, but Hive is well. Such as : select 'name' as 'nm'; select 'name' as "nm"; > select column alias should support quotation marks > -- > > Key: SPARK-24527 > URL: https://issues.apache.org/jira/browse/SPARK-24527 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: ice bai >Priority: Minor > > It fails when a user uses spark-sql or the SQL API to select some columns > with a quoted alias, but Hive is ok. Such as: > select 'name' as 'nm'; > select 'name' as "nm";
[jira] [Updated] (SPARK-24647) Sink Should Return Writen Offsets For ProgressReporting
[ https://issues.apache.org/jira/browse/SPARK-24647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vaclav Kosar updated SPARK-24647: - Description: To be able to track data lineage for Structured Streaming (I intend to implement this for the open-source project Spline), the monitoring needs to be able to track not only where the data was read from but also where the results were written to. To my knowledge, this could best be implemented by monitoring {{StreamingQueryProgress}}. However, written data offsets are currently not available on the {{Sink}} or {{StreamWriter}} interface. Implementing this as proposed would also bring symmetry to the {{StreamingQueryProgress}} fields sources and sink. *Similar Proposals* Made in the following JIRAs. These would not be sufficient for lineage tracking. * https://issues.apache.org/jira/browse/SPARK-18258 * https://issues.apache.org/jira/browse/SPARK-21313 *Current State* * Method {{Sink#addBatch}} returns {{Unit}}. * Object {{WriterCommitMessage}} does not carry any progress information about committed rows. * {{StreamingQueryProgress}} reports {{offsetSeq}} start and end in the {{sourceProgress}} value, but {{sinkProgress}} only calls the {{toString}} method. {code:java} "sources" : [ { "description" : "KafkaSource[Subscribe[test-topic]]", "startOffset" : null, "endOffset" : { "test-topic" : { "0" : 5000 }}, "numInputRows" : 5000, "processedRowsPerSecond" : 645.3278265358803 } ], "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f" } {code} *Proposed State* * Implement support only for v2 sinks, as those are to be used in the future. * {{WriterCommitMessage}} to hold optional min and max offset information of committed rows; e.g. Kafka does this by returning a {{RecordMetadata}} object from the {{send}} method. * {{StreamingQueryProgress}} to incorporate {{sinkProgress}} in a similar fashion to {{sourceProgress}}. 
{code:java} "sources" : [ { "description" : "KafkaSource[Subscribe[test-topic]]", "startOffset" : null, "endOffset" : { "test-topic" : { "0" : 5000 }}, "numInputRows" : 5000, "processedRowsPerSecond" : 645.3278265358803 } ], "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f", "startOffset" : null, "endOffset" : { "sinkTopic": { "0": 333 }} } {code} *Implementation* * PR submitters: Me and [~wajda], as soon as the prerequisite JIRA is merged. * {{Sinks}}: Modify all v2 sinks to conform to a new interface or return dummy values. * {{ProgressReporter}}: Merge offsets from different batches properly, similarly to how it is done for sources. was: To be able to track data lineage for Structured Streaming (I intend to implement this to Open Source Project Spline), the monitoring needs to be able to not only to track where the data was read from but also where results were written to. This could be to my knowledge best implemented using monitoring {{StreamingQueryProgress}}. However currently batch data offsets are not available on {{Sink}} interface. Implementing as proposed would also bring symmetry to {{StreamingQueryProgress}} fields sources and sink. *Similar Proposals* Made in following jiras. These would not be sufficient for lineage tracking. * https://issues.apache.org/jira/browse/SPARK-18258 * https://issues.apache.org/jira/browse/SPARK-21313 *Current State* * Method {{Sink#addBatch}} returns {{Unit}}. * {{StreamingQueryProgress}} reports {{offsetSeq}} start and end using {{sourceProgress}} value but {{sinkProgress}} only calls {{toString}} method. 
{code:java} "sources" : [ { "description" : "KafkaSource[Subscribe[test-topic]]", "startOffset" : null, "endOffset" : { "test-topic" : { "0" : 5000 }}, "numInputRows" : 5000, "processedRowsPerSecond" : 645.3278265358803 } ], "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f" } {code} *Proposed State* * {{Sink#addBatch}} to return {{OffsetSeq}} or {{StreamProgress}} specifying offsets of the written batch, e.g. Kafka does it by returning {{RecordMetadata}} object from {{send}} method. * {{StreamingQueryProgress}} incorporate {{sinkProgress}} in similar fashion as {{sourceProgress}}. {code:java} "sources" : [ { "description" : "KafkaSource[Subscribe[test-topic]]", "startOffset" : null, "endOffset" : { "test-topic" : { "0" : 5000 }}, "numInputRows" : 5000, "processedRowsPerSecond" : 645.3278265358803 } ], "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f", "startOffset" : null, "endOffset" { "sinkTopic": { "0": 333 }} } {code} *Implementation* * PR submitters: Likely will be me and [~wajda] as soon as the discussion ends positively. * {{Sinks}}: Modify all sinks to conform a new interface or return dummy values
[jira] [Updated] (SPARK-24647) Sink Should Return Writen Offsets For ProgressReporting
[ https://issues.apache.org/jira/browse/SPARK-24647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vaclav Kosar updated SPARK-24647: - Summary: Sink Should Return Writen Offsets For ProgressReporting (was: Sink Should Return OffsetSeqs For ProgressReporting) > Sink Should Return Writen Offsets For ProgressReporting > --- > > Key: SPARK-24647 > URL: https://issues.apache.org/jira/browse/SPARK-24647 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Vaclav Kosar >Priority: Major > > To be able to track data lineage for Structured Streaming (I intend to > implement this to Open Source Project Spline), the monitoring needs to be > able to not only to track where the data was read from but also where results > were written to. This could be to my knowledge best implemented using > monitoring {{StreamingQueryProgress}}. However currently batch data offsets > are not available on {{Sink}} interface. Implementing as proposed would also > bring symmetry to {{StreamingQueryProgress}} fields sources and sink. > > *Similar Proposals* > Made in following jiras. These would not be sufficient for lineage tracking. > * https://issues.apache.org/jira/browse/SPARK-18258 > * https://issues.apache.org/jira/browse/SPARK-21313 > > *Current State* > * Method {{Sink#addBatch}} returns {{Unit}}. > * {{StreamingQueryProgress}} reports {{offsetSeq}} start and end using > {{sourceProgress}} value but {{sinkProgress}} only calls {{toString}} method. 
> {code:java} > "sources" : [ { > "description" : "KafkaSource[Subscribe[test-topic]]", > "startOffset" : null, > "endOffset" : { "test-topic" : { "0" : 5000 }}, > "numInputRows" : 5000, > "processedRowsPerSecond" : 645.3278265358803 > } ], > "sink" : { > "description" : > "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f" > } > {code} > > > *Proposed State* > * {{Sink#addBatch}} to return {{OffsetSeq}} or {{StreamProgress}} specifying > offsets of the written batch, e.g. Kafka does it by returning > {{RecordMetadata}} object from {{send}} method. > * {{StreamingQueryProgress}} incorporate {{sinkProgress}} in similar fashion > as {{sourceProgress}}. > > > {code:java} > "sources" : [ { > "description" : "KafkaSource[Subscribe[test-topic]]", > "startOffset" : null, > "endOffset" : { "test-topic" : { "0" : 5000 }}, > "numInputRows" : 5000, > "processedRowsPerSecond" : 645.3278265358803 > } ], > "sink" : { > "description" : > "org.apache.spark.sql.execution.streaming.ConsoleSink@9da556f", > "startOffset" : null, > "endOffset" { "sinkTopic": { "0": 333 }} > } > {code} > > *Implementation* > * PR submitters: Likely will be me and [~wajda] as soon as the discussion > ends positively. > * {{Sinks}}: Modify all sinks to conform a new interface or return dummy > values. > * {{ProgressReporter}}: Merge offsets from different batches properly, > similarly to how it is done for sources. >
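The "merge offsets from different batches" step of the proposal above can be sketched in a few lines. This is a hedged illustration, not Spark code: it assumes each WriterCommitMessage carried a hypothetical partition -> (startOffset, endOffset) map (today's WriterCommitMessage carries no offsets), and folds the per-writer ranges into one start/end pair per partition, as a ProgressReporter-style reducer would.

```python
# Hypothetical merge step: combine per-writer offset ranges into the
# single sink range reported in StreamingQueryProgress, like the
# "sinkTopic": {"0": 333} example above.

def merge_commit_offsets(messages):
    """Per partition, take the min of starts and the max of ends."""
    merged = {}
    for msg in messages:
        for partition, (start, end) in msg.items():
            if partition in merged:
                lo, hi = merged[partition]
                merged[partition] = (min(lo, start), max(hi, end))
            else:
                merged[partition] = (start, end)
    return merged

# Three writer tasks commit adjacent ranges of partition "0".
messages = [{"0": (0, 100)}, {"0": (100, 250)}, {"0": (250, 333)}]
print(merge_commit_offsets(messages))  # {'0': (0, 333)}
```

The same fold works when different writers touch different partitions, which is why a per-partition map is a natural shape for the commit message.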
[jira] [Commented] (SPARK-24897) DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for stage fetchFailed
[ https://issues.apache.org/jira/browse/SPARK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558144#comment-16558144 ] liupengcheng commented on SPARK-24897: -- Already fixed in 2.x. > DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for > stage fetchFailed > -- > > Key: SPARK-24897 > URL: https://issues.apache.org/jira/browse/SPARK-24897 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.6.1 >Reporter: liupengcheng >Priority: Major > > In Spark 2.1, when a stage fetch-fails, DAGScheduler will retry both this stage > and its parent stage. However, when the parent stage is resubmitted and > starts running, the map statuses can > still be invalidated by the stage's outstanding tasks due to fetch failure. > The stage's outstanding tasks might unregister the map statuses with a new epoch, > thus causing > the parent stage repeated MetadataFetchFailed errors and finally failing the job. > > > {code:java} > 2018-07-23,01:52:33,012 WARN org.apache.spark.scheduler.TaskSetManager: Lost > task 174.0 in stage 71.0 (TID 154127, , executor 96): > FetchFailed(BlockManagerId(4945, , 22409), shuffleId=24, mapId=667, > reduceId=174, message= org.apache.spark.shuffle.FetchFailedException: Failed > to connect to /:22409 > 2018-07-23,01:52:33,013 INFO org.apache.spark.scheduler.DAGScheduler: Marking > ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) as failed due to a > fetch failure from ShuffleMapStage 69 ($plus$plus at > DeviceLocateMain.scala:236) > 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: > ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) failed in 246.856 s > 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: > Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) > and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch > failure > 2018-07-23,01:52:36,004 WARN 
org.apache.spark.scheduler.TaskSetManager: Lost > task 120.0 in stage 71.0 (TID 154073, , executor 286): > FetchFailed(BlockManagerId(4208, , 22409), shuffleId=24, mapId=241, > reduceId=120, message= org.apache.spark.shuffle.FetchFailedException: Failed > to connect to /:22409 > 2018-07-23,01:52:36,005 INFO org.apache.spark.scheduler.DAGScheduler: > Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) > and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch > failure > 2018-07-23,01:52:36,017 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece0 stored as bytes in memory (estimated size 4.0 MB, free > 26.7 MB) > 2018-07-23,01:52:36,025 INFO org.apache.spark.storage.BlockManagerInfo: > Removed broadcast_59_piece1 on :52349 in memory (size: 4.0 MB, free: > 3.0 GB) > 2018-07-23,01:52:36,029 INFO org.apache.spark.storage.BlockManagerInfo: > Removed broadcast_61_piece6 on :52349 in memory (size: 4.0 MB, free: > 3.0 GB) > 2018-07-23,01:52:36,079 INFO org.apache.spark.deploy.yarn.YarnAllocator: > Canceling requests for 0 executor containers > 2018-07-23,01:52:36,079 WARN org.apache.spark.deploy.yarn.YarnAllocator: > Expected to find pending requests, but found none. 
> 2018-07-23,01:52:36,094 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_63_piece0 in memory on :56780 (size: 4.0 MB, free: 3.7 > GB) > 2018-07-23,01:52:36,095 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece1 stored as bytes in memory (estimated size 4.0 MB, free > 30.7 MB) > 2018-07-23,01:52:36,107 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece1 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece2 stored as bytes in memory (estimated size 4.0 MB, free > 34.7 MB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece2 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece3 stored as bytes in memory (estimated size 4.0 MB, free > 38.7 MB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece3 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece4 stored as bytes in memory (estimated size 3.8 MB, free > 42.5 MB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece4 in memory on :56780 (size: 3.8 MB, free: 3.7 GB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.MapOutputTracker: Broadcast > mapstatu
[jira] [Updated] (SPARK-24897) DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for stage fetchFailed
[ https://issues.apache.org/jira/browse/SPARK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liupengcheng updated SPARK-24897: - Affects Version/s: (was: 2.3.1) (was: 2.1.0) 1.6.1 > DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for > stage fetchFailed > -- > > Key: SPARK-24897 > URL: https://issues.apache.org/jira/browse/SPARK-24897 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.6.1 >Reporter: liupengcheng >Priority: Major > > In Spark 2.1, when a stage fetch-fails, DAGScheduler will retry both this stage > and its parent stage. However, when the parent stage is resubmitted and > starts running, the map statuses can > still be invalidated by the stage's outstanding tasks due to fetch failure. > The stage's outstanding tasks might unregister the map statuses with a new epoch, > thus causing > the parent stage repeated MetadataFetchFailed errors and finally failing the job. > > > {code:java} > 2018-07-23,01:52:33,012 WARN org.apache.spark.scheduler.TaskSetManager: Lost > task 174.0 in stage 71.0 (TID 154127, , executor 96): > FetchFailed(BlockManagerId(4945, , 22409), shuffleId=24, mapId=667, > reduceId=174, message= org.apache.spark.shuffle.FetchFailedException: Failed > to connect to /:22409 > 2018-07-23,01:52:33,013 INFO org.apache.spark.scheduler.DAGScheduler: Marking > ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) as failed due to a > fetch failure from ShuffleMapStage 69 ($plus$plus at > DeviceLocateMain.scala:236) > 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: > ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) failed in 246.856 s > 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: > Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) > and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch > failure > 2018-07-23,01:52:36,004 WARN org.apache.spark.scheduler.TaskSetManager: Lost > 
task 120.0 in stage 71.0 (TID 154073, , executor 286): > FetchFailed(BlockManagerId(4208, , 22409), shuffleId=24, mapId=241, > reduceId=120, message= org.apache.spark.shuffle.FetchFailedException: Failed > to connect to /:22409 > 2018-07-23,01:52:36,005 INFO org.apache.spark.scheduler.DAGScheduler: > Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) > and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch > failure > 2018-07-23,01:52:36,017 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece0 stored as bytes in memory (estimated size 4.0 MB, free > 26.7 MB) > 2018-07-23,01:52:36,025 INFO org.apache.spark.storage.BlockManagerInfo: > Removed broadcast_59_piece1 on :52349 in memory (size: 4.0 MB, free: > 3.0 GB) > 2018-07-23,01:52:36,029 INFO org.apache.spark.storage.BlockManagerInfo: > Removed broadcast_61_piece6 on :52349 in memory (size: 4.0 MB, free: > 3.0 GB) > 2018-07-23,01:52:36,079 INFO org.apache.spark.deploy.yarn.YarnAllocator: > Canceling requests for 0 executor containers > 2018-07-23,01:52:36,079 WARN org.apache.spark.deploy.yarn.YarnAllocator: > Expected to find pending requests, but found none. 
> 2018-07-23,01:52:36,094 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_63_piece0 in memory on :56780 (size: 4.0 MB, free: 3.7 > GB) > 2018-07-23,01:52:36,095 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece1 stored as bytes in memory (estimated size 4.0 MB, free > 30.7 MB) > 2018-07-23,01:52:36,107 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece1 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece2 stored as bytes in memory (estimated size 4.0 MB, free > 34.7 MB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece2 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece3 stored as bytes in memory (estimated size 4.0 MB, free > 38.7 MB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece3 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece4 stored as bytes in memory (estimated size 3.8 MB, free > 42.5 MB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece4 in memory on :56780 (size: 3.8 MB, free: 3.7 GB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.MapOutput
[jira] [Resolved] (SPARK-24897) DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for stage fetchFailed
[ https://issues.apache.org/jira/browse/SPARK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liupengcheng resolved SPARK-24897. -- Resolution: Invalid > DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for > stage fetchFailed > -- > > Key: SPARK-24897 > URL: https://issues.apache.org/jira/browse/SPARK-24897 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.1.0, 2.3.1 >Reporter: liupengcheng >Priority: Major > > In Spark 2.1, when a stage fetch-fails, DAGScheduler will retry both this stage > and its parent stage. However, when the parent stage is resubmitted and > starts running, the map statuses can > still be invalidated by the stage's outstanding tasks due to fetch failure. > The stage's outstanding tasks might unregister the map statuses with a new epoch, > thus causing > the parent stage repeated MetadataFetchFailed errors and finally failing the job. > > > {code:java} > 2018-07-23,01:52:33,012 WARN org.apache.spark.scheduler.TaskSetManager: Lost > task 174.0 in stage 71.0 (TID 154127, , executor 96): > FetchFailed(BlockManagerId(4945, , 22409), shuffleId=24, mapId=667, > reduceId=174, message= org.apache.spark.shuffle.FetchFailedException: Failed > to connect to /:22409 > 2018-07-23,01:52:33,013 INFO org.apache.spark.scheduler.DAGScheduler: Marking > ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) as failed due to a > fetch failure from ShuffleMapStage 69 ($plus$plus at > DeviceLocateMain.scala:236) > 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: > ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) failed in 246.856 s > 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: > Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) > and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch > failure > 2018-07-23,01:52:36,004 WARN org.apache.spark.scheduler.TaskSetManager: Lost > task 120.0 in stage 71.0 
(TID 154073, , executor 286): > FetchFailed(BlockManagerId(4208, , 22409), shuffleId=24, mapId=241, > reduceId=120, message= org.apache.spark.shuffle.FetchFailedException: Failed > to connect to /:22409 > 2018-07-23,01:52:36,005 INFO org.apache.spark.scheduler.DAGScheduler: > Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) > and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch > failure > 2018-07-23,01:52:36,017 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece0 stored as bytes in memory (estimated size 4.0 MB, free > 26.7 MB) > 2018-07-23,01:52:36,025 INFO org.apache.spark.storage.BlockManagerInfo: > Removed broadcast_59_piece1 on :52349 in memory (size: 4.0 MB, free: > 3.0 GB) > 2018-07-23,01:52:36,029 INFO org.apache.spark.storage.BlockManagerInfo: > Removed broadcast_61_piece6 on :52349 in memory (size: 4.0 MB, free: > 3.0 GB) > 2018-07-23,01:52:36,079 INFO org.apache.spark.deploy.yarn.YarnAllocator: > Canceling requests for 0 executor containers > 2018-07-23,01:52:36,079 WARN org.apache.spark.deploy.yarn.YarnAllocator: > Expected to find pending requests, but found none. 
> 2018-07-23,01:52:36,094 INFO org.apache.spark.storage.BlockManagerInfo: > Added broadcast_63_piece0 in memory on :56780 (size: 4.0 MB, free: 3.7 > GB) > 2018-07-23,01:52:36,095 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece1 stored as bytes in memory (estimated size 4.0 MB, free > 30.7 MB) > 2018-07-23,01:52:36,107 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece1 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece2 stored as bytes in memory (estimated size 4.0 MB, free > 34.7 MB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece2 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece3 stored as bytes in memory (estimated size 4.0 MB, free > 38.7 MB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece3 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.MemoryStore: Block > broadcast_63_piece4 stored as bytes in memory (estimated size 3.8 MB, free > 42.5 MB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added > broadcast_63_piece4 in memory on :56780 (size: 3.8 MB, free: 3.7 GB) > 2018-07-23,01:52:36,132 INFO org.apache.spark.MapOutputTracker: Broadcast > mapstatuses size = 384, actual size = 20784475 > 2018
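The behavior the reporter asks for, not repeatedly unregistering map output or bumping the epoch for duplicate fetch failures of the same stage attempt, can be sketched as epoch gating. This is an illustrative toy, not Spark's DAGScheduler code; all class and method names here are hypothetical.

```python
# Sketch of epoch gating: a fetch failure only invalidates map output and
# bumps the epoch if it was observed under the current epoch, so stale
# duplicates from an old stage attempt cannot wipe freshly recomputed
# statuses of the resubmitted parent stage.

class MapOutputRegistry:
    def __init__(self):
        self.epoch = 0
        self.statuses = {}  # (shuffle_id, map_id) -> epoch at registration

    def register(self, shuffle_id, map_id):
        self.statuses[(shuffle_id, map_id)] = self.epoch

    def on_fetch_failure(self, shuffle_id, map_id, failure_epoch):
        if failure_epoch < self.epoch:
            return False  # stale report from an earlier attempt: ignore
        self.statuses.pop((shuffle_id, map_id), None)
        self.epoch += 1   # invalidate cached statuses exactly once
        return True

reg = MapOutputRegistry()
reg.register(24, 667)                                   # shuffleId=24, mapId=667 as in the log
first = reg.on_fetch_failure(24, 667, failure_epoch=0)  # first failure: handled
reg.register(24, 667)                                   # parent recomputes under the new epoch
dup = reg.on_fetch_failure(24, 667, failure_epoch=0)    # duplicate from the old attempt: ignored
print(first, dup, (24, 667) in reg.statuses)
```

Without the `failure_epoch < self.epoch` guard, the second (stale) failure would unregister the freshly recomputed output again, reproducing the repeated MetadataFetchFailed loop described above.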
[jira] [Updated] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
[ https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaochen Ouyang updated SPARK-24930: Description: # The root user creates a test.txt file containing a record '123' in the /root/ directory # Switch to the mr user and execute spark-shell --master local {code:java} scala> spark.version res2: String = 2.2.1 scala> spark.sql("create table t1(id int) partitioned by(area string)"); 2018-07-26 17:20:37,523 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: Location: hdfs://nameservice/spark/t1 specified for non-external table:t1 res4: org.apache.spark.sql.DataFrame = [] scala> spark.sql("load data local inpath '/root/test.txt' into table t1 partition(area ='025')") org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: /root/test.txt; at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.Dataset.(Dataset.scala:183) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) ... 48 elided scala> {code} In fact, the input path exists, but the mr user does not have permission to access the directory `/root/`, so the message thrown by `AnalysisException` can confuse the user. 
was: # root user create a test.txt file contains a record '123' in /root/ directory # switch mr user to execute spark-shell --master local {code:java} scala> spark.version res2: String = 2.2.1 scala> spark.sql("create table t1(id int) partitioned by(area string)"); 2018-07-26 17:20:37,523 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: Location: hdfs://nameservice/spark/t1 specified for non-external table:t1 res4: org.apache.spark.sql.DataFrame = [] scala> spark.sql("load data local inpath '/root/test.txt' into table t1 partition(area ='025')") org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: /root/test.txt; at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.Dataset.(Dataset.scala:183) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) ... 48 elided scala> {code} In fact, the input path exists, but the mr user does not have permission to access the directory `/root/` ,so the message throwed by `AnalysisException` can misleading user fix problem. 
> Exception information is not accurate when using `LOAD DATA LOCAL INPATH` > -- > > Key: SPARK-24930 > URL: https://issues.apache.org/jira/browse/SPARK-24930 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Xiaochen Ouyang >Priority: Major > > # root user create a test.txt file contains a record '123' in /root/ > directory > # switch mr user to execute spark-shell --master local > {code:java} > scala> spark.version > res2: String = 2.2.1 > scala> spark.sql("create table t1(id int) partitioned by(area string)"); > 2018-07-26 17:20:37,523 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: > Location: hdfs://nameservice/spark/t1 specified for non-external table:t1 > res4: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("load data local inpath '/root/test.txt' into table t1 > partition(area ='025')") > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > /root/test.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) > ... 48 elided > scala> > {code} > In fact, the input path exists, but the mr user does not have permission to > access the directory `/root/` ,so the messag
[jira] [Updated] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
[ https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaochen Ouyang updated SPARK-24930: Description: # root user create a test.txt file contains a record '123' in /root/ directory # switch mr user to execute spark-shell --master local {code:java} scala> spark.version res2: String = 2.2.1 scala> spark.sql("create table t1(id int) partitioned by(area string)"); 2018-07-26 17:20:37,523 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: Location: hdfs://nameservice/spark/t1 specified for non-external table:t1 res4: org.apache.spark.sql.DataFrame = [] scala> spark.sql("load data local inpath '/root/test.txt' into table t1 partition(area ='025')") org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: /root/test.txt; at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.Dataset.(Dataset.scala:183) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) ... 48 elided scala> {code} In fact, the input path exists, but the mr user does not have permission to access the directory `/root/` ,so the message throwed by `AnalysisException` can misleading user fix problem. 
was: # root user create a test.txt file contains a record '123' in /root/ directory # switch mr user to execute spark-shell --master local {code:java} scala> spark.version res2: String = 2.2.1 scala> spark.sql("load data local inpath '/root/test.txt' into table t1 partition(area ='025')") org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: /root/test.txt; at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.Dataset.(Dataset.scala:183) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) ... 48 elided scala> {code} In fact, the input path exists, but the mr user does not have permission to access the directory `/root/` ,so the message throwed by `AnalysisException` can misleading user fix problem. 
> Exception information is not accurate when using `LOAD DATA LOCAL INPATH` > -- > > Key: SPARK-24930 > URL: https://issues.apache.org/jira/browse/SPARK-24930 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Xiaochen Ouyang >Priority: Major > > # root user create a test.txt file contains a record '123' in /root/ > directory > # switch mr user to execute spark-shell --master local > {code:java} > scala> spark.version > res2: String = 2.2.1 > scala> spark.sql("create table t1(id int) partitioned by(area string)"); > 2018-07-26 17:20:37,523 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: > Location: hdfs://nameservice/spark/t1 specified for non-external table:t1 > res4: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("load data local inpath '/root/test.txt' into table t1 > partition(area ='025')") > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > /root/test.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) > ... 48 elided > scala> > {code} > In fact, the input path exists, but the mr user does not have permission to > access the directory `/root/` ,so the message throwed by `AnalysisException` > can misleading user fix problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.a
[jira] [Assigned] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
[ https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24930: Assignee: (was: Apache Spark) > Exception information is not accurate when using `LOAD DATA LOCAL INPATH` > -- > > Key: SPARK-24930 > URL: https://issues.apache.org/jira/browse/SPARK-24930 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Xiaochen Ouyang >Priority: Major > > # root user create a test.txt file contains a record '123' in /root/ > directory > # switch mr user to execute spark-shell --master local > {code:java} > scala> spark.version > res2: String = 2.2.1 > scala> spark.sql("load data local inpath '/root/test.txt' into table t1 > partition(area ='025')") > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > /root/test.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) > ... 48 elided > scala> > {code} > In fact, the input path exists, but the mr user does not have permission to > access the directory `/root/` ,so the message throwed by `AnalysisException` > can misleading user fix problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
[ https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558090#comment-16558090 ] Apache Spark commented on SPARK-24930: -- User 'ouyangxiaochen' has created a pull request for this issue: https://github.com/apache/spark/pull/21881 > Exception information is not accurate when using `LOAD DATA LOCAL INPATH` > -- > > Key: SPARK-24930 > URL: https://issues.apache.org/jira/browse/SPARK-24930 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Xiaochen Ouyang >Priority: Major > > # root user create a test.txt file contains a record '123' in /root/ > directory > # switch mr user to execute spark-shell --master local > {code:java} > scala> spark.version > res2: String = 2.2.1 > scala> spark.sql("load data local inpath '/root/test.txt' into table t1 > partition(area ='025')") > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > /root/test.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) > ... 48 elided > scala> > {code} > In fact, the input path exists, but the mr user does not have permission to > access the directory `/root/` ,so the message throwed by `AnalysisException` > can misleading user fix problem. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
[ https://issues.apache.org/jira/browse/SPARK-24930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24930: Assignee: Apache Spark > Exception information is not accurate when using `LOAD DATA LOCAL INPATH` > -- > > Key: SPARK-24930 > URL: https://issues.apache.org/jira/browse/SPARK-24930 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Xiaochen Ouyang >Assignee: Apache Spark >Priority: Major > > # root user create a test.txt file contains a record '123' in /root/ > directory > # switch mr user to execute spark-shell --master local > {code:java} > scala> spark.version > res2: String = 2.2.1 > scala> spark.sql("load data local inpath '/root/test.txt' into table t1 > partition(area ='025')") > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > /root/test.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) > ... 48 elided > scala> > {code} > In fact, the input path exists, but the mr user does not have permission to > access the directory `/root/` ,so the message throwed by `AnalysisException` > can misleading user fix problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12911) Cacheing a dataframe causes array comparisons to fail (in filter / where) after 1.6
[ https://issues.apache.org/jira/browse/SPARK-12911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558087#comment-16558087 ] Hyukjin Kwon commented on SPARK-12911: -- Looks indeed similar. Mind if I ask to open another JIRA for it separately? Symptom looks similar but difficult to judge if they are actually same or not. > Cacheing a dataframe causes array comparisons to fail (in filter / where) > after 1.6 > --- > > Key: SPARK-12911 > URL: https://issues.apache.org/jira/browse/SPARK-12911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.6.0 >Reporter: Jesse English >Priority: Major > > When doing a *where* operation on a dataframe and testing for equality on an > array type, after 1.6 no valid comparisons are made if the dataframe has been > cached. If it has not been cached, the results are as expected. > This appears to be related to the underlying unsafe array data types. 
> {code:title=test.scala|borderStyle=solid} > test("test array comparison") { > val vectors: Vector[Row] = Vector( > Row.fromTuple("id_1" -> Array(0L, 2L)), > Row.fromTuple("id_2" -> Array(0L, 5L)), > Row.fromTuple("id_3" -> Array(0L, 9L)), > Row.fromTuple("id_4" -> Array(1L, 0L)), > Row.fromTuple("id_5" -> Array(1L, 8L)), > Row.fromTuple("id_6" -> Array(2L, 4L)), > Row.fromTuple("id_7" -> Array(5L, 6L)), > Row.fromTuple("id_8" -> Array(6L, 2L)), > Row.fromTuple("id_9" -> Array(7L, 0L)) > ) > val data: RDD[Row] = sc.parallelize(vectors, 3) > val schema = StructType( > StructField("id", StringType, false) :: > StructField("point", DataTypes.createArrayType(LongType, false), > false) :: > Nil > ) > val sqlContext = new SQLContext(sc) > val dataframe = sqlContext.createDataFrame(data, schema) > val targetPoint:Array[Long] = Array(0L,9L) > //Cacheing is the trigger to cause the error (no cacheing causes no error) > dataframe.cache() > //This is the line where it fails > //java.util.NoSuchElementException: next on empty iterator > //However we know that there is a valid match > val targetRow = dataframe.where(dataframe("point") === > array(targetPoint.map(value => lit(value)): _*)).first() > assert(targetRow != null) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24930) Exception information is not accurate when using `LOAD DATA LOCAL INPATH`
Xiaochen Ouyang created SPARK-24930: --- Summary: Exception information is not accurate when using `LOAD DATA LOCAL INPATH` Key: SPARK-24930 URL: https://issues.apache.org/jira/browse/SPARK-24930 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1, 2.3.0, 2.2.2, 2.2.1 Reporter: Xiaochen Ouyang # The root user creates a test.txt file containing the record '123' in the /root/ directory # Switch to the mr user and run spark-shell --master local {code:java} scala> spark.version res2: String = 2.2.1 scala> spark.sql("load data local inpath '/root/test.txt' into table t1 partition(area ='025')") org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: /root/test.txt; at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:339) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.Dataset.(Dataset.scala:183) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:639) ... 48 elided scala> {code} In fact, the input path exists, but the mr user does not have permission to access the directory `/root/`, so the message thrown by `AnalysisException` can mislead the user when diagnosing the problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
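The improvement boils down to distinguishing "path does not exist" from "path exists but the caller cannot reach it". The real fix would live in Scala's `LoadDataCommand`; the sketch below is an illustrative Python analogue, and `describe_load_failure` is a hypothetical helper name, not Spark API:

```python
import os

def describe_load_failure(path):
    """Return a precise error message for a local LOAD DATA path, or
    None if the path is usable.

    Checks the parent directory's permissions first, because a plain
    existence test reports "does not exist" for paths hidden behind an
    inaccessible directory -- the ambiguity reported in SPARK-24930.
    """
    parent = os.path.dirname(path) or "."
    if not os.access(parent, os.X_OK):
        return f"LOAD DATA input path is not accessible: permission denied on directory {parent}"
    if not os.path.exists(path):
        return f"LOAD DATA input path does not exist: {path}"
    if not os.access(path, os.R_OK):
        return f"LOAD DATA input path exists but is not readable: {path}"
    return None
```

With a check like this, the mr user in the scenario above would see a permission-denied message for `/root/` instead of the misleading "does not exist".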
[jira] [Updated] (SPARK-24928) spark sql cross join running time too long
[ https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LIFULONG updated SPARK-24928: - Description: spark sql running time is too long while input left table and right table is small hdfs text format data, the sql is: select * from t1 cross join t2 the line of t1 is 49, three column the line of t2 is 1, one column only running more than 30mins and then failed spark CartesianRDD also has the same problem, example test code is: val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b") //1 line 1 column val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b") //49 line 3 column val cartesian = new CartesianRDD(sc, twos, ones) cartesian.count() running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use less than 10 seconds was: spark sql running time is too long while input left table and right table is small text format data, the sql is: select * from t1 cross join t2 the line of t1 is 49, three column the line of t2 is 1, one column only running more than 30mins and then failed spark CartesianRDD also has the same problem, example test code is: val ones = sc.textFile("file:///Users/moses/4paradigm/data/cartesian_data/t1b") //1 line 1 column val twos = sc.textFile("file:///Users/moses/4paradigm/data/cartesian_data/t2b") //49 line 3 column val cartesian = new CartesianRDD(sc, twos, ones) cartesian.count() running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use less than 10 seconds > spark sql cross join running time too long > -- > > Key: SPARK-24928 > URL: https://issues.apache.org/jira/browse/SPARK-24928 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 1.6.2 >Reporter: LIFULONG >Priority: Minor > > spark sql running time is too long while input left table and right table is > small hdfs text format data, > the sql is: select * from t1 cross join t2 > the line of t1 is 49, three column > the line of t2 is 1, one 
column only > running more than 30mins and then failed > > > spark CartesianRDD also has the same problem, example test code is: > val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b") //1 line > 1 column > val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b") //49 > line 3 column > val cartesian = new CartesianRDD(sc, twos, ones) > cartesian.count() > running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use > less than 10 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
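The asymmetry reported here follows from how `CartesianRDD` computes a partition: it iterates the first parent and requests a fresh iterator over the second parent for every outer row, so whichever RDD is placed second may be re-read from storage many times. (The report is internally inconsistent about which file is larger, but the mechanism is the same either way.) A small Python simulation of that access pattern, using hypothetical classes rather than Spark API:

```python
class CountingSource:
    """Stands in for an RDD partition whose backing data is re-scanned
    every time an iterator is requested (no caching)."""
    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def iterator(self):
        self.reads += 1  # one full scan of the source per request
        return iter(self.rows)

def cartesian(outer, inner):
    # Mirrors CartesianRDD's nested iteration: the inner source is
    # re-scanned once for every row of the outer source.
    return [(x, y) for x in outer.iterator() for y in inner.iterator()]
```

With a 49-row outer source and a 1-row inner source, the inner side is scanned 49 times; swapping the operands scans each side exactly once. When the re-scanned side is a full HDFS read, that difference plausibly explains minutes versus seconds.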
[jira] [Commented] (SPARK-24929) Merge script swallow KeyboardInterrupt
[ https://issues.apache.org/jira/browse/SPARK-24929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558065#comment-16558065 ] Apache Spark commented on SPARK-24929: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/21880 > Merge script swallow KeyboardInterrupt > -- > > Key: SPARK-24929 > URL: https://issues.apache.org/jira/browse/SPARK-24929 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Trivial > > If I want to get out of the loop to assign JIRA's user by command+c > (KeyboardInterrupt), I am unable to get out as below: > {code} > Error assigning JIRA, try again (or leave blank and fix manually) > JIRA is unassigned, choose assignee > [0] todd.chen (Reporter) > Enter number of user, or userid, to assign to (blank to leave > unassigned):Traceback (most recent call last): > File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee > "Enter number of user, or userid, to assign to (blank to leave > unassigned):") > KeyboardInterrupt > Error assigning JIRA, try again (or leave blank and fix manually) > JIRA is unassigned, choose assignee > [0] todd.chen (Reporter) > Enter number of user, or userid, to assign to (blank to leave > unassigned):Traceback (most recent call last): > File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee > "Enter number of user, or userid, to assign to (blank to leave > unassigned):") > KeyboardInterrupt > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24929) Merge script swallow KeyboardInterrupt
[ https://issues.apache.org/jira/browse/SPARK-24929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24929: Assignee: (was: Apache Spark) > Merge script swallow KeyboardInterrupt > -- > > Key: SPARK-24929 > URL: https://issues.apache.org/jira/browse/SPARK-24929 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Trivial > > If I want to get out of the loop to assign JIRA's user by command+c > (KeyboardInterrupt), I am unable to get out as below: > {code} > Error assigning JIRA, try again (or leave blank and fix manually) > JIRA is unassigned, choose assignee > [0] todd.chen (Reporter) > Enter number of user, or userid, to assign to (blank to leave > unassigned):Traceback (most recent call last): > File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee > "Enter number of user, or userid, to assign to (blank to leave > unassigned):") > KeyboardInterrupt > Error assigning JIRA, try again (or leave blank and fix manually) > JIRA is unassigned, choose assignee > [0] todd.chen (Reporter) > Enter number of user, or userid, to assign to (blank to leave > unassigned):Traceback (most recent call last): > File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee > "Enter number of user, or userid, to assign to (blank to leave > unassigned):") > KeyboardInterrupt > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24929) Merge script swallow KeyboardInterrupt
[ https://issues.apache.org/jira/browse/SPARK-24929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24929: Assignee: Apache Spark > Merge script swallow KeyboardInterrupt > -- > > Key: SPARK-24929 > URL: https://issues.apache.org/jira/browse/SPARK-24929 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Trivial > > If I want to get out of the loop to assign JIRA's user by command+c > (KeyboardInterrupt), I am unable to get out as below: > {code} > Error assigning JIRA, try again (or leave blank and fix manually) > JIRA is unassigned, choose assignee > [0] todd.chen (Reporter) > Enter number of user, or userid, to assign to (blank to leave > unassigned):Traceback (most recent call last): > File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee > "Enter number of user, or userid, to assign to (blank to leave > unassigned):") > KeyboardInterrupt > Error assigning JIRA, try again (or leave blank and fix manually) > JIRA is unassigned, choose assignee > [0] todd.chen (Reporter) > Enter number of user, or userid, to assign to (blank to leave > unassigned):Traceback (most recent call last): > File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee > "Enter number of user, or userid, to assign to (blank to leave > unassigned):") > KeyboardInterrupt > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24928) spark sql cross join running time too long
[ https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LIFULONG updated SPARK-24928: - Priority: Minor (was: Major) > spark sql cross join running time too long > -- > > Key: SPARK-24928 > URL: https://issues.apache.org/jira/browse/SPARK-24928 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 1.6.2 >Reporter: LIFULONG >Priority: Minor > > spark sql running time is too long while input left table and right table is > small text format data, > the sql is: select * from t1 cross join t2 > the line of t1 is 49, three column > the line of t2 is 1, one column only > running more than 30mins and then failed > > > spark CartesianRDD also has the same problem, example test code is: > val ones = > sc.textFile("file:///Users/moses/4paradigm/data/cartesian_data/t1b") //1 > line 1 column > val twos = > sc.textFile("file:///Users/moses/4paradigm/data/cartesian_data/t2b") > //49 line 3 column > val cartesian = new CartesianRDD(sc, twos, ones) > cartesian.count() > running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use > less than 10 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24929) Merge script swallow KeyboardInterrupt
Hyukjin Kwon created SPARK-24929: Summary: Merge script swallow KeyboardInterrupt Key: SPARK-24929 URL: https://issues.apache.org/jira/browse/SPARK-24929 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 2.4.0 Reporter: Hyukjin Kwon If I want to get out of the loop to assign JIRA's user by command+c (KeyboardInterrupt), I am unable to get out as below: {code} Error assigning JIRA, try again (or leave blank and fix manually) JIRA is unassigned, choose assignee [0] todd.chen (Reporter) Enter number of user, or userid, to assign to (blank to leave unassigned):Traceback (most recent call last): File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee "Enter number of user, or userid, to assign to (blank to leave unassigned):") KeyboardInterrupt Error assigning JIRA, try again (or leave blank and fix manually) JIRA is unassigned, choose assignee [0] todd.chen (Reporter) Enter number of user, or userid, to assign to (blank to leave unassigned):Traceback (most recent call last): File "./dev/merge_spark_pr.py", line 322, in choose_jira_assignee "Enter number of user, or userid, to assign to (blank to leave unassigned):") KeyboardInterrupt {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
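The symptom above is the classic broad-except pitfall: a bare `except:` in the retry loop catches `BaseException` subclasses including `KeyboardInterrupt`, so Ctrl+C just restarts the prompt. A minimal sketch of the fix, where the function name and `prompt_fn` parameter are illustrative rather than the actual `merge_spark_pr.py` code:

```python
def prompt_with_retry(prompt_fn):
    """Retry on bad input, but let Ctrl+C propagate so the user can exit."""
    while True:
        try:
            return prompt_fn()
        except KeyboardInterrupt:
            raise  # re-raise before the broad handler can swallow it
        except Exception:
            print("Error assigning JIRA, try again (or leave blank and fix manually)")
```

Narrowing the handler to `except Exception` (or explicitly re-raising `KeyboardInterrupt` first, as shown) is enough; an interrupt now escapes the loop instead of triggering another "try again".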
[jira] [Resolved] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24924. -- Resolution: Fixed Fixed in https://github.com/apache/spark/pull/21878 > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims at the following. > # Like the `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to the built-in Avro data source. > # Remove the incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24924: - Fix Version/s: 2.4.0 > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24924) Add mapping for built-in Avro data source
[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-24924: Assignee: Dongjoon Hyun > Add mapping for built-in Avro data source > - > > Key: SPARK-24924 > URL: https://issues.apache.org/jira/browse/SPARK-24924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > This issue aims to the followings. > # Like `com.databricks.spark.csv` mapping, we had better map > `com.databricks.spark.avro` to built-in Avro data source. > # Remove incorrect error message, `Please find an Avro package at ...`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
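The mapping itself is a simple alias table consulted during data source resolution (in Spark this is the backward-compatibility map used by `DataSource.lookupDataSource`). An illustrative Python sketch; the dict and function names here are hypothetical:

```python
# Alias table in the spirit of Spark's backward-compatibility map
# (entries shortened; illustrative only).
DATA_SOURCE_ALIASES = {
    "com.databricks.spark.csv": "csv",
    "com.databricks.spark.avro": "avro",  # the new mapping from SPARK-24924
}

def resolve_data_source(name):
    """Map a legacy package name to the built-in short name, if any."""
    return DATA_SOURCE_ALIASES.get(name, name)
```

Unmapped names pass through unchanged, so existing third-party sources keep working while old `com.databricks.spark.avro` jobs transparently pick up the built-in implementation.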
[jira] [Created] (SPARK-24928) spark sql cross join running time too long
LIFULONG created SPARK-24928: Summary: spark sql cross join running time too long Key: SPARK-24928 URL: https://issues.apache.org/jira/browse/SPARK-24928 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 1.6.2 Reporter: LIFULONG spark sql running time is too long while input left table and right table is small text format data, the sql is: select * from t1 cross join t2 the line of t1 is 49, three column the line of t2 is 1, one column only running more than 30mins and then failed spark CartesianRDD also has the same problem, example test code is: val ones = sc.textFile("file:///Users/moses/4paradigm/data/cartesian_data/t1b") //1 line 1 column val twos = sc.textFile("file:///Users/moses/4paradigm/data/cartesian_data/t2b") //49 line 3 column val cartesian = new CartesianRDD(sc, twos, ones) cartesian.count() running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use less than 10 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21436) Take advantage of known partioner for distinct on RDDs
[ https://issues.apache.org/jira/browse/SPARK-21436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558013#comment-16558013 ] zhengruifeng edited comment on SPARK-21436 at 7/26/18 7:38 AM: --- [~holdenk] It looks like {{distinct}} already utilizes the known partitioner. {{distinct}} calls {{combineByKeyWithClassTag}} internally, and will avoid the shuffle if the RDD's partitioner is equal to a hash partitioner with the same number of partitions. Or do you mean that we need to expose an API like {{distinct(partitioner: Partitioner)}} for other kinds of partitioners like {{RangePartitioner}}? was (Author: podongfeng): [~holdenk] It looks like {{distinct}} already utilizes the known partitioner. {{distinct}} calls {{combineByKeyWithClassTag}} internally, and will avoid the shuffle if the RDD's partitioner is equal to a hash partitioner with the same number of partitions. Or do you mean that we need to expose an API like {{distinct(partitioner: Partitioner)}} for other kinds of partitioners like {{RangePartitioner}}? > Take advantage of known partioner for distinct on RDDs > -- > > Key: SPARK-21436 > URL: https://issues.apache.org/jira/browse/SPARK-21436 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: holdenk >Priority: Minor > > If we have a known partitioner we should be able to avoid the shuffle. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21436) Take advantage of known partioner for distinct on RDDs
[ https://issues.apache.org/jira/browse/SPARK-21436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558013#comment-16558013 ] zhengruifeng edited comment on SPARK-21436 at 7/26/18 7:37 AM: --- [~holdenk] It looks like {{distinct}} already utilizes the known partitioner. {{distinct}} calls {{combineByKeyWithClassTag}} internally, and will avoid the shuffle if the RDD's partitioner is equal to a hash partitioner with the same number of partitions. Or do you mean that we need to expose an API like {{distinct(partitioner: Partitioner)}} for other kinds of partitioners like {{RangePartitioner}}? was (Author: podongfeng): [~holdenk] It looks like {{distinct}} already utilizes the known partitioner. {{distinct}} calls {{combineByKeyWithClassTag}} internally, and will avoid the shuffle if the RDD's partitioner is equal to a hash partitioner with the same number of partitions. Or do you mean that we need to expose an API like {{distinct(partitioner: Partitioner)}} for other kinds of partitioners like {{RangePartitioner}}? > Take advantage of known partioner for distinct on RDDs > -- > > Key: SPARK-21436 > URL: https://issues.apache.org/jira/browse/SPARK-21436 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: holdenk >Priority: Minor > > If we have a known partitioner we should be able to avoid the shuffle. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21436) Take advantage of known partioner for distinct on RDDs
[ https://issues.apache.org/jira/browse/SPARK-21436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558013#comment-16558013 ] zhengruifeng commented on SPARK-21436: -- [~holdenk] It looks like that \{distinct} already utilized the known partitioner. \{distinct} calls {{color:#ffc66d}{color:#33}combineByKeyWithClassTag} internally, and will avoid the shuffle if rdd's partitioner is equal to a hash partitioner with same #partitions. {color}{color} {color:#ffc66d}{color:#33}Or do you mean that we need to expose some API like {distinct(partitioner: {color}{color}Partitioner)} for other kinds of partitioners like \{RangePartitioner}? > Take advantage of known partioner for distinct on RDDs > -- > > Key: SPARK-21436 > URL: https://issues.apache.org/jira/browse/SPARK-21436 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: holdenk >Priority: Minor > > If we have a known partitioner we should be able to avoid the shuffle. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23683) FileCommitProtocol.instantiate to require 3-arg constructor for dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-23683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558009#comment-16558009 ] Felix Cheung commented on SPARK-23683: -- Should this be ported back to branch-2.3? We ran into this problem, and the original change (SPARK-20236) appears to be in the 2.3.0 release. > FileCommitProtocol.instantiate to require 3-arg constructor for dynamic > partition overwrite > --- > > Key: SPARK-23683 > URL: https://issues.apache.org/jira/browse/SPARK-23683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Major > Fix For: 2.4.0 > > > With SPARK-20236, {{FileCommitProtocol.instantiate()}} looks for a three-argument > constructor, passing in the {{dynamicPartitionOverwrite}} parameter. > If there is no such constructor, it falls back to the classic two-arg one. > When {{InsertIntoHadoopFsRelationCommand}} passes that > {{dynamicPartitionOverwrite}} flag down to {{FileCommitProtocol.instantiate()}}, > it _assumes_ that the instantiated protocol supports the specific > requirements of dynamic partition overwrite. It does not notice when this > does not hold, so the output generated may be incorrect. > Proposed: when {{dynamicPartitionOverwrite == true}}, require the protocol > implementation to have a 3-arg constructor.
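The lookup-and-fallback pattern described in the issue, together with the proposed fail-fast fix, can be sketched outside Spark. This is a minimal Python analogue, not the actual Scala implementation; the class and function names are hypothetical stand-ins for a committer with a 2-arg vs. 3-arg constructor.

```python
import inspect

class TwoArgProtocol:
    """Stand-in for a legacy committer with only the classic 2-arg constructor."""
    def __init__(self, job_id, path):
        self.dynamic = False

class ThreeArgProtocol:
    """Stand-in for a committer that accepts the dynamic-overwrite flag."""
    def __init__(self, job_id, path, dynamic_partition_overwrite):
        self.dynamic = dynamic_partition_overwrite

def instantiate(cls, job_id, path, dynamic_partition_overwrite=False):
    # Count constructor parameters: self + 3 means the 3-arg form exists.
    has_three_arg = len(inspect.signature(cls.__init__).parameters) >= 4
    if dynamic_partition_overwrite and not has_three_arg:
        # The proposed fix: refuse to silently fall back when the flag is
        # required, instead of producing incorrect output.
        raise TypeError(f"{cls.__name__} lacks a 3-arg constructor")
    if has_three_arg:
        return cls(job_id, path, dynamic_partition_overwrite)
    return cls(job_id, path)
```

With this shape, a 2-arg protocol still works for ordinary writes, but requesting dynamic partition overwrite against it fails at instantiation time rather than at output-commit time.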
[jira] [Resolved] (SPARK-24878) Fix reverse function for array type of primitive type containing null.
[ https://issues.apache.org/jira/browse/SPARK-24878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24878. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21830 [https://github.com/apache/spark/pull/21830] > Fix reverse function for array type of primitive type containing null. > -- > > Key: SPARK-24878 > URL: https://issues.apache.org/jira/browse/SPARK-24878 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.4.0 > > > If we use the {{reverse}} function on an array of a primitive type containing > {{null}}, and the child array is {{UnsafeArrayData}}, the function returns a > wrong result because {{UnsafeArrayData}} doesn't define the behavior of > re-assignment; in particular, we can't set a valid value after setting {{null}}.
[jira] [Commented] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files
[ https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16557985#comment-16557985 ] Apache Spark commented on SPARK-24927: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/21879 > The hadoop-provided profile doesn't play well with Snappy-compressed Parquet > files > -- > > Key: SPARK-24927 > URL: https://issues.apache.org/jira/browse/SPARK-24927 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.1, 2.3.2 >Reporter: Cheng Lian >Priority: Major > > Reproduction: > {noformat} > wget > https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz > wget > https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz > tar xzf spark-2.3.1-bin-without-hadoop.tgz > tar xzf hadoop-2.7.3.tar.gz > export SPARK_DIST_CLASSPATH=$(hadoop-2.7.3/bin/hadoop classpath) > ./spark-2.3.1-bin-without-hadoop/bin/spark-shell --master local > ... 
> scala> > spark.range(1).repartition(1).write.mode("overwrite").parquet("file:///tmp/test.parquet") > {noformat} > Exception: > {noformat} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194) > ... 69 more > Caused by: org.apache.spark.SparkException: Task failed while writing rows. 
> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.UnsatisfiedLinkError: > org.xerial.snappy.SnappyNative.maxCompressedLength(I)I > at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method) > at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316) > at > org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67) > at > org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81) > at > org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92) > at > org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112) > at > org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93) > at > org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150) > at > org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238) > at > org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121) > at > org.apache.parquet.ha
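The {{UnsatisfiedLinkError}} on {{SnappyNative.maxCompressedLength}} can indicate that an incompatible snappy-java jar picked up from the Hadoop classpath shadows the one Parquet expects. A hedged diagnostic sketch — {{find_snappy_jars}} is a hypothetical helper, not Spark or Hadoop tooling — for spotting which jars are competing on a Unix-style classpath such as {{SPARK_DIST_CLASSPATH}}:

```python
import os

def find_snappy_jars(classpath):
    """Return snappy-java jar file names in classpath order (the first one wins)."""
    hits = []
    for entry in classpath.split(":"):  # ':' is the Unix classpath separator
        name = os.path.basename(entry)
        if name.startswith("snappy-java") and name.endswith(".jar"):
            hits.append(name)
    return hits
```

If two versions show up, the earlier entry is the one the JVM loads; wildcard entries like {{/path/lib/*}} would need to be expanded before checking.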
[jira] [Assigned] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files
[ https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24927: Assignee: Apache Spark
[jira] [Assigned] (SPARK-24878) Fix reverse function for array type of primitive type containing null.
[ https://issues.apache.org/jira/browse/SPARK-24878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24878: --- Assignee: Takuya Ueshin
[jira] [Assigned] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files
[ https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24927: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files
[ https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16557603#comment-16557603 ] Cheng Lian commented on SPARK-24927: Downgraded from blocker to major, since it's not a regression. Just realized that this issue has existed since at least 1.6.