[jira] [Created] (SPARK-27290) remove unneeded sort under Aggregate

2019-03-26 Thread Xiaoju Wu (JIRA)
Xiaoju Wu created SPARK-27290:
-

 Summary: remove unneeded sort under Aggregate
 Key: SPARK-27290
 URL: https://issues.apache.org/jira/browse/SPARK-27290
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Xiaoju Wu


I saw some tickets about removing unneeded sorts from plans, but I think there is 
another case in which a sort is redundant:

A Sort directly under a non-order-preserving node is redundant, for example:

select count(*) from (select a1 from A order by a2);
+- Aggregate
  +- Sort
     +- FileScan parquet

But one of the existing test cases conflicts with this example:
test("sort should not be removed when there is a node which doesn't guarantee 
any order")

{   val orderedPlan = testRelation.select('a, 'b).orderBy('a.asc)   val 
groupedAndResorted = orderedPlan.groupBy('a)(sum('a)).orderBy('a.asc)   val 
optimized = Optimize.execute(groupedAndResorted.analyze)   val correctAnswer = 
groupedAndResorted.analyze   comparePlans(optimized, correctAnswer) }

Why is it designed like this? In my opinion, since Aggregate does not propagate its 
child's ordering, the Sort below it is useless.
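
A minimal sketch of what such an optimizer rule could look like (illustration only, not an
existing Spark rule; it deliberately ignores cases where the sort order could still matter
to the aggregate, e.g. order-sensitive aggregate functions, which may be why the existing
test keeps the Sort):

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative rule: drop a Sort whose output feeds directly into an Aggregate,
// since the Aggregate does not preserve its child's ordering anyway.
object RemoveSortBelowAggregate extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case agg @ Aggregate(_, _, Sort(_, _, sortChild)) =>
      agg.copy(child = sortChild)
  }
}
{code}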



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27083) Add a config to control subqueryReuse

2019-03-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27083.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23998
[https://github.com/apache/spark/pull/23998]

> Add a config to control subqueryReuse
> -
>
> Key: SPARK-27083
> URL: https://issues.apache.org/jira/browse/SPARK-27083
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 3.0.0
>
>
> Subquery Reuse and Exchange Reuse are not the same feature. If we do not want 
> to reuse subqueries but still want to reuse exchanges, that cannot be expressed 
> with a single configuration.
> So I think we should add a new configuration to control subquery reuse.
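
For illustration, such a switch could then be toggled independently of exchange reuse.
The config keys below are assumptions for the sketch; the thread does not show the exact
key added by the PR:

{code:scala}
import org.apache.spark.sql.SparkSession

// Config keys here are illustrative assumptions, not confirmed by this thread.
val spark = SparkSession.builder()
  .appName("subquery-reuse-demo")
  .master("local[*]")
  .config("spark.sql.execution.reuseSubquery", "false") // assumed key: disable subquery reuse
  .config("spark.sql.exchange.reuse", "true")           // assumed key: keep exchange reuse on
  .getOrCreate()
{code}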



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27083) Add a config to control subqueryReuse

2019-03-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27083:
---

Assignee: liuxian

> Add a config to control subqueryReuse
> -
>
> Key: SPARK-27083
> URL: https://issues.apache.org/jira/browse/SPARK-27083
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
>
> Subquery Reuse and Exchange Reuse are not the same feature. If we do not want 
> to reuse subqueries but still want to reuse exchanges, that cannot be expressed 
> with a single configuration.
> So I think we should add a new configuration to control subquery reuse.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27286) Handles exceptions on proceeding to next record in FilePartitionReader

2019-03-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27286.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24225
[https://github.com/apache/spark/pull/24225]

> Handles exceptions on proceeding to next record in FilePartitionReader
> --
>
> Key: SPARK-27286
> URL: https://issues.apache.org/jira/browse/SPARK-27286
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> In data source V2, the method `PartitionReader.next()` has side effects. When 
> the method is called, the current reader proceeds to the next record.
> This might throw a RuntimeException or IOException, and the file source V2 
> framework should handle these exceptions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27286) Handles exceptions on proceeding to next record in FilePartitionReader

2019-03-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27286:
---

Assignee: Gengliang Wang

> Handles exceptions on proceeding to next record in FilePartitionReader
> --
>
> Key: SPARK-27286
> URL: https://issues.apache.org/jira/browse/SPARK-27286
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> In data source V2, the method `PartitionReader.next()` has side effects. When 
> the method is called, the current reader proceeds to the next record.
> This might throw a RuntimeException or IOException, and the file source V2 
> framework should handle these exceptions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27182) Move the conflict source code of the sql/core module to sql/core/v1.2.1

2019-03-26 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-27182.
-
   Resolution: Fixed
 Assignee: Yuming Wang
Fix Version/s: 3.0.0

> Move the conflict source code of the sql/core module to sql/core/v1.2.1
> ---
>
> Key: SPARK-27182
> URL: https://issues.apache.org/jira/browse/SPARK-27182
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25464) Dropping database can remove the hive warehouse directory contents

2019-03-26 Thread Sandeep Katta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802442#comment-16802442
 ] 

Sandeep Katta commented on SPARK-25464:
---

As per the discussion in the [PR|https://github.com/apache/spark/pull/22466], it was 
decided not to allow creating a database with a LOCATION that points to a non-empty 
directory, and that case is now taken care of. The committers now have to decide what 
the behavior for this issue should be.

 

ping [~cloud_fan] [~hyukjin.kwon] [~LI,Xiao]

> Dropping database can remove the hive warehouse directory contents
> --
>
> Key: SPARK-25464
> URL: https://issues.apache.org/jira/browse/SPARK-25464
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Sushanta Sen
>Priority: Major
>
> Create Database:
> CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] db_name [COMMENT comment_text] 
> [LOCATION path] [WITH DBPROPERTIES (key1=val1, key2=val2, ...)]
> LOCATION: if the specified path does not already exist in the underlying 
> file system, this command tries to create a directory with the path. When 
> the database is dropped later, this directory should not be deleted, but 
> currently Spark deletes the directory as well.
> Please refer to the [Databricks 
> documentation|https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-database.html].
> If I create the database as below:
> create database db1 location '/user/hive/warehouse'; -- this is the Hive 
> warehouse directory
> On dropping this database, it will also delete the warehouse directory, which 
> contains the other databases' information.
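
A hedged Scala repro of the reported behavior, assuming a spark-shell session with Hive
support (the location is illustrative):

{code:scala}
// Run in spark-shell with Hive support; the location below is the shared warehouse root.
spark.sql("CREATE DATABASE db1 LOCATION '/user/hive/warehouse'")
spark.sql("DROP DATABASE db1")
// Per the report, the DROP also removes /user/hive/warehouse itself,
// wiping out the directories of every other database.
{code}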



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27289) spark-submit explicit configuration does not take effect but Spark UI shows it's effective

2019-03-26 Thread Udbhav Agrawal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802441#comment-16802441
 ] 

Udbhav Agrawal commented on SPARK-27289:


Thanks, I will check this issue.

> spark-submit explicit configuration does not take effect but Spark UI shows 
> it's effective
> --
>
> Key: SPARK-27289
> URL: https://issues.apache.org/jira/browse/SPARK-27289
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Documentation, Spark Submit, Web UI
>Affects Versions: 2.3.3
>Reporter: KaiXu
>Priority: Major
> Attachments: Capture.PNG
>
>
> The [doc|https://spark.apache.org/docs/latest/submitting-applications.html] 
> says that "In general, configuration values explicitly set on a {{SparkConf}} 
> take the highest precedence, then flags passed to {{spark-submit}}, then values 
> in the defaults file". However, when setting spark.local.dir through --conf 
> with spark-submit, it still uses the value from 
> ${SPARK_HOME}/conf/spark-defaults.conf; what's more, the Spark runtime UI 
> environment page shows the value from --conf, which is really misleading.
> e.g.
> I submitted my application with the command:
> /opt/spark233/bin/spark-submit --properties-file /opt/spark.conf --conf 
> spark.local.dir=/tmp/spark_local -v --class 
> org.apache.spark.examples.mllib.SparseNaiveBayes --master 
> spark://bdw-slave20:7077 
> /opt/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar 
> hdfs://bdw-slave20:8020/Bayes/Input
>  
> The spark.local.dir in ${SPARK_HOME}/conf/spark-defaults.conf is:
> spark.local.dir=/mnt/nvme1/spark_local
> While the application was running, I found that the intermediate shuffle data 
> was written to /mnt/nvme1/spark_local, which is the value set through 
> ${SPARK_HOME}/conf/spark-defaults.conf, but the Web UI shows the environment 
> value spark.local.dir=/tmp/spark_local.
> The spark-submit verbose output also shows spark.local.dir=/tmp/spark_local, 
> which is misleading.
>  
> !image-2019-03-27-10-59-38-377.png!
> spark-submit verbose:
> 
> Spark properties used, including those specified through
>  --conf and those from the properties file /opt/spark.conf:
>  (spark.local.dir,/tmp/spark_local)
>  (spark.default.parallelism,132)
>  (spark.driver.memory,10g)
>  (spark.executor.memory,352g)
> X



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27289) spark-submit explicit configuration does not take effect but Spark UI shows it's effective

2019-03-26 Thread KaiXu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXu updated SPARK-27289:
--
Attachment: Capture.PNG

> spark-submit explicit configuration does not take effect but Spark UI shows 
> it's effective
> --
>
> Key: SPARK-27289
> URL: https://issues.apache.org/jira/browse/SPARK-27289
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Documentation, Spark Submit, Web UI
>Affects Versions: 2.3.3
>Reporter: KaiXu
>Priority: Major
> Attachments: Capture.PNG
>
>
> The [doc|https://spark.apache.org/docs/latest/submitting-applications.html] 
> says that "In general, configuration values explicitly set on a {{SparkConf}} 
> take the highest precedence, then flags passed to {{spark-submit}}, then values 
> in the defaults file". However, when setting spark.local.dir through --conf 
> with spark-submit, it still uses the value from 
> ${SPARK_HOME}/conf/spark-defaults.conf; what's more, the Spark runtime UI 
> environment page shows the value from --conf, which is really misleading.
> e.g.
> I submitted my application with the command:
> /opt/spark233/bin/spark-submit --properties-file /opt/spark.conf --conf 
> spark.local.dir=/tmp/spark_local -v --class 
> org.apache.spark.examples.mllib.SparseNaiveBayes --master 
> spark://bdw-slave20:7077 
> /opt/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar 
> hdfs://bdw-slave20:8020/Bayes/Input
>  
> The spark.local.dir in ${SPARK_HOME}/conf/spark-defaults.conf is:
> spark.local.dir=/mnt/nvme1/spark_local
> While the application was running, I found that the intermediate shuffle data 
> was written to /mnt/nvme1/spark_local, which is the value set through 
> ${SPARK_HOME}/conf/spark-defaults.conf, but the Web UI shows the environment 
> value spark.local.dir=/tmp/spark_local.
> The spark-submit verbose output also shows spark.local.dir=/tmp/spark_local, 
> which is misleading.
>  
> !image-2019-03-27-10-59-38-377.png!
> spark-submit verbose:
> 
> Spark properties used, including those specified through
>  --conf and those from the properties file /opt/spark.conf:
>  (spark.local.dir,/tmp/spark_local)
>  (spark.default.parallelism,132)
>  (spark.driver.memory,10g)
>  (spark.executor.memory,352g)
> X



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27289) spark-submit explicit configuration does not take effect but Spark UI shows it's effective

2019-03-26 Thread KaiXu (JIRA)
KaiXu created SPARK-27289:
-

 Summary: spark-submit explicit configuration does not take effect 
but Spark UI shows it's effective
 Key: SPARK-27289
 URL: https://issues.apache.org/jira/browse/SPARK-27289
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Documentation, Spark Submit, Web UI
Affects Versions: 2.3.3
Reporter: KaiXu


The [doc|https://spark.apache.org/docs/latest/submitting-applications.html] says that 
"In general, configuration values explicitly set on a {{SparkConf}} take the 
highest precedence, then flags passed to {{spark-submit}}, then values in the 
defaults file". However, when setting spark.local.dir through --conf with 
spark-submit, it still uses the value from 
${SPARK_HOME}/conf/spark-defaults.conf; what's more, the Spark runtime UI 
environment page shows the value from --conf, which is really misleading.

e.g.

I submitted my application with the command:

/opt/spark233/bin/spark-submit --properties-file /opt/spark.conf --conf 
spark.local.dir=/tmp/spark_local -v --class 
org.apache.spark.examples.mllib.SparseNaiveBayes --master 
spark://bdw-slave20:7077 
/opt/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar 
hdfs://bdw-slave20:8020/Bayes/Input

The spark.local.dir in ${SPARK_HOME}/conf/spark-defaults.conf is:

spark.local.dir=/mnt/nvme1/spark_local

While the application was running, I found that the intermediate shuffle data 
was written to /mnt/nvme1/spark_local, which is the value set through 
${SPARK_HOME}/conf/spark-defaults.conf, but the Web UI shows the environment 
value spark.local.dir=/tmp/spark_local.

The spark-submit verbose output also shows spark.local.dir=/tmp/spark_local, 
which is misleading.

 

!image-2019-03-27-10-59-38-377.png!

spark-submit verbose:



Spark properties used, including those specified through
 --conf and those from the properties file /opt/spark.conf:
 (spark.local.dir,/tmp/spark_local)
 (spark.default.parallelism,132)
 (spark.driver.memory,10g)
 (spark.executor.memory,352g)

X
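
One way to see which value actually won is to ask the running application itself. A
hedged snippet, assuming it is run from inside the application (e.g. spark-shell):

{code:scala}
import org.apache.spark.sql.SparkSession

// Print the value the running application actually resolved for spark.local.dir,
// to compare against what --conf and the Web UI environment page report.
val spark = SparkSession.builder().getOrCreate()
println(spark.sparkContext.getConf.get("spark.local.dir", "<not set>"))
{code}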



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27224) Spark to_json parses UTC timestamp incorrectly

2019-03-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802346#comment-16802346
 ] 

Hyukjin Kwon commented on SPARK-27224:
--

Can you try to set {{timestampFormat}}?
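
For example (a hedged sketch for spark-shell; the pattern string is an assumption, and
with the SimpleDateFormat-style parsing in Spark 2.3/2.4 fractions longer than three
digits can still overflow into seconds/minutes):

{code:scala}
// Pass an explicit timestampFormat to from_json so the trailing 'Z' is handled.
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StructType, TimestampType}
import spark.implicits._

val schema = new StructType().add("t", TimestampType)
val options = Map("timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
Seq("""{"t":"2019-03-20T09:01:03.123Z"}""").toDF("json")
  .select(from_json(col("json"), schema, options))
  .show(false)
{code}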

> Spark to_json parses UTC timestamp incorrectly
> --
>
> Key: SPARK-27224
> URL: https://issues.apache.org/jira/browse/SPARK-27224
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Jeff Xu
>Priority: Major
>
> When parsing an ISO-8601 timestamp, if there is a UTC suffix symbol and more than 
> 3 digits in the fraction part, from_json will give an incorrect result.
>  
> {noformat}
> scala> val schema = new StructType().add("t", TimestampType)
> #
> # no "Z", no problem
> #
> scala> val t = "2019-03-20T09:01:03.1234567"
> scala> Seq((s"""{"t":"${t}"}""")).toDF("json").select(from_json(col("json"), 
> schema)).show(false)
> +-+
> |jsontostructs(json)      |
> +-+
> |[2019-03-20 09:01:03.123]|
> +-+
> #
> # Add "Z", incorrect
> #
> scala> val t = "2019-03-20T09:01:03.1234567Z"
> scala> Seq((s"""{"t":"${t}"}""")).toDF("json").select(from_json(col("json"), 
> schema)).show(false)
> +-+
> |jsontostructs(json)      |
> +-+
> |[2019-03-20 02:21:37.567]|
> +-+
> #
> # reduce the # of digits: the conversion is incorrect until we reach 3 
> # digits
> #
> scala> val t = "2019-03-20T09:01:03.123456Z"
> +-+
> |jsontostructs(json)      |
> +-+
> |[2019-03-20 02:03:06.456]|
> +-+
> scala> val t = "2019-03-20T09:01:03.12345Z
> +-+
> |jsontostructs(json)      |
> +-+
> |[2019-03-20 02:01:15.345]|
> +-+
> scala> val t = "2019-03-20T09:01:03.1234Z"
> +-+
> |jsontostructs(json)      |
> +-+
> |[2019-03-20 02:01:04.234]|
> +-+
> # correct when there is <=3 digits in fraction
> scala> val t = "2019-03-20T09:01:03.123Z"
> +-+
> |jsontostructs(json)      |
> +-+
> |[2019-03-20 02:01:03.123]|
> +-+
> scala> val t = "2019-03-20T09:01:03.999Z"
> +-+
> |jsontostructs(json)      |
> +-+
> |[2019-03-20 02:01:03.999]|
> +-+
> {noformat}
>  
> This could be related to SPARK-17914.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27282) Spark incorrect results when using UNION with GROUP BY clause

2019-03-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802338#comment-16802338
 ] 

Hyukjin Kwon commented on SPARK-27282:
--

What's Toolkit? Can you make the code self-reproducible?
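
For reference, a hedged self-contained variant of the snippet in the description:
Toolkit.HiveToolkit appears to be a private helper, so spark.table is used here instead
(run in spark-shell against the test_un table created above):

{code:scala}
import org.apache.spark.sql.functions.{col, count, lit}

val x = spark.table("test.test_un") // assumes the table from the CREATE TABLE above exists
val y = x
  .filter(col("col4").isNotNull)
  .groupBy("col1", "col2", "col3")
  .agg(count(col("col3")).alias("cnt"))
  .withColumn("col_name", lit("col3"))
  .select(col("col1"), col("col2"), col("col_name"), col("col3").alias("col_value"), col("cnt"))
val z = x
  .filter(col("col4").isNotNull)
  .groupBy("col1", "col2", "col4")
  .agg(count(col("col4")).alias("cnt"))
  .withColumn("col_name", lit("col4"))
  .select(col("col1"), col("col2"), col("col_name"), col("col4").alias("col_value"), col("cnt"))
y.union(z).show()
{code}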

> Spark incorrect results when using UNION with GROUP BY clause
> -
>
> Key: SPARK-27282
> URL: https://issues.apache.org/jira/browse/SPARK-27282
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, SQL
>Affects Versions: 2.3.2
> Environment: I'm using :
> IntelliJ  IDEA ==> 2018.1.4
> spark-sql and spark-core ==> 2.3.2.3.1.0.0-78 (for HDP 3.1)
> scala ==> 2.11.8
>Reporter: Sofia
>Priority: Major
>
> When using a UNION clause after a GROUP BY clause in Spark, the results 
> obtained are wrong.
> The following example illustrates this issue:
> {code:java}
> CREATE TABLE test_un (
> col1 varchar(255),
> col2 varchar(255),
> col3 varchar(255),
> col4 varchar(255)
> );
> INSERT INTO test_un (col1, col2, col3, col4)
> VALUES (1,1,2,4),
> (1,1,2,4),
> (1,1,3,5),
> (2,2,2,null);
> {code}
> I used the following code :
> {code:java}
> val x = Toolkit.HiveToolkit.getDataFromHive("test","test_un")
> val  y = x
>.filter(col("col4")isNotNull)
>   .groupBy("col1", "col2","col3")
>   .agg(count(col("col3")).alias("cnt"))
>   .withColumn("col_name", lit("col3"))
>   .select(col("col1"), col("col2"), 
> col("col_name"),col("col3").alias("col_value"), col("cnt"))
> val z = x
>   .filter(col("col4")isNotNull)
>   .groupBy("col1", "col2","col4")
>   .agg(count(col("col4")).alias("cnt"))
>   .withColumn("col_name", lit("col4"))
>   .select(col("col1"), col("col2"), 
> col("col_name"),col("col4").alias("col_value"), col("cnt"))
> y.union(z).show()
> {code}
> And I obtained the following results:
> ||col1||col2||col_name||col_value||cnt||
> |1|1|col3|5|1|
> |1|1|col3|4|2|
> |1|1|col4|5|1|
> |1|1|col4|4|2|
> Expected results:
> ||col1||col2||col_name||col_value||cnt||
> |1|1|col3|3|1|
> |1|1|col3|2|2|
> |1|1|col4|4|2|
> |1|1|col4|5|1|
> But when I remove the last row of the table, I obtain the correct results.
> {code:java}
> (2,2,2,null){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27288) Pruning nested field in complex map key from object serializers

2019-03-26 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-27288:
---

 Summary: Pruning nested field in complex map key from object 
serializers
 Key: SPARK-27288
 URL: https://issues.apache.org/jira/browse/SPARK-27288
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


In SPARK-26847, pruning nested field in complex map key was not supported, 
because some methods in schema pruning did't support it at that moment. Now as 
those methods support it, we can prune nested field in complex map key.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27285) Support describing output of a CTE

2019-03-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27285:
---

Assignee: Dilip Biswal

> Support describing output of a CTE
> --
>
> Key: SPARK-27285
> URL: https://issues.apache.org/jira/browse/SPARK-27285
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Minor
>
> SPARK-26982 allows users to describe the output of a query. However, it did not 
> support CTEs due to a limitation of the grammar, which had a single rule to 
> parse both selects and inserts. After SPARK-27209, which splits select and 
> insert parsing into two different rules, we can now easily support describing 
> the output of CTEs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27285) Support describing output of a CTE

2019-03-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27285.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24224
[https://github.com/apache/spark/pull/24224]

> Support describing output of a CTE
> --
>
> Key: SPARK-27285
> URL: https://issues.apache.org/jira/browse/SPARK-27285
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Minor
> Fix For: 3.0.0
>
>
> SPARK-26982 allows users to describe the output of a query. However, it did not 
> support CTEs due to a limitation of the grammar, which had a single rule to 
> parse both selects and inserts. After SPARK-27209, which splits select and 
> insert parsing into two different rules, we can now easily support describing 
> the output of CTEs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27269) File source v2 should validate data schema only

2019-03-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27269.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24203
[https://github.com/apache/spark/pull/24203]

> File source v2 should validate data schema only
> ---
>
> Key: SPARK-27269
> URL: https://issues.apache.org/jira/browse/SPARK-27269
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, File source v2 allows each data source to specify the supported 
> data types by implementing the method `supportsDataType` in `FileScan` and 
> `FileWriteBuilder`.
> However, in the read path, the validation checks all the data types in 
> `readSchema`, which might contain partition columns.  This is actually a 
> regression. E.g. Text data source only supports String data type, while the 
> partition columns can still contain Integer type since partition columns are 
> processed by Spark.
> This PR is to:
> 1. Refactor schema validation and check data schema only
> 2. Filter out the partition columns from the data schema if a user-specified 
> schema is provided.
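
A hedged sketch of the intended check (the helper name and signature are illustrative,
not Spark's actual API): validate only the columns of the data schema, leaving out the
partition columns that Spark materializes itself:

{code:scala}
import org.apache.spark.sql.types.{DataType, StructType}

// Illustrative helper: fail only if a *data* column has an unsupported type;
// partition columns are assumed to be filtered out of dataSchema beforehand.
def validateDataSchema(dataSchema: StructType, supportsDataType: DataType => Boolean): Unit =
  dataSchema.fields.foreach { f =>
    require(supportsDataType(f.dataType),
      s"Column ${f.name} has an unsupported data type: ${f.dataType.catalogString}")
  }
{code}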



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27269) File source v2 should validate data schema only

2019-03-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27269:


Assignee: Gengliang Wang

> File source v2 should validate data schema only
> ---
>
> Key: SPARK-27269
> URL: https://issues.apache.org/jira/browse/SPARK-27269
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, File source v2 allows each data source to specify the supported 
> data types by implementing the method `supportsDataType` in `FileScan` and 
> `FileWriteBuilder`.
> However, in the read path, the validation checks all the data types in 
> `readSchema`, which might contain partition columns.  This is actually a 
> regression. E.g. Text data source only supports String data type, while the 
> partition columns can still contain Integer type since partition columns are 
> processed by Spark.
> This PR is to:
> 1. Refactor schema validation and check data schema only
> 2. Filter out the partition columns from the data schema if a user-specified 
> schema is provided.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27275) Potential corruption in EncryptedMessage.transferTo

2019-03-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27275.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> Potential corruption in EncryptedMessage.transferTo
> ---
>
> Key: SPARK-27275
> URL: https://issues.apache.org/jira/browse/SPARK-27275
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>
> `EncryptedMessage.transferTo` has a potential corruption issue. When the 
> underlying buffer has more than `1024 * 32` bytes (this should be rare, but it 
> could happen for error messages sent over the wire), it may send only a 
> partial message, as `EncryptedMessage.count` becomes less than `transferred`. 
> This will cause the client to hang forever (or time out) while it waits for the 
> expected number of bytes, or to hit weird errors (such as corruption or a 
> silent correctness issue) if the channel is reused by other messages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27242) Avoid using default time zone in formatting TIMESTAMP/DATE literals

2019-03-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27242.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24181
[https://github.com/apache/spark/pull/24181]

> Avoid using default time zone in formatting TIMESTAMP/DATE literals
> ---
>
> Key: SPARK-27242
> URL: https://issues.apache.org/jira/browse/SPARK-27242
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Spark calls the toString() methods of java.sql.Timestamp/java.sql.Date in 
> formatting TIMESTAMP/DATE literals in Literal.sql: 
> https://github.com/apache/spark/blob/0f4f8160e6d01d2e263adcf39d53bd0a03fc1b73/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala#L373-L374
>  . This is inconsistent with the parsing of TIMESTAMP/DATE literals in AstBuilder: 
> https://github.com/apache/spark/blob/a529be2930b1d69015f1ac8f85e590f197cf53cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L1594-L1597
>  where *spark.sql.session.timeZone* is used in parsing TIMESTAMP literals, 
> and DATE literals are parsed independently of the time zone (actually in the UTC 
> time zone). The ticket aims to make parsing and formatting of date/timestamp 
> literals consistent, and to use the SQL config for TIMESTAMP literals.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27242) Avoid using default time zone in formatting TIMESTAMP/DATE literals

2019-03-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27242:
---

Assignee: Maxim Gekk

> Avoid using default time zone in formatting TIMESTAMP/DATE literals
> ---
>
> Key: SPARK-27242
> URL: https://issues.apache.org/jira/browse/SPARK-27242
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Spark calls the toString() methods of java.sql.Timestamp/java.sql.Date in 
> formatting TIMESTAMP/DATE literals in Literal.sql: 
> https://github.com/apache/spark/blob/0f4f8160e6d01d2e263adcf39d53bd0a03fc1b73/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala#L373-L374
>  . This is inconsistent with the parsing of TIMESTAMP/DATE literals in AstBuilder: 
> https://github.com/apache/spark/blob/a529be2930b1d69015f1ac8f85e590f197cf53cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L1594-L1597
>  where *spark.sql.session.timeZone* is used in parsing TIMESTAMP literals, 
> and DATE literals are parsed independently of the time zone (actually in the UTC 
> time zone). The ticket aims to make parsing and formatting of date/timestamp 
> literals consistent, and to use the SQL config for TIMESTAMP literals.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27287) PCAModel.load() does not honor spark configs

2019-03-26 Thread Dharmesh Kakadia (JIRA)
Dharmesh Kakadia created SPARK-27287:


 Summary: PCAModel.load() does not honor spark configs
 Key: SPARK-27287
 URL: https://issues.apache.org/jira/browse/SPARK-27287
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.4.0
Reporter: Dharmesh Kakadia


PCAModel.load() does not seem to be using the configurations set on the current 
Spark session.

Repro:

The following will fail to read the data because the storage account credentials 
config is not used/propagated:

conf.set("fs.azure.account.key.test.blob.core.windows.net","Xosad==")
spark = SparkSession.builder.appName("dharmesh").config(conf=conf).master('spark://spark-master:7077').getOrCreate()
model = PCAModel.load('wasb://t...@test.blob.core.windows.net/model')

The following, however, works:

conf.set("fs.azure.account.key.test.blob.core.windows.net","Xosad==")
spark = SparkSession.builder.appName("dharmesh").config(conf=conf).master('spark://spark-master:7077').getOrCreate()
blah = spark.read.json('wasb://t...@test.blob.core.windows.net/somethingelse/')
blah.show()
model = PCAModel.load('wasb://t...@test.blob.core.windows.net/model')

It looks like spark.read...() does force the config to be applied once, after which 
PCAModel.load() works correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27220) Remove Yarn specific leftover from CoarseGrainedSchedulerBackend

2019-03-26 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802202#comment-16802202
 ] 

Imran Rashid commented on SPARK-27220:
--

Sorry for the delay. I don't have a really strong opinion here. It does seem 
that field is only used by YARN. But because it's getting updated inside the 
handling of {{RegisterExecutor}}, it's kind of a headache to extract.

I actually think executorIds are really always ints -- that field is only a 
string because there are places where the string "driver" is used as an "executorId", 
e.g. in standalone mode: 
https://github.com/apache/spark/blob/05168e725d2a17c4164ee5f9aa068801ec2454f4/core/src/main/scala/org/apache/spark/deploy/master/ApplicationInfo.scala#L70

I feel like this is too minor to bother with, really ...

> Remove Yarn specific leftover from CoarseGrainedSchedulerBackend
> 
>
> Key: SPARK-27220
> URL: https://issues.apache.org/jira/browse/SPARK-27220
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, YARN
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.3, 2.4.0
>Reporter: Jacek Lewandowski
>Priority: Minor
>
> {{CoarseGrainedSchedulerBackend}} has the following field:
> {code:scala}
>   // The num of current max ExecutorId used to re-register appMaster
>   @volatile protected var currentExecutorIdCounter = 0
> {code}
> which is then updated:
> {code:scala}
>   case RegisterExecutor(executorId, executorRef, hostname, cores, 
> logUrls) =>
> ...
>   // This must be synchronized because variables mutated
>   // in this block are read when requesting executors
>   CoarseGrainedSchedulerBackend.this.synchronized {
> executorDataMap.put(executorId, data)
> if (currentExecutorIdCounter < executorId.toInt) {
>   currentExecutorIdCounter = executorId.toInt
> }
> ...
> {code}
> However it is never really used in {{CoarseGrainedSchedulerBackend}}. Its 
> only usage is in Yarn-specific code. It should be moved to Yarn then because 
> {{executorId}} is a {{String}} and there are really no guarantees that it is 
> always an integer. It was introduced in SPARK-12864



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24793) Make spark-submit more useful with k8s

2019-03-26 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24793:
--

Assignee: Stavros Kontopoulos

> Make spark-submit more useful with k8s
> --
>
> Key: SPARK-24793
> URL: https://issues.apache.org/jira/browse/SPARK-24793
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Assignee: Stavros Kontopoulos
>Priority: Major
>
> Support controlling the lifecycle of Spark Application through spark-submit. 
> For example:
> {{ 
>   --kill app_name   If given, kills the driver specified.
>   --status app_name  If given, requests the status of the driver 
> specified.
> }}
> Potentially also --list to list all spark drivers running.
> Given that our submission client can actually launch jobs into many different 
> namespaces, we'll need an additional specification of the namespace through a 
> --namespace flag potentially.
> I think this is pretty useful to have instead of forcing a user to use 
> kubectl to manage the lifecycle of any k8s Spark Application.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24793) Make spark-submit more useful with k8s

2019-03-26 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24793.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23599
[https://github.com/apache/spark/pull/23599]

> Make spark-submit more useful with k8s
> --
>
> Key: SPARK-24793
> URL: https://issues.apache.org/jira/browse/SPARK-24793
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 3.0.0
>
>
> Support controlling the lifecycle of Spark Application through spark-submit. 
> For example:
> {{ 
>   --kill app_name   If given, kills the driver specified.
>   --status app_name  If given, requests the status of the driver 
> specified.
> }}
> Potentially also --list to list all spark drivers running.
> Given that our submission client can actually launch jobs into many different 
> namespaces, we'll need an additional specification of the namespace through a 
> --namespace flag potentially.
> I think this is pretty useful to have instead of forcing a user to use 
> kubectl to manage the lifecycle of any k8s Spark Application.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27286) Handles exceptions on proceeding to next record in FilePartitionReader

2019-03-26 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-27286:
--

 Summary: Handles exceptions on proceeding to next record in 
FilePartitionReader
 Key: SPARK-27286
 URL: https://issues.apache.org/jira/browse/SPARK-27286
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


In data source V2, the method `PartitionReader.next()` has side effects. When 
the method is called, the current reader proceeds to the next record.
This might throw a RuntimeException or IOException, and the file source V2 
framework should handle these exceptions.
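
A hedged sketch of the kind of handling this describes. The wrapper and its behavior are
illustrative only; the real FilePartitionReader logic (e.g. honoring
spark.sql.files.ignoreCorruptFiles and logging) is more involved, and the PartitionReader
package path varies across Spark versions:

{code:scala}
import java.io.IOException

import org.apache.spark.sql.sources.v2.reader.PartitionReader

// Illustrative wrapper: catch exceptions thrown while advancing to the next record
// instead of letting them fail the whole task.
class SafePartitionReader[T](delegate: PartitionReader[T]) extends PartitionReader[T] {
  override def next(): Boolean =
    try {
      delegate.next()
    } catch {
      case _: IOException | _: RuntimeException =>
        false // e.g. stop reading the current (possibly corrupt) file
    }

  override def get(): T = delegate.get()

  override def close(): Unit = delegate.close()
}
{code}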



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24432) Add support for dynamic resource allocation

2019-03-26 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24432:
--

Assignee: Marcelo Vanzin

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Yinan Li
>Assignee: Marcelo Vanzin
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation into the Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24432) Add support for dynamic resource allocation

2019-03-26 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24432:
--

Assignee: (was: Marcelo Vanzin)

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Yinan Li
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation into the Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27276) Increase the minimum pyarrow version to 0.12.0

2019-03-26 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-27276:
-
Description: 
The current minimum version is 0.8.0, which is pretty ancient since Arrow has 
been moving fast and a lot has changed since this version. There are currently 
many workarounds checking for different versions or disabling specific 
functionality, and the code is getting ugly and difficult to maintain. 
Increasing the version will allow cleanup and upgrade the testing environment.

This involves changing the pyarrow version in setup.py (currently at 0.8.0), 
updating Jenkins to test against the new version, code cleanup to remove 
workarounds from older versions.  Newer versions of pyarrow have dropped 
support for Python 3.4, so it might be necessary to update to Python 3.5+ in 
Jenkins as well. Users would then need to ensure at least this version of 
pyarrow is installed on the cluster.

There is also a 0.12.1 release, so I will need to check what bugs that fixed to 
see if that will be a better version.

  was:
The current minimum version is 0.8.0, which is pretty ancient since Arrow has 
been moving fast and a lot has changed since this version. There are currently 
many workarounds checking for different versions or disabling specific 
functionality, and the code is getting ugly and difficult to maintain. 
Increasing the version will allow cleanup and upgrade the testing environment.

This involves changing the pyarrow version in setup.py (currently at 0.8.0), 
updating Jenkins to test against the new version, code cleanup to remove 
workarounds from older versions.  Users would then need to ensure this version 
is installed on the cluster.

There is also a 0.12.1 release, so I will need to check what bugs that fixed to 
see if that will be a better version.


> Increase the minimum pyarrow version to 0.12.0
> --
>
> Key: SPARK-27276
> URL: https://issues.apache.org/jira/browse/SPARK-27276
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> The current minimum version is 0.8.0, which is pretty ancient since Arrow has 
> been moving fast and a lot has changed since this version. There are 
> currently many workarounds checking for different versions or disabling 
> specific functionality, and the code is getting ugly and difficult to 
> maintain. Increasing the version will allow cleanup and upgrade the testing 
> environment.
> This involves changing the pyarrow version in setup.py (currently at 0.8.0), 
> updating Jenkins to test against the new version, code cleanup to remove 
> workarounds from older versions.  Newer versions of pyarrow have dropped 
> support for Python 3.4, so it might be necessary to update to Python 3.5+ in 
> Jenkins as well. Users would then need to ensure at least this version of 
> pyarrow is installed on the cluster.
> There is also a 0.12.1 release, so I will need to check what bugs that fixed 
> to see if that will be a better version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27269) File source v2 should validate data schema only

2019-03-26 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-27269:
---
Issue Type: Bug  (was: Task)

> File source v2 should validate data schema only
> ---
>
> Key: SPARK-27269
> URL: https://issues.apache.org/jira/browse/SPARK-27269
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Currently, File source v2 allows each data source to specify the supported 
> data types by implementing the method `supportsDataType` in `FileScan` and 
> `FileWriteBuilder`.
> However, in the read path, the validation checks all the data types in 
> `readSchema`, which might contain partition columns.  This is actually a 
> regression. E.g. Text data source only supports String data type, while the 
> partition columns can still contain Integer type since partition columns are 
> processed by Spark.
> This PR is to:
> 1. Refactor schema validation and check data schema only
> 2. Filter out the partition columns from the data schema if a user-specified 
> schema is provided.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data

2019-03-26 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801960#comment-16801960
 ] 

Imran Rashid commented on SPARK-21097:
--

{quote}
I'm wondering if this is going to be subsumed by the Shuffle Service redesign 
proposal.
{quote}

The shuffle service redesign does not subsume this.  It only covers shuffle 
blocks currently.  While there are some commonalities in shuffle vs. cached 
blocks, there are also enough differences they need special handling.

You could *extend* the redesigned shuffle service to do something like this, 
but it wouldn't be exactly the same.  You'd be giving up locality on the 
executors if you move cached rdd blocks over, so it may be tricky to decide 
when to move blocks over.  It would definitely be additional work we'd do on 
top of the shuffle service (which I think might make the most sense as the way 
to do this in any case).

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>Priority: Major
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config. Now when an executor reaches its configured idle timeout, 
> instead of just killing it on the spot, we will stop sending it new tasks, 
> replicate all of its rdd blocks onto other executors, and then kill it. If 
> there is an issue while we replicate the data, like an error, it takes too 
> long, or there isn't enough space, then we will fall back to the original 
> behavior and drop the data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also since it will be completely opt-in it will 
> unlikely to cause problems for other use cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27282) Spark incorrect results when using UNION with GROUP BY clause

2019-03-26 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801930#comment-16801930
 ] 

Marco Gaido commented on SPARK-27282:
-

Please do not use Blocker/Critical, as those priorities are reserved for committers. 
I think this bug has already been fixed: I am not able to reproduce it on master. Could 
you please try as well? Unfortunately, I am not sure which JIRA actually fixed this 
behavior though...

> Spark incorrect results when using UNION with GROUP BY clause
> -
>
> Key: SPARK-27282
> URL: https://issues.apache.org/jira/browse/SPARK-27282
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, SQL
>Affects Versions: 2.3.2
> Environment: I'm using :
> IntelliJ  IDEA ==> 2018.1.4
> spark-sql and spark-core ==> 2.3.2.3.1.0.0-78 (for HDP 3.1)
> scala ==> 2.11.8
>Reporter: Sofia
>Priority: Major
>
> When using a UNION clause after a GROUP BY clause in Spark, the results 
> obtained are wrong.
> The following example illustrates this issue:
> {code:java}
> CREATE TABLE test_un (
> col1 varchar(255),
> col2 varchar(255),
> col3 varchar(255),
> col4 varchar(255)
> );
> INSERT INTO test_un (col1, col2, col3, col4)
> VALUES (1,1,2,4),
> (1,1,2,4),
> (1,1,3,5),
> (2,2,2,null);
> {code}
> I used the following code :
> {code:java}
> val x = Toolkit.HiveToolkit.getDataFromHive("test","test_un")
> val  y = x
>.filter(col("col4")isNotNull)
>   .groupBy("col1", "col2","col3")
>   .agg(count(col("col3")).alias("cnt"))
>   .withColumn("col_name", lit("col3"))
>   .select(col("col1"), col("col2"), 
> col("col_name"),col("col3").alias("col_value"), col("cnt"))
> val z = x
>   .filter(col("col4")isNotNull)
>   .groupBy("col1", "col2","col4")
>   .agg(count(col("col4")).alias("cnt"))
>   .withColumn("col_name", lit("col4"))
>   .select(col("col1"), col("col2"), 
> col("col_name"),col("col4").alias("col_value"), col("cnt"))
> y.union(z).show()
> {code}
> And I obtained the following results:
> ||col1||col2||col_name||col_value||cnt||
> |1|1|col3|5|1|
> |1|1|col3|4|2|
> |1|1|col4|5|1|
> |1|1|col4|4|2|
> Expected results:
> ||col1||col2||col_name||col_value||cnt||
> |1|1|col3|3|1|
> |1|1|col3|2|2|
> |1|1|col4|4|2|
> |1|1|col4|5|1|
> But when I remove the last row of the table, I obtain the correct results.
> {code:java}
> (2,2,2,null){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27282) Spark incorrect results when using UNION with GROUP BY clause

2019-03-26 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-27282:

Priority: Major  (was: Blocker)

> Spark incorrect results when using UNION with GROUP BY clause
> -
>
> Key: SPARK-27282
> URL: https://issues.apache.org/jira/browse/SPARK-27282
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, SQL
>Affects Versions: 2.3.2
> Environment: I'm using :
> IntelliJ  IDEA ==> 2018.1.4
> spark-sql and spark-core ==> 2.3.2.3.1.0.0-78 (for HDP 3.1)
> scala ==> 2.11.8
>Reporter: Sofia
>Priority: Major
>
> When using a UNION clause after a GROUP BY clause in Spark, the results 
> obtained are wrong.
> The following example illustrates this issue:
> {code:java}
> CREATE TABLE test_un (
> col1 varchar(255),
> col2 varchar(255),
> col3 varchar(255),
> col4 varchar(255)
> );
> INSERT INTO test_un (col1, col2, col3, col4)
> VALUES (1,1,2,4),
> (1,1,2,4),
> (1,1,3,5),
> (2,2,2,null);
> {code}
> I used the following code :
> {code:java}
> val x = Toolkit.HiveToolkit.getDataFromHive("test","test_un")
> val  y = x
>.filter(col("col4")isNotNull)
>   .groupBy("col1", "col2","col3")
>   .agg(count(col("col3")).alias("cnt"))
>   .withColumn("col_name", lit("col3"))
>   .select(col("col1"), col("col2"), 
> col("col_name"),col("col3").alias("col_value"), col("cnt"))
> val z = x
>   .filter(col("col4")isNotNull)
>   .groupBy("col1", "col2","col4")
>   .agg(count(col("col4")).alias("cnt"))
>   .withColumn("col_name", lit("col4"))
>   .select(col("col1"), col("col2"), 
> col("col_name"),col("col4").alias("col_value"), col("cnt"))
> y.union(z).show()
> {code}
> And I obtained the following results:
> ||col1||col2||col_name||col_value||cnt||
> |1|1|col3|5|1|
> |1|1|col3|4|2|
> |1|1|col4|5|1|
> |1|1|col4|4|2|
> Expected results:
> ||col1||col2||col_name||col_value||cnt||
> |1|1|col3|3|1|
> |1|1|col3|2|2|
> |1|1|col4|4|2|
> |1|1|col4|5|1|
> But when I remove the last row of the table, I obtain the correct results.
> {code:java}
> (2,2,2,null){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27285) Support describing output of a CTE

2019-03-26 Thread Dilip Biswal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dilip Biswal updated SPARK-27285:
-
Priority: Minor  (was: Major)

> Support describing output of a CTE
> --
>
> Key: SPARK-27285
> URL: https://issues.apache.org/jira/browse/SPARK-27285
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Priority: Minor
>
> SPARK-26982 allows users to describe the output of a query. However, it did not 
> support CTEs due to a limitation of the grammar, which had a single rule to 
> parse both selects and inserts. After SPARK-27209, which splits select and 
> insert parsing into two different rules, we can now easily support describing 
> the output of CTEs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27285) Support describing output of a CTE

2019-03-26 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-27285:


 Summary: Support describing output of a CTE
 Key: SPARK-27285
 URL: https://issues.apache.org/jira/browse/SPARK-27285
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Dilip Biswal


SPARK-26982 allows users to describe the output of a query. However, it did not support 
CTEs due to a limitation of the grammar, which had a single rule to parse both selects 
and inserts. After SPARK-27209, which splits select and insert parsing into two different 
rules, we can now easily support describing the output of CTEs.
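
A hedged example of what this enables (the exact syntax may differ from what the final
PR settled on):

{code:scala}
// spark-shell sketch: describe the output schema of a query that starts with a CTE.
spark.sql(
  "DESCRIBE QUERY WITH cte AS (SELECT 1 AS id, 'a' AS name) SELECT * FROM cte"
).show()
{code}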



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27224) Spark to_json parses UTC timestamp incorrectly

2019-03-26 Thread Jurriaan Pruis (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801802#comment-16801802
 ] 

Jurriaan Pruis commented on SPARK-27224:


[~hyukjin.kwon] is this the same as 
https://issues.apache.org/jira/browse/SPARK-17914?focusedCommentId=16750292&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16750292
 ? We're also having this issue when upgrading to 2.4.0.

> Spark to_json parses UTC timestamp incorrectly
> --
>
> Key: SPARK-27224
> URL: https://issues.apache.org/jira/browse/SPARK-27224
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Jeff Xu
>Priority: Major
>
> When parsing an ISO-8601 timestamp, if there is a UTC suffix symbol ("Z") and more 
> than 3 digits in the fractional part, from_json gives an incorrect result.
>  
> {noformat}
> scala> val schema = new StructType().add("t", TimestampType)
> #
> # no "Z", no problem
> #
> scala> val t = "2019-03-20T09:01:03.1234567"
> scala> Seq((s"""{"t":"${t}"}""")).toDF("json").select(from_json(col("json"), 
> schema)).show(false)
> +-+
> |jsontostructs(json)      |
> +-+
> |[2019-03-20 09:01:03.123]|
> +-+
> #
> # Add "Z", incorrect
> #
> scala> val t = "2019-03-20T09:01:03.1234567Z"
> scala> Seq((s"""{"t":"${t}"}""")).toDF("json").select(from_json(col("json"), 
> schema)).show(false)
> +-+
> |jsontostructs(json)      |
> +-+
> |[2019-03-20 02:21:37.567]|
> +-+
> #
> # reduce the number of digits; the conversion stays incorrect until we reach 
> # only 3 digits
> #
> scala> val t = "2019-03-20T09:01:03.123456Z"
> +-+
> |jsontostructs(json)      |
> +-+
> |[2019-03-20 02:03:06.456]|
> +-+
> scala> val t = "2019-03-20T09:01:03.12345Z
> +-+
> |jsontostructs(json)      |
> +-+
> |[2019-03-20 02:01:15.345]|
> +-+
> scala> val t = "2019-03-20T09:01:03.1234Z"
> +-+
> |jsontostructs(json)      |
> +-+
> |[2019-03-20 02:01:04.234]|
> +-+
> # correct when there is <=3 digits in fraction
> scala> val t = "2019-03-20T09:01:03.123Z"
> +-+
> |jsontostructs(json)      |
> +-+
> |[2019-03-20 02:01:03.123]|
> +-+
> scala> val t = "2019-03-20T09:01:03.999Z"
> +-+
> |jsontostructs(json)      |
> +-+
> |[2019-03-20 02:01:03.999]|
> +-+
> {noformat}
>  
> This could be related to SPARK-17914.
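
A possible mitigation sketch while the parsing issue is open, assuming it is acceptable 
to truncate the fractional part to millisecond precision before parsing (the column 
names here are illustrative):
{code:java}
// Hedged workaround sketch: trim fractional seconds to 3 digits before from_json,
// so the extra digits cannot shift the parsed value. Assumes spark-shell implicits.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val schema = new StructType().add("t", TimestampType)
val df = Seq("""{"t":"2019-03-20T09:01:03.1234567Z"}""").toDF("json")

df.withColumn("json", regexp_replace($"json", """(\.\d{3})\d+""", "$1"))
  .select(from_json($"json", schema).as("parsed"))
  .show(false)
{code}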



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17914) Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp

2019-03-26 Thread Jurriaan Pruis (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801795#comment-16801795
 ] 

Jurriaan Pruis commented on SPARK-17914:


I'm also seeing this issue where the millisecond part 'overflows' into the rest 
of the timestamp in Spark 2.4.0 as described in the comment above. To me it 
seems like this issue isn't resolved yet. cc [~ueshin]

> Spark SQL casting to TimestampType with nanosecond results in incorrect 
> timestamp
> -
>
> Key: SPARK-17914
> URL: https://issues.apache.org/jira/browse/SPARK-17914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Oksana Romankova
>Assignee: Anton Okolnychyi
>Priority: Major
> Fix For: 2.2.0, 2.3.0
>
>
> In some cases when timestamps contain nanoseconds they will be parsed 
> incorrectly. 
> Examples: 
> "2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.034567"
> "2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.345678"
> The issue seems to be happening in DateTimeUtils.stringToTimestamp(). It 
> assumes that only a 6-digit fraction of a second will be passed.
> With this being the case, I would suggest either discarding nanoseconds 
> automatically or throwing an exception that prompts users to pre-format 
> timestamps to microsecond precision before casting to Timestamp.
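
A hedged sketch of the suggested pre-formatting, truncating the fraction to microsecond 
precision before casting; this is a workaround sketch, not the fix that shipped in 
2.2.0/2.3.0:
{code:java}
// Hedged sketch: drop digits beyond microsecond precision before the timestamp cast.
import org.apache.spark.sql.functions._
import spark.implicits._

val raw = Seq("2016-05-14T15:12:14.000345678Z").toDF("raw")
raw.select(
  regexp_replace($"raw", """(\.\d{6})\d+""", "$1").cast("timestamp").as("ts")
).show(false)
{code}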



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27283) BigDecimal arithmetic losing precision

2019-03-26 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801778#comment-16801778
 ] 

Marco Gaido commented on SPARK-27283:
-

[~Mats_SX] another issue that could happen when using Decimal instead of double is 
not being able to represent very big values (Spark's decimal max precision is 38).

> BigDecimal arithmetic losing precision
> --
>
> Key: SPARK-27283
> URL: https://issues.apache.org/jira/browse/SPARK-27283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Mats
>Priority: Minor
>  Labels: decimal, float, sql
>
> When performing arithmetic between doubles and decimals, the resulting value 
> is always a double. This is very strange to me; when an exact type is present 
> as one of the inputs, I would expect that the inexact type is lifted and the 
> result presented exactly, rather than lowering the exact type to the inexact 
> and presenting a result that may contain rounding errors. The choice to use a 
> decimal was probably taken because rounding errors were deemed an issue.
> When performing arithmetic between decimals and integers, the expected 
> behaviour is seen; the result is a decimal.
> See the following example:
> {code:java}
> import org.apache.spark.sql.functions
> val df = sparkSession.createDataFrame(Seq(Tuple1(0L))).toDF("a")
> val decimalInt = df.select(functions.lit(BigDecimal(3.14)) + functions.lit(1) 
> as "d")
> val decimalDouble = df.select(functions.lit(BigDecimal(3.14)) + 
> functions.lit(1.0) as "d")
> decimalInt.schema.printTreeString()
> decimalInt.show()
> decimalDouble.schema.printTreeString()
> decimalDouble.show(){code}
> which produces this output (with possible variation on the rounding error):
> {code:java}
> root
> |-- d: decimal(4,2) (nullable = true)
> ++
> | d  |
> ++
> |4.14|
> ++
> root
> |-- d: double (nullable = false)
> +-+
> | d   |
> +-+
> |4.141|
> +-+
> {code}
>  
>  I would argue that this is a bug, and that the correct thing to do would be 
> to lift the result to a decimal also when one operand is a double.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27248) REFRESH TABLE should recreate cache with same cache name and storage level

2019-03-26 Thread William Wong (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801702#comment-16801702
 ] 

William Wong commented on SPARK-27248:
--

https://github.com/apache/spark/pull/24221

Hi @Hyukjin, just created a PR. Hope it is good enough. If not, please let me 
know and I will fix it. Many thanks. Best regards, William

> REFRESH TABLE should recreate cache with same cache name and storage level
> --
>
> Key: SPARK-27248
> URL: https://issues.apache.org/jira/browse/SPARK-27248
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: William Wong
>Priority: Major
>
> If we refresh a cached table, the table cache will first be uncached and then 
> recached (lazily). Currently, the logic is embedded in the 
> CatalogImpl.refreshTable method.
> The current implementation does not preserve the cache name and storage 
> level. As a result, the cache name and storage level could change after a 
> REFRESH. IMHO, this is not what a user would expect.
> I would like to fix this behavior by first saving the cache name and storage 
> level and reusing them when recaching the table.
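
A minimal sketch of the behaviour being described, assuming Spark 2.3+ where 
Catalog.cacheTable accepts an explicit storage level; the table name is a placeholder:
{code:java}
// Hedged illustration: cache with an explicit storage level, refresh, then touch the
// table again; per this ticket the cache name/storage level may not survive the refresh.
import org.apache.spark.storage.StorageLevel

spark.catalog.cacheTable("my_table", StorageLevel.MEMORY_ONLY) // "my_table" is a placeholder
spark.catalog.refreshTable("my_table")                         // uncaches, then re-caches lazily
spark.table("my_table").count()                                // materializes the new cache
{code}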



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27284) Spark Standalone aggregated logs in 1 file per appid (ala yarn -logs -applicationId)

2019-03-26 Thread t oo (JIRA)
t oo created SPARK-27284:


 Summary: Spark Standalone aggregated logs in 1 file per appid (ala 
 yarn -logs -applicationId)
 Key: SPARK-27284
 URL: https://issues.apache.org/jira/browse/SPARK-27284
 Project: Spark
  Issue Type: New Feature
  Components: Scheduler
Affects Versions: 2.4.0, 2.3.3
Reporter: t oo


Feature: Spark Standalone aggregated logs in 1 file per appid (ala  yarn -logs 
-applicationId)

 

This would be a single file per appid with the contents of ALL the executors' logs.

[https://stackoverflow.com/questions/46004528/how-can-i-see-the-aggregated-logs-for-a-spark-standalone-cluster]

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27277) Recover from setting fix version failure in merge script

2019-03-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27277.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24213
[https://github.com/apache/spark/pull/24213]

> Recover from setting fix version failure in merge script
> 
>
> Key: SPARK-27277
> URL: https://issues.apache.org/jira/browse/SPARK-27277
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> I happened to meet this case a few times before:
> {code}
> Enter comma-separated fix version(s) [3.0.0]: 3.0,0
> Restoring head pointer to master
> git checkout master
> Already on 'master'
> git branch
> Traceback (most recent call last):
>   File "./dev/merge_spark_pr_jira.py", line 537, in 
> main()
>   File "./dev/merge_spark_pr_jira.py", line 523, in main
> resolve_jira_issues(title, merged_refs, jira_comment)
>   File "./dev/merge_spark_pr_jira.py", line 359, in resolve_jira_issues
> resolve_jira_issue(merge_branches, comment, jira_id)
>   File "./dev/merge_spark_pr_jira.py", line 302, in resolve_jira_issue
> jira_fix_versions = map(lambda v: get_version_json(v), fix_versions)
>   File "./dev/merge_spark_pr_jira.py", line 302, in 
> jira_fix_versions = map(lambda v: get_version_json(v), fix_versions)
>   File "./dev/merge_spark_pr_jira.py", line 300, in get_version_json
> return filter(lambda v: v.name == version_str, versions)[0].raw
> IndexError: list index out of range
> {code}
> I typed the fix version wrongly and it ended the loop in the merge script. 
> Not a big deal, but it has bugged me a few times.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27277) Recover from setting fix version failure in merge script

2019-03-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27277:


Assignee: Hyukjin Kwon

> Recover from setting fix version failure in merge script
> 
>
> Key: SPARK-27277
> URL: https://issues.apache.org/jira/browse/SPARK-27277
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> I happened to meet this case a few times before:
> {code}
> Enter comma-separated fix version(s) [3.0.0]: 3.0,0
> Restoring head pointer to master
> git checkout master
> Already on 'master'
> git branch
> Traceback (most recent call last):
>   File "./dev/merge_spark_pr_jira.py", line 537, in 
> main()
>   File "./dev/merge_spark_pr_jira.py", line 523, in main
> resolve_jira_issues(title, merged_refs, jira_comment)
>   File "./dev/merge_spark_pr_jira.py", line 359, in resolve_jira_issues
> resolve_jira_issue(merge_branches, comment, jira_id)
>   File "./dev/merge_spark_pr_jira.py", line 302, in resolve_jira_issue
> jira_fix_versions = map(lambda v: get_version_json(v), fix_versions)
>   File "./dev/merge_spark_pr_jira.py", line 302, in 
> jira_fix_versions = map(lambda v: get_version_json(v), fix_versions)
>   File "./dev/merge_spark_pr_jira.py", line 300, in get_version_json
> return filter(lambda v: v.name == version_str, versions)[0].raw
> IndexError: list index out of range
> {code}
> I typed the fix version wrongly and it ended the loop in the merge script. 
> Not a big deal, but it has bugged me a few times.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27282) Spark incorrect results when using UNION with GROUP BY clause

2019-03-26 Thread Sofia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sofia updated SPARK-27282:
--
Priority: Blocker  (was: Major)

> Spark incorrect results when using UNION with GROUP BY clause
> -
>
> Key: SPARK-27282
> URL: https://issues.apache.org/jira/browse/SPARK-27282
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, SQL
>Affects Versions: 2.3.2
> Environment: I'm using :
> IntelliJ  IDEA ==> 2018.1.4
> spark-sql and spark-core ==> 2.3.2.3.1.0.0-78 (for HDP 3.1)
> scala ==> 2.11.8
>Reporter: Sofia
>Priority: Blocker
>
> When using a UNION clause after a GROUP BY clause in Spark, the results 
> obtained are wrong.
> The following example illustrates this issue:
> {code:java}
> CREATE TABLE test_un (
> col1 varchar(255),
> col2 varchar(255),
> col3 varchar(255),
> col4 varchar(255)
> );
> INSERT INTO test_un (col1, col2, col3, col4)
> VALUES (1,1,2,4),
> (1,1,2,4),
> (1,1,3,5),
> (2,2,2,null);
> {code}
> I used the following code :
> {code:java}
> val x = Toolkit.HiveToolkit.getDataFromHive("test","test_un")
> val  y = x
>.filter(col("col4")isNotNull)
>   .groupBy("col1", "col2","col3")
>   .agg(count(col("col3")).alias("cnt"))
>   .withColumn("col_name", lit("col3"))
>   .select(col("col1"), col("col2"), 
> col("col_name"),col("col3").alias("col_value"), col("cnt"))
> val z = x
>   .filter(col("col4")isNotNull)
>   .groupBy("col1", "col2","col4")
>   .agg(count(col("col4")).alias("cnt"))
>   .withColumn("col_name", lit("col4"))
>   .select(col("col1"), col("col2"), 
> col("col_name"),col("col4").alias("col_value"), col("cnt"))
> y.union(z).show()
> {code}
>  And I obtained the following results:
> ||col1||col2||col_name||col_value||cnt||
> |1|1|col3|5|1|
> |1|1|col3|4|2|
> |1|1|col4|5|1|
> |1|1|col4|4|2|
> Expected results:
> ||col1||col2||col_name||col_value||cnt||
> |1|1|col3|3|1|
> |1|1|col3|2|2|
> |1|1|col4|4|2|
> |1|1|col4|5|1|
> But when I remove the last row of the table, I obtain the correct results.
> {code:java}
> (2,2,2,null){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27283) BigDecimal arithmetic losing precision

2019-03-26 Thread Mats (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801532#comment-16801532
 ] 

Mats commented on SPARK-27283:
--

I tried this in MySQL, which does not reproduce Spark SQL's behaviour:

(foo is a table with a column num of type decimal(18,4))
{code:java}
CREATE TABLE foo2 AS 
SELECT num + 1.0 AS num 
FROM foo;
{code}
This creates a table foo2 with a column num of type decimal(19,4).

 

> BigDecimal arithmetic losing precision
> --
>
> Key: SPARK-27283
> URL: https://issues.apache.org/jira/browse/SPARK-27283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Mats
>Priority: Minor
>  Labels: decimal, float, sql
>
> When performing arithmetic between doubles and decimals, the resulting value 
> is always a double. This is very strange to me; when an exact type is present 
> as one of the inputs, I would expect that the inexact type is lifted and the 
> result presented exactly, rather than lowering the exact type to the inexact 
> and presenting a result that may contain rounding errors. The choice to use a 
> decimal was probably taken because rounding errors were deemed an issue.
> When performing arithmetic between decimals and integers, the expected 
> behaviour is seen; the result is a decimal.
> See the following example:
> {code:java}
> import org.apache.spark.sql.functions
> val df = sparkSession.createDataFrame(Seq(Tuple1(0L))).toDF("a")
> val decimalInt = df.select(functions.lit(BigDecimal(3.14)) + functions.lit(1) 
> as "d")
> val decimalDouble = df.select(functions.lit(BigDecimal(3.14)) + 
> functions.lit(1.0) as "d")
> decimalInt.schema.printTreeString()
> decimalInt.show()
> decimalDouble.schema.printTreeString()
> decimalDouble.show(){code}
> which produces this output (with possible variation on the rounding error):
> {code:java}
> root
> |-- d: decimal(4,2) (nullable = true)
> ++
> | d  |
> ++
> |4.14|
> ++
> root
> |-- d: double (nullable = false)
> +-+
> | d   |
> +-+
> |4.141|
> +-+
> {code}
>  
>  I would argue that this is a bug, and that the correct thing to do would be 
> to lift the result to a decimal also when one operand is a double.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27245) Optimizer repeat Python UDF calls

2019-03-26 Thread Andrea Rota (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrea Rota resolved SPARK-27245.
-
Resolution: Duplicate

Duplicate issue; the behavior is by design, as the optimizer might choose to 
collapse projects and call UDFs more than once. SPARK-17728 suggests a 
workaround using explode to isolate part of the DAG from the point of view of 
the optimizer.

> Optimizer repeat Python UDF calls
> -
>
> Key: SPARK-27245
> URL: https://issues.apache.org/jira/browse/SPARK-27245
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.3.1, 2.3.2, 2.4.0
> Environment: Tested both on Linux and Windows, on my computer and on 
> Databricks.
> Spark version: 2.3.1
> Python version: 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) 
> I tried different releases of Spark too (2.4.0, 2.3.2), the behaviour 
> persists.
>Reporter: Andrea Rota
>Priority: Major
>  Labels: optimizer, performance, planner, pyspark, udf
>
> The physical plan proposed by the .explain() method shows an inefficient way to 
> call Python UDFs in PySpark.
> This behaviour takes place under these circumstances:
>  * PySpark API
>  * At least one operation in the DAG that uses the result of the Python UDF
> My expectation is that the optimizer should call the Python UDF once with 
> BatchEvalPython and then reuse the result across the following steps.
> Instead, the optimizer calls the same UDF n times, with the same parameters, 
> within the same BatchEvalPython, and only uses one of the result columns 
> (PythonUDF2#16) while discarding the others.
> I believe that this could lead to poor performance due to the large data 
> exchange with Python processes and the additional calls.
> Example code:
> {code:python}
> foo_udf = f.udf(lambda x: 1, IntegerType())
> df = spark.createDataFrame([['bar']]) \
> .withColumn('result', foo_udf(f.col('_1'))) \
> .withColumn('a', f.col('result')) \
> .withColumn('b', f.col('result'))
> df.explain()
> {code}
> {code}
> == Physical Plan ==
> *(1) Project [_1#0, pythonUDF2#16 AS result#2, pythonUDF2#16 AS a#5, 
> pythonUDF2#16 AS b#9]
> +- BatchEvalPython [(_1#0), (_1#0), (_1#0)], [_1#0, 
> pythonUDF0#14, pythonUDF1#15, pythonUDF2#16]
>+- Scan ExistingRDD[_1#0]
> {code}
> Full code on Gist: 
> [https://gist.github.com/andrearota/f77b6a293421a3f26dd5d2fb0a04046e]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27283) BigDecimal arithmetic losing precision

2019-03-26 Thread Mats (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mats updated SPARK-27283:
-
Description: 
When performing arithmetic between doubles and decimals, the resulting value 
is always a double. This is very strange to me; when an exact type is present 
as one of the inputs, I would expect that the inexact type is lifted and the 
result presented exactly, rather than lowering the exact type to the inexact 
and presenting a result that may contain rounding errors. The choice to use a 
decimal was probably taken because rounding errors were deemed an issue.

When performing arithmetic between decimals and integers, the expected 
behaviour is seen; the result is a decimal.

See the following example:
{code:java}
import org.apache.spark.sql.functions
val df = sparkSession.createDataFrame(Seq(Tuple1(0L))).toDF("a")

val decimalInt = df.select(functions.lit(BigDecimal(3.14)) + functions.lit(1) 
as "d")
val decimalDouble = df.select(functions.lit(BigDecimal(3.14)) + 
functions.lit(1.0) as "d")

decimalInt.schema.printTreeString()
decimalInt.show()
decimalDouble.schema.printTreeString()
decimalDouble.show(){code}
which produces this output (with possible variation on the rounding error):
{code:java}
root
|-- d: decimal(4,2) (nullable = true)

++
| d  |
++
|4.14|
++

root
|-- d: double (nullable = false)

+-+
| d   |
+-+
|4.141|
+-+
{code}
 

 I would argue that this is a bug, and that the correct thing to do would be to 
lift the result to a decimal also when one operand is a double.

  was:
When performing arithmetic between doubles and decimals, the resulting value 
is always a double. This is very strange to me; when an exact type is present 
as one of the inputs, I would expect that the inexact type is lifted and the 
result presented exactly, rather than lowering the exact type to the inexact 
and presenting a result that may contain rounding errors. The choice to use a 
decimal was probably taken because rounding errors were deemed an issue.

When performing arithmetic between decimals and integers, the expected 
behaviour is seen; the result is a decimal.

See the following example:
{code:java}
import org.apache.spark.sql.functions
val df = sparkSession.createDataFrame(Seq(Tuple1(0L))).toDF("a")

val decimalInt = df.select(functions.lit(BigDecimal(3.14)) + functions.lit(1) 
as "d")
val decimalDouble = df.select(functions.lit(BigDecimal(3.14)) + 
functions.lit(1.0) as "d")

decimalInt.schema.printTreeString()
decimalInt.show()
decimalDouble.schema.printTreeString()
decimalDouble.show(){code}
which produces this output (with possible variation on the rounding error):


{code:java}
root
|-- d: decimal(4,2) (nullable = true)

++
| d  |
++
|4.14|
++

root
|-- d: double (nullable = false)

+-+
| d   |
+-+
|4.141|
+-+
{code}
 

 


> BigDecimal arithmetic losing precision
> --
>
> Key: SPARK-27283
> URL: https://issues.apache.org/jira/browse/SPARK-27283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Mats
>Priority: Minor
>  Labels: decimal, float, sql
>
> When performing arithmetic between doubles and decimals, the resulting value 
> is always a double. This is very strange to me; when an exact type is present 
> as one of the inputs, I would expect that the inexact type is lifted and the 
> result presented exactly, rather than lowering the exact type to the inexact 
> and presenting a result that may contain rounding errors. The choice to use a 
> decimal was probably taken because rounding errors were deemed an issue.
> When performing arithmetic between decimals and integers, the expected 
> behaviour is seen; the result is a decimal.
> See the following example:
> {code:java}
> import org.apache.spark.sql.functions
> val df = sparkSession.createDataFrame(Seq(Tuple1(0L))).toDF("a")
> val decimalInt = df.select(functions.lit(BigDecimal(3.14)) + functions.lit(1) 
> as "d")
> val decimalDouble = df.select(functions.lit(BigDecimal(3.14)) + 
> functions.lit(1.0) as "d")
> decimalInt.schema.printTreeString()
> decimalInt.show()
> decimalDouble.schema.printTreeString()
> decimalDouble.show(){code}
> which produces this output (with possible variation on the rounding error):
> {code:java}
> root
> |-- d: decimal(4,2) (nullable = true)
> ++
> | d  |
> ++
> |4.14|
> ++
> root
> |-- d: double (nullable = false)
> +-+
> | d   |
> +-+
> |4.141|
> +-+
> {code}
>  
>  I would argue that this is a bug, and that the correct thing to do would be 
> to lift the result to a decimal also when one ope

[jira] [Updated] (SPARK-27283) BigDecimal arithmetic losing precision

2019-03-26 Thread Mats (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mats updated SPARK-27283:
-
Issue Type: Bug  (was: Question)

> BigDecimal arithmetic losing precision
> --
>
> Key: SPARK-27283
> URL: https://issues.apache.org/jira/browse/SPARK-27283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Mats
>Priority: Minor
>  Labels: decimal, float, sql
>
> When performing arithmetic between doubles and decimals, the resulting value 
> is always a double. This is very strange to me; when an exact type is present 
> as one of the inputs, I would expect that the inexact type is lifted and the 
> result presented exactly, rather than lowering the exact type to the inexact 
> and presenting a result that may contain rounding errors. The choice to use a 
> decimal was probably taken because rounding errors were deemed an issue.
> When performing arithmetic between decimals and integers, the expected 
> behaviour is seen; the result is a decimal.
> See the following example:
> {code:java}
> import org.apache.spark.sql.functions
> val df = sparkSession.createDataFrame(Seq(Tuple1(0L))).toDF("a")
> val decimalInt = df.select(functions.lit(BigDecimal(3.14)) + functions.lit(1) 
> as "d")
> val decimalDouble = df.select(functions.lit(BigDecimal(3.14)) + 
> functions.lit(1.0) as "d")
> decimalInt.schema.printTreeString()
> decimalInt.show()
> decimalDouble.schema.printTreeString()
> decimalDouble.show(){code}
> which produces this output (with possible variation on the rounding error):
> {code:java}
> root
> |-- d: decimal(4,2) (nullable = true)
> ++
> | d  |
> ++
> |4.14|
> ++
> root
> |-- d: double (nullable = false)
> +-+
> | d   |
> +-+
> |4.141|
> +-+
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27283) BigDecimal arithmetic losing precision

2019-03-26 Thread Mats (JIRA)
Mats created SPARK-27283:


 Summary: BigDecimal arithmetic losing precision
 Key: SPARK-27283
 URL: https://issues.apache.org/jira/browse/SPARK-27283
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 2.4.0
Reporter: Mats


When performing arithmetic between doubles and decimals, the resulting value 
is always a double. This is very strange to me; when an exact type is present 
as one of the inputs, I would expect that the inexact type is lifted and the 
result presented exactly, rather than lowering the exact type to the inexact 
and presenting a result that may contain rounding errors. The choice to use a 
decimal was probably taken because rounding errors were deemed an issue.

When performing arithmetic between decimals and integers, the expected 
behaviour is seen; the result is a decimal.

See the following example:
{code:java}
import org.apache.spark.sql.functions
val df = sparkSession.createDataFrame(Seq(Tuple1(0L))).toDF("a")

val decimalInt = df.select(functions.lit(BigDecimal(3.14)) + functions.lit(1) 
as "d")
val decimalDouble = df.select(functions.lit(BigDecimal(3.14)) + 
functions.lit(1.0) as "d")

decimalInt.schema.printTreeString()
decimalInt.show()
decimalDouble.schema.printTreeString()
decimalDouble.show(){code}
which produces this output (with possible variation on the rounding error):


{code:java}
root
|-- d: decimal(4,2) (nullable = true)

++
| d  |
++
|4.14|
++

root
|-- d: double (nullable = false)

+-+
| d   |
+-+
|4.141|
+-+
{code}
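
A hedged workaround sketch, assuming an explicit cast is acceptable: casting the 
double operand to a decimal keeps the arithmetic in decimal (the decimal(10,4) target 
type is an arbitrary choice, not a recommendation):
{code:java}
// Hedged sketch: force the double literal into a decimal before the addition so the
// result stays a decimal; the chosen precision/scale is arbitrary.
import org.apache.spark.sql.functions

val df2 = sparkSession.createDataFrame(Seq(Tuple1(0L))).toDF("a")
val decimalExplicit = df2.select(
  functions.lit(BigDecimal(3.14)) + functions.lit(1.0).cast("decimal(10,4)") as "d")

decimalExplicit.schema.printTreeString()
decimalExplicit.show()
{code}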
 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27282) Spark incorrect results when using UNION with GROUP BY clause

2019-03-26 Thread Sofia (JIRA)
Sofia created SPARK-27282:
-

 Summary: Spark incorrect results when using UNION with GROUP BY 
clause
 Key: SPARK-27282
 URL: https://issues.apache.org/jira/browse/SPARK-27282
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Spark Submit, SQL
Affects Versions: 2.3.2
 Environment: I'm using :

IntelliJ  IDEA ==> 2018.1.4

spark-sql and spark-core ==> 2.3.2.3.1.0.0-78 (for HDP 3.1)

scala ==> 2.11.8
Reporter: Sofia


When using a UNION clause after a GROUP BY clause in Spark, the results obtained 
are wrong.

The following example illustrates this issue:
{code:java}
CREATE TABLE test_un (
col1 varchar(255),
col2 varchar(255),
col3 varchar(255),
col4 varchar(255)
);

INSERT INTO test_un (col1, col2, col3, col4)
VALUES (1,1,2,4),
(1,1,2,4),
(1,1,3,5),
(2,2,2,null);
{code}
I used the following code :
{code:java}
val x = Toolkit.HiveToolkit.getDataFromHive("test","test_un")

val  y = x
   .filter(col("col4")isNotNull)
  .groupBy("col1", "col2","col3")
  .agg(count(col("col3")).alias("cnt"))
  .withColumn("col_name", lit("col3"))
  .select(col("col1"), col("col2"), 
col("col_name"),col("col3").alias("col_value"), col("cnt"))

val z = x
  .filter(col("col4")isNotNull)
  .groupBy("col1", "col2","col4")
  .agg(count(col("col4")).alias("cnt"))
  .withColumn("col_name", lit("col4"))
  .select(col("col1"), col("col2"), 
col("col_name"),col("col4").alias("col_value"), col("cnt"))

y.union(z).show()
{code}
 And I obtained the following results:
||col1||col2||col_name||col_value||cnt||
|1|1|col3|5|1|
|1|1|col3|4|2|
|1|1|col4|5|1|
|1|1|col4|4|2|

Expected results:
||col1||col2||col_name||col_value||cnt||
|1|1|col3|3|1|
|1|1|col3|2|2|
|1|1|col4|4|2|
|1|1|col4|5|1|

But when I remove the last row of the table, I obtain the correct results.
{code:java}
(2,2,2,null){code}
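
For reference, a self-contained sketch of the same pipeline, assuming a local 
SparkSession named spark; it replaces the Hive-backed Toolkit helper with an in-memory 
DataFrame holding the rows above, so the behaviour can be checked without a Hive table:
{code:java}
// Hedged repro sketch: same aggregations and union as above, on an in-memory DataFrame.
import org.apache.spark.sql.functions._
import spark.implicits._

val x = Seq(
  ("1", "1", "2", "4"),
  ("1", "1", "2", "4"),
  ("1", "1", "3", "5"),
  ("2", "2", "2", null)
).toDF("col1", "col2", "col3", "col4")

val y = x
  .filter(col("col4").isNotNull)
  .groupBy("col1", "col2", "col3")
  .agg(count(col("col3")).alias("cnt"))
  .withColumn("col_name", lit("col3"))
  .select(col("col1"), col("col2"), col("col_name"), col("col3").alias("col_value"), col("cnt"))

val z = x
  .filter(col("col4").isNotNull)
  .groupBy("col1", "col2", "col4")
  .agg(count(col("col4")).alias("cnt"))
  .withColumn("col_name", lit("col4"))
  .select(col("col1"), col("col2"), col("col_name"), col("col4").alias("col_value"), col("cnt"))

y.union(z).show()
{code}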



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27281) Wrong latest offsets returned by DirectKafkaInputDStream#latestOffsets

2019-03-26 Thread Viacheslav Krot (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viacheslav Krot updated SPARK-27281:

Description: 
I have a very strange and hard-to-reproduce issue when using Kafka direct 
streaming, version 2.4.0.
From time to time, maybe once a day to once a week, I get the following error:
{noformat}
java.lang.IllegalArgumentException: requirement failed: numRecords must not be 
negative
at scala.Predef$.require(Predef.scala:224)
at 
org.apache.spark.streaming.scheduler.StreamInputInfo.(InputInfoTracker.scala:38)
at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:250)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331)
at 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:122)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:121)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:121)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
at 
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
19/01/29 13:10:00 ERROR apps.BusinessRuleEngine: Job failed. Stopping JVM
java.lang.IllegalArgumentException: requirement failed: numRecords must not be 
negative
at scala.Predef$.require(Predef.scala:224)
at 
org.apache.spark.streaming.scheduler.StreamInputInfo.(InputInfoTracker.scala:38)
at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:250)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331)
at 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:122)
at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:121)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.mu

[jira] [Commented] (SPARK-15544) Bouncing Zookeeper node causes Active spark master to exit

2019-03-26 Thread Moein Hosseini (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801484#comment-16801484
 ] 

Moein Hosseini commented on SPARK-15544:


[~kabhwan] I did some work on this issue, but it's not well tested yet.

> Bouncing Zookeeper node causes Active spark master to exit
> --
>
> Key: SPARK-15544
> URL: https://issues.apache.org/jira/browse/SPARK-15544
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
> Environment: Ubuntu 14.04.  Zookeeper 3.4.6 with 3-node quorum
>Reporter: Steven Lowenthal
>Priority: Major
>
> Shutting down a single ZooKeeper node caused the active Spark master to exit. The 
> master should have connected to a second ZooKeeper node. 
> {code:title=log output}
> 16/05/25 18:21:28 INFO master.Master: Launching executor 
> app-20160525182128-0006/1 on worker worker-20160524013212-10.16.28.76-59138
> 16/05/25 18:21:28 INFO master.Master: Launching executor 
> app-20160525182128-0006/2 on worker worker-20160524013204-10.16.21.217-47129
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data 
> from server sessionid 0x154dfc0426b0054, likely server has closed socket, 
> closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO zookeeper.ClientCnxn: Unable to read additional data 
> from server sessionid 0x254c701f28d0053, likely server has closed socket, 
> closing socket connection and attempting reconnect
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO state.ConnectionStateManager: State change: SUSPENDED
> 16/05/26 00:16:01 INFO master.ZooKeeperLeaderElectionAgent: We have lost 
> leadership
> 16/05/26 00:16:01 ERROR master.Master: Leadership has been revoked -- master 
> shutting down. }}
> {code}
> spark-env.sh: 
> {code:title=spark-env.sh}
> export SPARK_LOCAL_DIRS=/ephemeral/spark/local
> export SPARK_WORKER_DIR=/ephemeral/spark/work
> export SPARK_LOG_DIR=/var/log/spark
> export HADOOP_CONF_DIR=/home/ubuntu/hadoop-2.6.3/etc/hadoop
> export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER 
> -Dspark.deploy.zookeeper.url=gn5456-zookeeper-01:2181,gn5456-zookeeper-02:2181,gn5456-zookeeper-03:2181"
> export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27281) Wrong latest offsets returned by DirectKafkaInputDStream#latestOffsets

2019-03-26 Thread Viacheslav Krot (JIRA)
Viacheslav Krot created SPARK-27281:
---

 Summary: Wrong latest offsets returned by 
DirectKafkaInputDStream#latestOffsets
 Key: SPARK-27281
 URL: https://issues.apache.org/jira/browse/SPARK-27281
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.4.0
Reporter: Viacheslav Krot


I have a very strange and hard-to-reproduce issue when using Kafka direct 
streaming, version 2.4.0.
From time to time, maybe once a day to once a week, I get the following error: 
```
java.lang.IllegalArgumentException: requirement failed: numRecords must not be 
negative
 at scala.Predef$.require(Predef.scala:224)
 at 
org.apache.spark.streaming.scheduler.StreamInputInfo.(InputInfoTracker.scala:38)
 at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:250)
 at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
 at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
 at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
 at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
 at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
 at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336)
 at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
 at scala.Option.orElse(Option.scala:289)
 at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331)
 at 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
 at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:122)
 at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:121)
 at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
 at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
 at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
 at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
 at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:121)
 at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
 at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
 at scala.util.Try$.apply(Try.scala:192)
 at 
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
 at 
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
 at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
 at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
19/01/29 13:10:00 ERROR apps.BusinessRuleEngine: Job failed. Stopping JVM
java.lang.IllegalArgumentException: requirement failed: numRecords must not be 
negative
 at scala.Predef$.require(Predef.scala:224)
 at 
org.apache.spark.streaming.scheduler.StreamInputInfo.(InputInfoTracker.scala:38)
 at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:250)
 at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
 at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
 at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
 at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
 at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
 at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336)
 at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
 at scala.Option.orElse(Option.scala:289)
 at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331)
 at 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
 at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:122)
 at 
org.apache.spark.streaming.DStrea

[jira] [Reopened] (SPARK-27204) First time Loading application page from History Server is taking time when event log size is huge

2019-03-26 Thread ABHISHEK KUMAR GUPTA (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK KUMAR GUPTA reopened SPARK-27204:
--

> First time Loading application page from History Server is taking time when 
> event log size is huge
> --
>
> Key: SPARK-27204
> URL: https://issues.apache.org/jira/browse/SPARK-27204
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.3, 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> 1. Launch spark shell and submit a long-running job.
> 2. Measure the loading time of the Job History application page the first time 
> it is opened.
> 3. For example, with an event log size of 18 GB and the disk store, the first 
> load of the application page takes 47 minutes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27267) spark 2.4 use 1.1.7.x snappy-java, but its behavior is different from 1.1.2.x

2019-03-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27267:
-
Target Version/s:   (was: 2.4.0)

> spark 2.4 use  1.1.7.x  snappy-java, but its behavior is different from 
> 1.1.2.x 
> 
>
> Key: SPARK-27267
> URL: https://issues.apache.org/jira/browse/SPARK-27267
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.4.0
> Environment: spark.rdd.compress=true
> spark.io.compression.codec =snappy
> spark 2.4 in hadoop 2.6 with hive
>Reporter: Max  Xie
>Priority: Minor
>
> I use pyspark like this:
> ```
> from pyspark.storagelevel import StorageLevel
> df=spark.sql("select * from xzn.person")
> df.persist(StorageLevel(False, True, False, False))
> df.count()
> ```
> Table person is a simple table stored as ORC files, and some of the ORC files are 
> empty. When I run the query, it throws this error: 
> ```
> 19/03/22 21:46:31 INFO MemoryStore:54 - Block rdd_2_1 stored as values in 
> memory (estimated size 0.0 B, free 1662.6 MB)
> 19/03/22 21:46:31 INFO FileScanRDD:54 - Reading File path: 
> viewfs://name/xzn.db/person/part-00011, range: 0-49, partition values: [empty 
> row]
> 19/03/22 21:46:31 INFO FileScanRDD:54 - Reading File path: 
> viewfs://name/xzn.db/person/part-00011_copy_1, range: 0-49, partition values: 
> [empty row]
> 19/03/22 21:46:31 INFO FileScanRDD:54 - Reading File path: 
> viewfs://name/xzn.db/person/part-00012, range: 0-49, partition values: [empty 
> row]
> 19/03/22 21:46:31 INFO FileScanRDD:54 - Reading File path: 
> viewfs://name/xzn.db/person/part-00012_copy_1, range: 0-49, partition values: 
> [empty row]
> 19/03/22 21:46:31 INFO FileScanRDD:54 - Reading File path: 
> viewfs://name/xzn.db/person/part-00013, range: 0-49, partition values: [empty 
> row]
> 19/03/22 21:46:31 ERROR Executor:91 - Exception in task 1.0 in stage 0.0 (TID 
> 1)
> org.xerial.snappy.SnappyIOException: [EMPTY_INPUT] Cannot decompress empty 
> stream
>  at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:94)
>  at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:59)
>  at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:164)
>  at 
> org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:163)
>  at 
> org.apache.spark.serializer.SerializerManager.dataDeserializeStream(SerializerManager.scala:209)
>  at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:596)
>  at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:886)
>  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
>  at org.apache.spark.scheduler.Task.run(Task.scala:121)
>  at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> ```
> After searching, I found that snappy-java 1.1.7.x's behavior is different 
> from 1.1.2.x (the version that Spark 2.0.2 uses). SnappyOutputStream in 
> 1.1.2.x always writes a snappy header whether or not a value is written, 
> but SnappyOutputStream in 1.1.7.x does not generate a header if no value is 
> written into it. So in Spark 2.4, if an RDD caches an empty value, the 
> MemoryStore will not cache any bytes (no snappy header), and it will then 
> throw the empty-stream error. 
>  
> Maybe we can change SnappyOutputSt
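
A hedged sketch of the difference being described, assuming snappy-java is on the 
classpath; per this report, a stream that is closed without any writes yields zero 
bytes on 1.1.7.x, and reading it back fails exactly as in the stack trace above:
{code:java}
// Hedged sketch only: behaviour as described in this report, not verified here.
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.xerial.snappy.{SnappyInputStream, SnappyOutputStream}

val buffer = new ByteArrayOutputStream()
new SnappyOutputStream(buffer).close()                    // nothing written
val payload = buffer.toByteArray                          // reportedly empty on 1.1.7.x, header-only on 1.1.2.x
new SnappyInputStream(new ByteArrayInputStream(payload))  // reportedly throws EMPTY_INPUT on an empty payload
{code}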

[jira] [Commented] (SPARK-27239) Processing Compressed HDFS files with spark failing with error: "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot be negative" from spark 2.2

2019-03-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801456#comment-16801456
 ] 

Hyukjin Kwon commented on SPARK-27239:
--

Doesn't matter which one is resolved as a duplicate.

> Processing Compressed HDFS files with spark failing with error: 
> "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot 
> be negative" from spark 2.2.X
> -
>
> Key: SPARK-27239
> URL: https://issues.apache.org/jira/browse/SPARK-27239
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Rajesh Kumar K
>Priority: Major
>
>  
> From Spark 2.2.x onwards, when a Spark job processes compressed HDFS files 
> with a custom input file format, the job fails with the error 
> "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot 
> be negative". The custom input file format returns -1 as the byte length for 
> compressed file formats because compressed HDFS files are non-splittable, so 
> for such formats the split has offset 0 and a byte length of -1. Spark should 
> consider a byte length of -1 a valid split for compressed file formats.
>  
> We observed that earlier versions of Spark do not have this validation, and 
> found that from Spark 2.2.x a new validation was introduced in the class 
> InputFileBlockHolder, so Spark should accept a byte length of -1 as a valid 
> length for input splits from 2.2.x onwards as well.
>  
> +Below is the stack trace.+
>  Caused by: java.lang.IllegalArgumentException: requirement failed: length 
> (-1) cannot be negative
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.rdd.InputFileBlockHolder$.set(InputFileBlockHolder.scala:70)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:226)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>  
> +Below is the code snippet which caused this issue.+
>    **    {color:#ff}require(length >= 0, s"length ($length) cannot be 
> negative"){color} // This validation caused the issue. 
>  
> {code:java}
> // code placeholder
>  org.apache.spark.rdd.InputFileBlockHolder - spark-core
>  
> def set(filePath: String, startOffset: Long, length: Long): Unit = {
>     require(filePath != null, "filePath cannot be null")
>     require(startOffset >= 0, s"startOffset ($startOffset) cannot be 
> negative")
>     require(length >= 0, s"length ($length) cannot be negative")  
>     inputBlock.set(new FileBlock(UTF8String.fromString(filePath), 
> startOffset, length))
>   }
> {code}
>  
> +Steps to reproduce the issue.+
>  Please refer to the code below to reproduce the issue.  
> {code:java}
> // code placeholder
> import org.apache.hadoop.mapred.JobConf
> val hadoopConf = new JobConf()
> import org.apache.hadoop.mapred.FileInputFormat
> import org.apache.hadoop.fs.Path
> FileInputFormat.setInputPaths(hadoopConf, new 
> Path("/output656/part-r-0.gz"))    
> val records = 
> sc.hadoopRDD(hadoopConf,classOf[com.platform.custom.storagehandler.INFAInputFormat],
>  classOf[org.apache.hadoop.io.LongWritable], 
> classOf[org.apache.hadoop.io.Writable]) 
> records.count()
> {code}
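
A hedged sketch of the relaxation the report is asking for (not the actual Spark code), 
treating -1 as "length unknown" for non-splittable compressed inputs:
{code:java}
// Hedged sketch only: the shape of the check the reporter would like InputFileBlockHolder
// to perform, where a byte length of -1 means "unknown" for non-splittable compressed files.
def set(filePath: String, startOffset: Long, length: Long): Unit = {
  require(filePath != null, "filePath cannot be null")
  require(startOffset >= 0, s"startOffset ($startOffset) cannot be negative")
  require(length >= -1, s"length ($length) cannot be less than -1") // -1 == unknown length
  // ... store the block info as InputFileBlockHolder does
}
{code}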
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27239) Processing Compressed HDFS files with spark failing with error: "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot be negative" from spark 2.2.

2019-03-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27239.
--
Resolution: Duplicate

> Processing Compressed HDFS files with spark failing with error: 
> "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot 
> be negative" from spark 2.2.X
> -
>
> Key: SPARK-27239
> URL: https://issues.apache.org/jira/browse/SPARK-27239
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Rajesh Kumar K
>Priority: Major
>
>  
> From Spark 2.2.x onwards, when a Spark job processes compressed HDFS files 
> with a custom input file format, the job fails with the error 
> "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot 
> be negative". The custom input file format returns -1 as the byte length for 
> compressed file formats because compressed HDFS files are non-splittable, so 
> for such formats the split has offset 0 and a byte length of -1. Spark should 
> consider a byte length of -1 a valid split for compressed file formats.
>  
> We observed that earlier versions of Spark do not have this validation, and 
> found that from Spark 2.2.x a new validation was introduced in the class 
> InputFileBlockHolder, so Spark should accept a byte length of -1 as a valid 
> length for input splits from 2.2.x onwards as well.
>  
> +Below is the stack trace.+
>  Caused by: java.lang.IllegalArgumentException: requirement failed: length 
> (-1) cannot be negative
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.rdd.InputFileBlockHolder$.set(InputFileBlockHolder.scala:70)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:226)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>  
> +Below is the code snippet which caused this issue.+
>  The failing validation is {{require(length >= 0, s"length ($length) cannot be negative")}}.
>  
> {code:java}
> // org.apache.spark.rdd.InputFileBlockHolder (spark-core)
> def set(filePath: String, startOffset: Long, length: Long): Unit = {
>   require(filePath != null, "filePath cannot be null")
>   require(startOffset >= 0, s"startOffset ($startOffset) cannot be negative")
>   require(length >= 0, s"length ($length) cannot be negative")  // this validation fails when length = -1
>   inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, length))
> }
> {code}
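>  
> A hypothetical sketch of the relaxation being asked for here (not the actual Spark 
> patch; {{validateSplit}} is an illustrative name, not a Spark API): accept -1 as a 
> sentinel meaning "block length unknown", which is what non-splittable compressed 
> inputs report, instead of failing the job.
> {code:java}
> // Standalone sketch only; treats -1 as "length unknown" rather than as an error.
> def validateSplit(startOffset: Long, length: Long): Unit = {
>   require(startOffset >= 0, s"startOffset ($startOffset) cannot be negative")
>   require(length >= 0 || length == -1,
>     s"length ($length) must be non-negative, or -1 if the block length is unknown")
> }
>
> validateSplit(0L, -1L)   // accepted: whole compressed file read as a single split
> // validateSplit(0L, -2L) would still throw IllegalArgumentException
> {code}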
>  
> +Steps to reproduce the issue.+
>  Refer to the code below to reproduce the issue.
> {code:java}
> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
>
> val hadoopConf = new JobConf()
> FileInputFormat.setInputPaths(hadoopConf, new Path("/output656/part-r-0.gz"))
>
> // Read the compressed file through the custom input format; count() triggers the failure.
> val records = sc.hadoopRDD(hadoopConf,
>   classOf[com.platform.custom.storagehandler.INFAInputFormat],
>   classOf[org.apache.hadoop.io.LongWritable],
>   classOf[org.apache.hadoop.io.Writable])
> records.count()
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27280) infer filters from Join's OR condition

2019-03-26 Thread Song Jun (JIRA)
Song Jun created SPARK-27280:


 Summary: infer filters from Join's OR condition
 Key: SPARK-27280
 URL: https://issues.apache.org/jira/browse/SPARK-27280
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer, SQL
Affects Versions: 3.0.0
Reporter: Song Jun


In some cases, we can infer filters from a join condition that contains OR expressions.

For example, TPC-DS query 48:

{code:java}
select sum(ss_quantity)
from store_sales, store, customer_demographics, customer_address, date_dim
where s_store_sk = ss_store_sk
  and ss_sold_date_sk = d_date_sk and d_year = 2000
  and (
        (cd_demo_sk = ss_cdemo_sk
         and cd_marital_status = 'S'
         and cd_education_status = 'Secondary'
         and ss_sales_price between 100.00 and 150.00)
     or (cd_demo_sk = ss_cdemo_sk
         and cd_marital_status = 'M'
         and cd_education_status = 'College'
         and ss_sales_price between 50.00 and 100.00)
     or (cd_demo_sk = ss_cdemo_sk
         and cd_marital_status = 'U'
         and cd_education_status = '2 yr Degree'
         and ss_sales_price between 150.00 and 200.00)
      )
  and (
        (ss_addr_sk = ca_address_sk
         and ca_country = 'United States'
         and ca_state in ('AL', 'OH', 'MD')
         and ss_net_profit between 0 and 2000)
     or (ss_addr_sk = ca_address_sk
         and ca_country = 'United States'
         and ca_state in ('VA', 'TX', 'IA')
         and ss_net_profit between 150 and 3000)
     or (ss_addr_sk = ca_address_sk
         and ca_country = 'United States'
         and ca_state in ('RI', 'WI', 'KY')
         and ss_net_profit between 50 and 25000)
      )
;
{code}

We can infer two filters from the join's OR condition:

{code:java}
for customer_demographics:
  cd_marital_status in ('S', 'M', 'U') and
  cd_education_status in ('Secondary', 'College', '2 yr Degree')

for store_sales:
  (ss_sales_price between 100.00 and 150.00 or
   ss_sales_price between 50.00 and 100.00 or
   ss_sales_price between 150.00 and 200.00)
{code}

Then we can push these two filters down to customer_demographics and store_sales 
respectively; a minimal sketch of the inference follows below.
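
A minimal sketch of the inference idea, using a toy predicate AST instead of Catalyst 
expressions (all names here, such as {{Leaf}}, {{And}}, {{Or}} and {{inferFilter}}, are 
illustrative assumptions, not Spark internals): for each disjunct of the OR, collect the 
conjuncts that touch only the target table; if every disjunct contributes at least one, 
the OR of those per-disjunct conjunctions can be pushed down to that table.

{code:java}
// Toy predicate AST: a Leaf is a predicate that references a single table.
sealed trait Pred
case class Leaf(table: String, sql: String) extends Pred
case class And(children: Seq[Pred]) extends Pred
case class Or(children: Seq[Pred]) extends Pred

// Infer a table-local filter implied by an OR of conjunctions.
def inferFilter(cond: Or, table: String): Option[String] = {
  val perDisjunct = cond.children.map {
    case And(cs)            => cs.collect { case Leaf(`table`, sql) => sql }
    case Leaf(`table`, sql) => Seq(sql)
    case _                  => Seq.empty[String]
  }
  if (perDisjunct.exists(_.isEmpty)) None  // some disjunct gives no constraint on the table
  else Some(perDisjunct.map(_.mkString("(", " and ", ")")).mkString(" or "))
}

// Shaped like the store_sales branches of the query above:
val joinCond = Or(Seq(
  And(Seq(Leaf("customer_demographics", "cd_marital_status = 'S'"),
          Leaf("store_sales", "ss_sales_price between 100.00 and 150.00"))),
  And(Seq(Leaf("customer_demographics", "cd_marital_status = 'M'"),
          Leaf("store_sales", "ss_sales_price between 50.00 and 100.00"))),
  And(Seq(Leaf("customer_demographics", "cd_marital_status = 'U'"),
          Leaf("store_sales", "ss_sales_price between 150.00 and 200.00")))))

println(inferFilter(joinCond, "store_sales"))
// Some((ss_sales_price between 100.00 and 150.00) or (...) or (...))
{code}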

A PR will be submitted soon.






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27239) Processing Compressed HDFS files with spark failing with error: "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot be negative" from spar

2019-03-26 Thread Rajesh Kumar K (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801449#comment-16801449
 ] 

Rajesh Kumar K edited comment on SPARK-27239 at 3/26/19 7:13 AM:
-

[~hyukjin.kwon]

You resolved my Jira instead of SPARK-27259, which was cloned from this one, so please 
use this Jira, *SPARK-27239*, to track the issue.


was (Author: rkinthali):
[~hyukjin.kwon]

You resolved my Jira instead of SPARK-27259, which was cloned from this one, so please 
use this Jira to track the issue.

> Processing Compressed HDFS files with spark failing with error: 
> "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot 
> be negative" from spark 2.2.X
> -
>
> Key: SPARK-27239
> URL: https://issues.apache.org/jira/browse/SPARK-27239
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Rajesh Kumar K
>Priority: Major
>
>  
> From Spark 2.2.x onwards, Spark jobs that process compressed HDFS files through a 
> custom input file format fail with "java.lang.IllegalArgumentException: requirement 
> failed: length (-1) cannot be negative". The custom input format reports a byte 
> length of -1 for compressed formats because compressed HDFS files are not splittable, 
> so their single split has offset 0 and length -1. Spark should treat a length of -1 
> as a valid split for compressed file formats.
>  
> We observed that earlier Spark versions did not have this validation; it was 
> introduced in the class InputFileBlockHolder in 2.2.x. Spark should therefore accept 
> -1 as a valid input split length from 2.2.x onwards as well.
>  
> +Below is the stack trace.+
>  Caused by: java.lang.IllegalArgumentException: requirement failed: length 
> (-1) cannot be negative
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.rdd.InputFileBlockHolder$.set(InputFileBlockHolder.scala:70)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:226)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>  
> +Below is the code snippet which caused this issue.+
>  The failing validation is {{require(length >= 0, s"length ($length) cannot be negative")}}.
>  
> {code:java}
> // org.apache.spark.rdd.InputFileBlockHolder (spark-core)
> def set(filePath: String, startOffset: Long, length: Long): Unit = {
>   require(filePath != null, "filePath cannot be null")
>   require(startOffset >= 0, s"startOffset ($startOffset) cannot be negative")
>   require(length >= 0, s"length ($length) cannot be negative")  // this validation fails when length = -1
>   inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, length))
> }
> {code}
>  
> +Steps to reproduce the issue.+
>  Refer to the code below to reproduce the issue.
> {code:java}
> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
>
> val hadoopConf = new JobConf()
> FileInputFormat.setInputPaths(hadoopConf, new Path("/output656/part-r-0.gz"))
>
> // Read the compressed file through the custom input format; count() triggers the failure.
> val records = sc.hadoopRDD(hadoopConf,
>   classOf[com.platform.custom.storagehandler.INFAInputFormat],
>   classOf[org.apache.hadoop.io.LongWritable],
>   classOf[org.apache.hadoop.io.Writable])
> records.count()
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27239) Processing Compressed HDFS files with spark failing with error: "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot be negative" from spark 2.2

2019-03-26 Thread Rajesh Kumar K (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801449#comment-16801449
 ] 

Rajesh Kumar K commented on SPARK-27239:


[~hyukjin.kwon]

You resolved my Jira instead of SPARK-27259, which was cloned from this one, so please 
use this Jira to track the issue.

> Processing Compressed HDFS files with spark failing with error: 
> "java.lang.IllegalArgumentException: requirement failed: length (-1) cannot 
> be negative" from spark 2.2.X
> -
>
> Key: SPARK-27239
> URL: https://issues.apache.org/jira/browse/SPARK-27239
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Rajesh Kumar K
>Priority: Major
>
>  
> From Spark 2.2.x onwards, Spark jobs that process compressed HDFS files through a 
> custom input file format fail with "java.lang.IllegalArgumentException: requirement 
> failed: length (-1) cannot be negative". The custom input format reports a byte 
> length of -1 for compressed formats because compressed HDFS files are not splittable, 
> so their single split has offset 0 and length -1. Spark should treat a length of -1 
> as a valid split for compressed file formats.
>  
> We observed that earlier Spark versions did not have this validation; it was 
> introduced in the class InputFileBlockHolder in 2.2.x. Spark should therefore accept 
> -1 as a valid input split length from 2.2.x onwards as well.
>  
> +Below is the stack trace.+
>  Caused by: java.lang.IllegalArgumentException: requirement failed: length 
> (-1) cannot be negative
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.rdd.InputFileBlockHolder$.set(InputFileBlockHolder.scala:70)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:226)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>  
> +Below is the code snippet which caused this issue.+
>  The failing validation is {{require(length >= 0, s"length ($length) cannot be negative")}}.
>  
> {code:java}
> // org.apache.spark.rdd.InputFileBlockHolder (spark-core)
> def set(filePath: String, startOffset: Long, length: Long): Unit = {
>   require(filePath != null, "filePath cannot be null")
>   require(startOffset >= 0, s"startOffset ($startOffset) cannot be negative")
>   require(length >= 0, s"length ($length) cannot be negative")  // this validation fails when length = -1
>   inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, length))
> }
> {code}
>  
> +Steps to reproduce the issue.+
>  Refer to the code below to reproduce the issue.
> {code:java}
> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
>
> val hadoopConf = new JobConf()
> FileInputFormat.setInputPaths(hadoopConf, new Path("/output656/part-r-0.gz"))
>
> // Read the compressed file through the custom input format; count() triggers the failure.
> val records = sc.hadoopRDD(hadoopConf,
>   classOf[com.platform.custom.storagehandler.INFAInputFormat],
>   classOf[org.apache.hadoop.io.LongWritable],
>   classOf[org.apache.hadoop.io.Writable])
> records.count()
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org