[jira] [Updated] (SPARK-25427) Add BloomFilter creation test cases

2018-09-13 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25427:
--
Component/s: Tests

> Add BloomFilter creation test cases
> ---
>
> Key: SPARK-25427
> URL: https://issues.apache.org/jira/browse/SPARK-25427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Spark supports BloomFilter creation for ORC files. This issue aims to add 
> test coverage to prevent regressions like SPARK-12417.






[jira] [Resolved] (SPARK-25418) The metadata of DataSource table should not include Hive-generated storage properties.

2018-09-13 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25418.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> The metadata of DataSource table should not include Hive-generated storage 
> properties.
> --
>
> Key: SPARK-25418
> URL: https://issues.apache.org/jira/browse/SPARK-25418
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Priority: Major
> Fix For: 3.0.0
>
>
> When Hive support is enabled, the Hive catalog puts extra storage properties into 
> the table metadata even for DataSource tables, but we should not have them.






[jira] [Assigned] (SPARK-25418) The metadata of DataSource table should not include Hive-generated storage properties.

2018-09-13 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-25418:
---

Assignee: Takuya Ueshin

> The metadata of DataSource table should not include Hive-generated storage 
> properties.
> --
>
> Key: SPARK-25418
> URL: https://issues.apache.org/jira/browse/SPARK-25418
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.0.0
>
>
> When Hive support is enabled, the Hive catalog puts extra storage properties into 
> the table metadata even for DataSource tables, but we should not have them.






[jira] [Created] (SPARK-25427) Add BloomFilter creation test cases

2018-09-13 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-25427:
-

 Summary: Add BloomFilter creation test cases
 Key: SPARK-25427
 URL: https://issues.apache.org/jira/browse/SPARK-25427
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2, 2.4.0
Reporter: Dongjoon Hyun


Spark supports BloomFilter creation for ORC files. This issue aims to add test 
coverage to prevent regressions like SPARK-12417.
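For context, a minimal sketch (editor's addition, assuming a spark-shell session with the implicit {{spark}}; the output path is a placeholder) of the feature under test, where writer-level options ask ORC to build Bloom filters for selected columns:

{code:scala}
// Hedged sketch: write ORC files with Bloom filters enabled for chosen columns.
val df = spark.range(1000).selectExpr("id", "cast(id as string) as name")
df.write
  .option("orc.bloom.filter.columns", "id,name")  // columns to build Bloom filters for
  .option("orc.bloom.filter.fpp", "0.05")         // target false-positive probability
  .orc("/tmp/orc_with_bloom")                     // placeholder output path
{code}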






[jira] [Updated] (SPARK-24498) Add JDK compiler for runtime codegen

2018-09-13 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24498:

Target Version/s: 3.0.0

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, the JDK compiler can generate smaller bytecode and take less 
> compilation time than Janino; in other cases, Janino is better. We should 
> support both for our runtime codegen, with Janino remaining our default 
> runtime codegen compiler.
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696
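As an illustration of the JDK-compiler alternative described above, here is a minimal sketch (editor's addition, not Spark's actual codegen path; the source path is a placeholder and a full JDK is assumed):

{code:scala}
// Hedged sketch: invoke the JDK's built-in compiler via javax.tools instead of Janino.
import javax.tools.ToolProvider

val compiler = ToolProvider.getSystemJavaCompiler
require(compiler != null, "No system Java compiler found; a JDK (not a bare JRE) is required")

// run() returns 0 on success and a non-zero value if compilation fails.
val exitCode = compiler.run(null, null, null, "/tmp/GeneratedIterator.java")  // placeholder source file
println(s"javac exit code: $exitCode")
{code}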






[jira] [Commented] (SPARK-23906) Add UDF trunc(numeric)

2018-09-13 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614307#comment-16614307
 ] 

Yuming Wang commented on SPARK-23906:
-

cc [~dongjoon]

It's difficult to reuse {{trunc}} for truncating numbers. How about introducing a 
new name, {{truncate}}?
{code:sql}
mysql> SELECT TRUNCATE(1.223,1);
        -> 1.2
mysql> SELECT TRUNCATE(1.999,1);
        -> 1.9
mysql> SELECT TRUNCATE(1.999,0);
        -> 1
mysql> SELECT TRUNCATE(-1.999,1);
        -> -1.9
mysql> SELECT TRUNCATE(122,-2);
        -> 100
mysql> SELECT TRUNCATE(10.28*100,0);
        -> 1028
{code}
[https://dev.mysql.com/doc/refman/8.0/en/mathematical-functions.html#function_truncate]
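For comparison, a minimal sketch (editor's addition, purely illustrative; {{truncate}} below is a hypothetical helper, not an existing Spark function) of the proposed semantics expressed as a plain Scala UDF in a spark-shell session:

{code:scala}
// Hedged sketch: truncate a Double toward zero at scale d, mirroring MySQL's TRUNCATE(x, d).
import org.apache.spark.sql.functions.{lit, udf}
import spark.implicits._

val truncate = udf { (x: Double, d: Int) =>
  val factor = math.pow(10, d)
  math.floor(math.abs(x) * factor) / factor * math.signum(x)  // drop digits, keep the sign
}

Seq(1.223, 1.999, -1.999, 122.0).toDF("x")
  .select($"x", truncate($"x", lit(1)).as("t1"), truncate($"x", lit(-2)).as("tm2"))
  .show()
{code}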

> Add UDF trunc(numeric)
> --
>
> Key: SPARK-23906
> URL: https://issues.apache.org/jira/browse/SPARK-23906
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuming Wang
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-14582
> We already have {{date_trunc}} and {{trunc}}. We need to discuss whether we 
> should introduce a new name or reuse {{trunc}} for truncating numbers.






[jira] [Created] (SPARK-25426) Handles subexpression elimination config inside CodeGeneratorWithInterpretedFallback

2018-09-13 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-25426:


 Summary: Handles subexpression elimination config inside 
CodeGeneratorWithInterpretedFallback
 Key: SPARK-25426
 URL: https://issues.apache.org/jira/browse/SPARK-25426
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Takeshi Yamamuro









[jira] [Updated] (SPARK-25414) make it clear that the numRows metrics should be counted for each scan of the source

2018-09-13 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25414:

Summary: make it clear that the numRows metrics should be counted for each 
scan of the source  (was: The numInputRows metrics can be incorrect for 
streaming self-join)

> make it clear that the numRows metrics should be counted for each scan of the 
> source
> 
>
> Key: SPARK-25414
> URL: https://issues.apache.org/jira/browse/SPARK-25414
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Updated] (SPARK-25414) make it clear that the numRows metrics should be counted for each scan of the source

2018-09-13 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25414:

Issue Type: Test  (was: Bug)

> make it clear that the numRows metrics should be counted for each scan of the 
> source
> 
>
> Key: SPARK-25414
> URL: https://issues.apache.org/jira/browse/SPARK-25414
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Comment Edited] (SPARK-25293) Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx instead of directly saving in outputDir

2018-09-13 Thread omkar puttagunta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614245#comment-16614245
 ] 

omkar puttagunta edited comment on SPARK-25293 at 9/14/18 2:03 AM:
---

[~hyukjin.kwon] I tested with 2.1.3 and got the same issue. My Stack Overflow 
question got answers saying that this is due to the lack of a shared file system. 
Is that the real reason?

I am running Spark in standalone mode, with no HDFS or any other distributed file 
system.

If I use FileOutputCommitter version 2, will I get the desired result?

 

 

[https://stackoverflow.com/questions/52089208/spark-dataframe-write-to-csv-creates-temporary-directory-file-in-standalone-clu]

 


was (Author: omkar999):
[~hyukjin.kwon] tested with 2.1.3, got the  same issue. My stack overflow 
question got answers saying that this is due to lack of  "shared file system". 
Is it the real reason?

If I use the fileOutputCommiter Version 2, will I get the desired result?

 

 

[https://stackoverflow.com/questions/52089208/spark-dataframe-write-to-csv-creates-temporary-directory-file-in-standalone-clu]

 

> Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx 
> instead of directly saving in outputDir
> --
>
> Key: SPARK-25293
> URL: https://issues.apache.org/jira/browse/SPARK-25293
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Java API, Spark Shell, Spark Submit
>Affects Versions: 2.0.2, 2.1.3
>Reporter: omkar puttagunta
>Priority: Major
>
> [https://stackoverflow.com/questions/52108335/why-spark-dataframe-writes-part-files-to-temporary-in-instead-directly-creating]
> {quote}Running Spark 2.0.2 in Standalone Cluster Mode; 2 workers and 1 master 
> node on AWS EC2
> {quote}
> Simple test: read a pipe-delimited file and write the data to CSV. The commands 
> below are executed in spark-shell with the master URL set.
> {{val df = 
> spark.sqlContext.read.option("delimiter","|").option("quote","\u").csv("/home/input-files/")
>  val emailDf=df.filter("_c3='EML'") 
> emailDf.repartition(100).write.csv("/opt/outputFile/")}}
> After executing the commands above:
> {quote}In {{worker1}} -> each part file is created 
> in {{/opt/outputFile/_temporary/task-x-xxx/part-xxx-xxx}}
>  In {{worker2}} -> {{/opt/outputFile/part-xxx}} => part files are generated 
> directly under the outputDirectory specified during the write.
> {quote}
> *The same thing happens with coalesce(100), or without specifying 
> repartition/coalesce at all, and also when tried with Java.*
> *_Question_*
> 1) Why doesn't the {{worker1}} {{/opt/outputFile/}} output directory have 
> {{part-}} files just like {{worker2}}? Why is a {{_temporary}} directory 
> created, with {{part-xxx-xx}} files residing in the {{task-xxx}} directories?






[jira] [Commented] (SPARK-25293) Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx instead of directly saving in outputDir

2018-09-13 Thread omkar puttagunta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614245#comment-16614245
 ] 

omkar puttagunta commented on SPARK-25293:
--

[~hyukjin.kwon] I tested with 2.1.3 and got the same issue. My Stack Overflow 
question got answers saying that this is due to the lack of a shared file system. 
Is that the real reason?

If I use FileOutputCommitter version 2, will I get the desired result?

 

 

[https://stackoverflow.com/questions/52089208/spark-dataframe-write-to-csv-creates-temporary-directory-file-in-standalone-clu]
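For reference, a minimal sketch (editor's addition; the paths are the placeholders from the issue) of how the v2 committer can be switched on to try this out. Whether it changes the observed layout on a standalone cluster without a shared file system is exactly the open question above.

{code:scala}
// Hedged sketch: enable Hadoop's FileOutputCommitter algorithm version 2 for the session.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("committer-v2-sketch")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

spark.read.option("delimiter", "|").csv("/home/input-files/")  // placeholder input path
  .repartition(100)
  .write.csv("/opt/outputFile/")                               // placeholder output path
{code}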

 

> Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx 
> instead of directly saving in outputDir
> --
>
> Key: SPARK-25293
> URL: https://issues.apache.org/jira/browse/SPARK-25293
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Java API, Spark Shell, Spark Submit
>Affects Versions: 2.0.2, 2.1.3
>Reporter: omkar puttagunta
>Priority: Major
>
> [https://stackoverflow.com/questions/52108335/why-spark-dataframe-writes-part-files-to-temporary-in-instead-directly-creating]
> {quote}Running Spark 2.0.2 in Standalone Cluster Mode; 2 workers and 1 master 
> node on AWS EC2
> {quote}
> Simple test: read a pipe-delimited file and write the data to CSV. The commands 
> below are executed in spark-shell with the master URL set.
> {{val df = 
> spark.sqlContext.read.option("delimiter","|").option("quote","\u").csv("/home/input-files/")
>  val emailDf=df.filter("_c3='EML'") 
> emailDf.repartition(100).write.csv("/opt/outputFile/")}}
> After executing the commands above:
> {quote}In {{worker1}} -> each part file is created 
> in {{/opt/outputFile/_temporary/task-x-xxx/part-xxx-xxx}}
>  In {{worker2}} -> {{/opt/outputFile/part-xxx}} => part files are generated 
> directly under the outputDirectory specified during the write.
> {quote}
> *The same thing happens with coalesce(100), or without specifying 
> repartition/coalesce at all, and also when tried with Java.*
> *_Question_*
> 1) Why doesn't the {{worker1}} {{/opt/outputFile/}} output directory have 
> {{part-}} files just like {{worker2}}? Why is a {{_temporary}} directory 
> created, with {{part-xxx-xx}} files residing in the {{task-xxx}} directories?






[jira] [Updated] (SPARK-25293) Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx instead of directly saving in outputDir

2018-09-13 Thread omkar puttagunta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

omkar puttagunta updated SPARK-25293:
-
Affects Version/s: 2.1.3

> Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx 
> instead of directly saving in outputDir
> --
>
> Key: SPARK-25293
> URL: https://issues.apache.org/jira/browse/SPARK-25293
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Java API, Spark Shell, Spark Submit
>Affects Versions: 2.0.2, 2.1.3
>Reporter: omkar puttagunta
>Priority: Major
>
> [https://stackoverflow.com/questions/52108335/why-spark-dataframe-writes-part-files-to-temporary-in-instead-directly-creating]
> {quote}Running Spark 2.0.2 in Standalone Cluster Mode; 2 workers and 1 master 
> node on AWS EC2
> {quote}
> Simple test: read a pipe-delimited file and write the data to CSV. The commands 
> below are executed in spark-shell with the master URL set.
> {{val df = 
> spark.sqlContext.read.option("delimiter","|").option("quote","\u").csv("/home/input-files/")
>  val emailDf=df.filter("_c3='EML'") 
> emailDf.repartition(100).write.csv("/opt/outputFile/")}}
> After executing the commands above:
> {quote}In {{worker1}} -> each part file is created 
> in {{/opt/outputFile/_temporary/task-x-xxx/part-xxx-xxx}}
>  In {{worker2}} -> {{/opt/outputFile/part-xxx}} => part files are generated 
> directly under the outputDirectory specified during the write.
> {quote}
> *The same thing happens with coalesce(100), or without specifying 
> repartition/coalesce at all, and also when tried with Java.*
> *_Question_*
> 1) Why doesn't the {{worker1}} {{/opt/outputFile/}} output directory have 
> {{part-}} files just like {{worker2}}? Why is a {{_temporary}} directory 
> created, with {{part-xxx-xx}} files residing in the {{task-xxx}} directories?






[jira] [Comment Edited] (SPARK-23200) Reset configuration when restarting from checkpoints

2018-09-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614208#comment-16614208
 ] 

Stavros Kontopoulos edited comment on SPARK-23200 at 9/14/18 1:07 AM:
--

[~cloud_fan] I think this should go in 2.4 even though it's a bit late; the 
remaining PR is trivial. A few properties need to be restored from the 
checkpoint, and of course it needs testing. I can do the testing if we can get 
it into 2.4 soon. [~foxish] thoughts?


was (Author: skonto):
[~cloud_fan]I think this should go in 2.4 even though its a bit late, the 
remaining PR is trivial. A few properties need to be restored from the 
checkpoint, and of course it needs testing. I can do the testing if we can get 
it in 2.4 soon.

> Reset configuration when restarting from checkpoints
> 
>
> Key: SPARK-23200
> URL: https://issues.apache.org/jira/browse/SPARK-23200
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>
> Streaming workloads and restarting from checkpoints may need additional 
> changes, i.e. resetting properties -  see 
> https://github.com/apache-spark-on-k8s/spark/pull/516






[jira] [Comment Edited] (SPARK-23200) Reset configuration when restarting from checkpoints

2018-09-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614208#comment-16614208
 ] 

Stavros Kontopoulos edited comment on SPARK-23200 at 9/14/18 1:04 AM:
--

[~cloud_fan] I think this should go in 2.4 even though it's a bit late; the 
remaining PR is trivial. A few properties need to be restored from the 
checkpoint, and of course it needs testing. I can do the testing if we can get 
it into 2.4 soon.


was (Author: skonto):
[~cloud_fan]I think this should go in 2.4 even though its a bit late, the 
remaining PR is trivial. A few properties need to be restored from the 
checkpoint, and of course it needs testing. I can do it if we can get it in 2.4 
soon.

> Reset configuration when restarting from checkpoints
> 
>
> Key: SPARK-23200
> URL: https://issues.apache.org/jira/browse/SPARK-23200
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>
> Streaming workloads and restarting from checkpoints may need additional 
> changes, i.e. resetting properties -  see 
> https://github.com/apache-spark-on-k8s/spark/pull/516






[jira] [Comment Edited] (SPARK-23200) Reset configuration when restarting from checkpoints

2018-09-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614208#comment-16614208
 ] 

Stavros Kontopoulos edited comment on SPARK-23200 at 9/14/18 1:03 AM:
--

[~cloud_fan] I think this should go in 2.4 even though it's a bit late; the 
remaining PR is trivial. A few properties need to be restored from the 
checkpoint, and of course it needs testing. I can do the testing if we can get 
it into 2.4 soon.


was (Author: skonto):
[~cloud_fan]I think this should go in 2.4 even though its a bit late, the 
remaining PR is trivial s few properties need to be restored from the 
checkpoint, and of course it needs testing. I can do it if we can get it in 2.4 
soon.

> Reset configuration when restarting from checkpoints
> 
>
> Key: SPARK-23200
> URL: https://issues.apache.org/jira/browse/SPARK-23200
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>
> Streaming workloads and restarting from checkpoints may need additional 
> changes, i.e. resetting properties -  see 
> https://github.com/apache-spark-on-k8s/spark/pull/516






[jira] [Comment Edited] (SPARK-23200) Reset configuration when restarting from checkpoints

2018-09-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614208#comment-16614208
 ] 

Stavros Kontopoulos edited comment on SPARK-23200 at 9/14/18 1:03 AM:
--

[~cloud_fan] I think this should go in 2.4 even though it's a bit late; the 
remaining PR is trivial. A few properties need to be restored from the 
checkpoint, and of course it needs testing. I can do it if we can get it into 2.4 
soon.


was (Author: skonto):
[~cloud_fan]I think this should go in 2.4 even though its a bit late.

> Reset configuration when restarting from checkpoints
> 
>
> Key: SPARK-23200
> URL: https://issues.apache.org/jira/browse/SPARK-23200
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>
> Streaming workloads and restarting from checkpoints may need additional 
> changes, i.e. resetting properties -  see 
> https://github.com/apache-spark-on-k8s/spark/pull/516






[jira] [Commented] (SPARK-23200) Reset configuration when restarting from checkpoints

2018-09-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614208#comment-16614208
 ] 

Stavros Kontopoulos commented on SPARK-23200:
-

[~cloud_fan] I think this should go in 2.4 even though it's a bit late.

> Reset configuration when restarting from checkpoints
> 
>
> Key: SPARK-23200
> URL: https://issues.apache.org/jira/browse/SPARK-23200
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>
> Streaming workloads and restarting from checkpoints may need additional 
> changes, i.e. resetting properties -  see 
> https://github.com/apache-spark-on-k8s/spark/pull/516






[jira] [Updated] (SPARK-23200) Reset configuration when restarting from checkpoints

2018-09-13 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-23200:

Issue Type: Bug  (was: Improvement)

> Reset configuration when restarting from checkpoints
> 
>
> Key: SPARK-23200
> URL: https://issues.apache.org/jira/browse/SPARK-23200
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>
> Streaming workloads and restarting from checkpoints may need additional 
> changes, i.e. resetting properties -  see 
> https://github.com/apache-spark-on-k8s/spark/pull/516






[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4

2018-09-13 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614200#comment-16614200
 ] 

Liang-Chi Hsieh commented on SPARK-25378:
-

The fix looks like this: 
https://github.com/apache/spark/compare/master...viirya:SPARK-25378?expand=1

If this looks OK, I can submit a PR with it.

cc [~mengxr] [~cloud_fan] [~hyukjin.kwon]. Please let me know. Thanks.
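For anyone hitting this before the fix lands, a minimal sketch (editor's addition, not the linked patch) of the 2.4 expectation that string elements are stored internally as {{UTF8String}}:

{code:scala}
// Hedged sketch: wrap plain Strings as UTF8String before building the ArrayData,
// so the getUTF8String accessor used by toArray(StringType) finds the expected type.
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.types.StringType
import org.apache.spark.unsafe.types.UTF8String

val data = ArrayData.toArrayData(Array("a", "b").map(s => UTF8String.fromString(s)))
val strings = data.toArray[UTF8String](StringType).map(_.toString)  // Array("a", "b")
{code}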

> ArrayData.toArray(StringType) assume UTF8String in 2.4
> --
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The following code works in 2.3.1 but fails in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]






[jira] [Comment Edited] (SPARK-23200) Reset configuration when restarting from checkpoints

2018-09-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614196#comment-16614196
 ] 

Stavros Kontopoulos edited comment on SPARK-23200 at 9/14/18 12:45 AM:
---

This is important and should have been filed as a bug, not an improvement. 
Checkpointing needs to work, as it is used by default in many production setups. 
We should have an integration test for it.

 


was (Author: skonto):
This is important and should have been a bug not an improvement. Checkpointing 
needs to work as it is used by default in many cases in production.

 

> Reset configuration when restarting from checkpoints
> 
>
> Key: SPARK-23200
> URL: https://issues.apache.org/jira/browse/SPARK-23200
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>
> Streaming workloads and restarting from checkpoints may need additional 
> changes, i.e. resetting properties -  see 
> https://github.com/apache-spark-on-k8s/spark/pull/516






[jira] [Comment Edited] (SPARK-23200) Reset configuration when restarting from checkpoints

2018-09-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614196#comment-16614196
 ] 

Stavros Kontopoulos edited comment on SPARK-23200 at 9/14/18 12:44 AM:
---

This is important and should have been filed as a bug, not an improvement. 
Checkpointing needs to work, as it is used by default in many production setups.

 


was (Author: skonto):
This is important and should have been a bug not an improvement. Checkpointing 
needs to work.

 

> Reset configuration when restarting from checkpoints
> 
>
> Key: SPARK-23200
> URL: https://issues.apache.org/jira/browse/SPARK-23200
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>
> Streaming workloads and restarting from checkpoints may need additional 
> changes, i.e. resetting properties -  see 
> https://github.com/apache-spark-on-k8s/spark/pull/516






[jira] [Commented] (SPARK-23200) Reset configuration when restarting from checkpoints

2018-09-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614196#comment-16614196
 ] 

Stavros Kontopoulos commented on SPARK-23200:
-

This is important and should have been filed as a bug, not an improvement. 
Checkpointing needs to work.

 

> Reset configuration when restarting from checkpoints
> 
>
> Key: SPARK-23200
> URL: https://issues.apache.org/jira/browse/SPARK-23200
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>
> Streaming workloads and restarting from checkpoints may need additional 
> changes, i.e. resetting properties -  see 
> https://github.com/apache-spark-on-k8s/spark/pull/516






[jira] [Commented] (SPARK-25344) Break large tests.py files into smaller files

2018-09-13 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614176#comment-16614176
 ] 

Bryan Cutler commented on SPARK-25344:
--

From the mailing list, I think we should agree on a few things first:

1. When to create a separate test file: one per module? And how should they be 
named, e.g. "test_rdd.py"?
2. Where to put the test files: in the same directory as the source, or in a 
subdirectory named "tests"?
3. Do we start splitting tests immediately as new tests are written, or 
incrementally as subtasks of this JIRA?

> Break large tests.py files into smaller files
> -
>
> Key: SPARK-25344
> URL: https://issues.apache.org/jira/browse/SPARK-25344
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>
> We've got a ton of tests in one humongous tests.py file, rather than breaking 
> it out into smaller files.
> Having one huge file doesn't seem great for code organization, and it also 
> makes the test parallelization in run-tests.py not work as well.  On my 
> laptop, tests.py takes 150s, and the next longest test file takes only 20s.  
> There are similarly large files in other pyspark modules, e.g. sql/tests.py, 
> ml/tests.py, mllib/tests.py, streaming/tests.py.
> It seems that at least some of these files are already broken into 
> independent test classes, so it shouldn't be too hard to just move those into 
> their own files.






[jira] [Commented] (SPARK-25053) Allow additional port forwarding on Spark on K8S as needed

2018-09-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614111#comment-16614111
 ] 

Stavros Kontopoulos commented on SPARK-25053:
-

This is going to be covered by the pod template PR. I already checked it with 
JProfiler and it works as expected.

> Allow additional port forwarding on Spark on K8S as needed
> --
>
> Key: SPARK-25053
> URL: https://issues.apache.org/jira/browse/SPARK-25053
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>
> In some cases, like setting up remote debuggers, adding additional ports to 
> be forwarded would be useful.






[jira] [Updated] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata

2018-09-13 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25423:

Labels: starter  (was: )

> Output "dataFilters" in DataSourceScanExec.metadata
> ---
>
> Key: SPARK-25423
> URL: https://issues.apache.org/jira/browse/SPARK-25423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Priority: Trivial
>  Labels: starter
>







[jira] [Commented] (SPARK-25164) Parquet reader builds entire list of columns once for each column

2018-09-13 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614020#comment-16614020
 ] 

Ruslan Dautkhanov commented on SPARK-25164:
---

Hi [~bersprockets]

Thanks a lot for the detailed response. I totally see what you're saying.

It's interesting that Spark realizes all rows even though the where filter has a 
predicate for just one column.

I am wondering whether it's feasible to lazily realize the list of columns in the 
select clause only after filtering is complete?

It seems like it could be a huge performance improvement for wide tables like this.

In other words, Spark would realize the list of columns specified in the where 
clause first, and only after filtering realize the rest of the columns needed for 
the select clause.

Thoughts?

Thank you!
Ruslan

 

> Parquet reader builds entire list of columns once for each column
> -
>
> Key: SPARK-25164
> URL: https://issues.apache.org/jira/browse/SPARK-25164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Minor
> Fix For: 2.2.3, 2.3.2, 2.4.0
>
>
> {{VectorizedParquetRecordReader.initializeInternal}} loops through each 
> column, and for each column it calls
> {noformat}
> requestedSchema.getColumns().get(i)
> {noformat}
> However, {{MessageType.getColumns}} will build the entire column list from 
> getPaths(0).
> {noformat}
>   public List<ColumnDescriptor> getColumns() {
>     List<String[]> paths = this.getPaths(0);
>     List<ColumnDescriptor> columns = new ArrayList<ColumnDescriptor>(paths.size());
>     for (String[] path : paths) {
>       // TODO: optimize this
>       PrimitiveType primitiveType = getType(path).asPrimitiveType();
>       columns.add(new ColumnDescriptor(
>           path,
>           primitiveType,
>           getMaxRepetitionLevel(path),
>           getMaxDefinitionLevel(path)));
>     }
>     return columns;
>   }
> {noformat}
> This means that for each parquet file, this routine indirectly iterates 
> colCount*colCount times.
> This is actually not particularly noticeable unless you have:
>  - many parquet files
>  - many columns
> To verify that this is an issue, I created a 1 million record parquet table 
> with 6000 columns of type double and 67 files (so initializeInternal is 
> called 67 times). I ran the following query:
> {noformat}
> sql("select * from 6000_1m_double where id1 = 1").collect
> {noformat}
> I used Spark from the master branch. I had 8 executor threads. The filter 
> returns only a few thousand records. The query ran (on average) for 6.4 
> minutes.
> Then I cached the column list at the top of {{initializeInternal}} as follows:
> {noformat}
> List<ColumnDescriptor> columnCache = requestedSchema.getColumns();
> {noformat}
> Then I changed {{initializeInternal}} to use {{columnCache}} rather than 
> {{requestedSchema.getColumns()}}.
> With the column cache variable, the same query runs in 5 minutes. So with my 
> simple query, you save 22% of the time by not rebuilding the column list for each 
> column.
> You get additional savings with a paths cache variable, now saving 34% in 
> total on the above query.






[jira] [Comment Edited] (SPARK-25164) Parquet reader builds entire list of columns once for each column

2018-09-13 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614020#comment-16614020
 ] 

Ruslan Dautkhanov edited comment on SPARK-25164 at 9/13/18 8:19 PM:


Hi [~bersprockets]

Thanks a lot for the detailed response. I totally see what you're saying.

It's interesting that Spark realizes all rows even though the where filter has a 
predicate for just one column.

I am wondering whether it's feasible to lazily realize the list of columns in the 
select clause only after filtering is complete?

It seems like it could be a huge performance improvement for wide tables like this.

In other words, Spark would realize the list of columns specified in the where 
clause first, and only after filtering realize the rest of the columns needed for 
the select clause.

Thoughts?

Thank you!
Ruslan

 


was (Author: tagar):
Hi [~bersprockets]

 

Thanks a lot for the detailed response.

I totally see with what you're saying.

That's interesting that Spark realizing all rows even though where filter has a 
predicate for just one column.

I am thinking if it's feasible to lazily realize list of columns in 
select-clause only after filtering is complete?

It seems could be a huge performance improvement for wider tables like this.

In other words, if Spark would realize list of columns specified in where 
clause first, and only after filtering 
realize rest of columns needed for select-clause.

Thoughts? 

Thank you!
Ruslan

 

> Parquet reader builds entire list of columns once for each column
> -
>
> Key: SPARK-25164
> URL: https://issues.apache.org/jira/browse/SPARK-25164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Minor
> Fix For: 2.2.3, 2.3.2, 2.4.0
>
>
> {{VectorizedParquetRecordReader.initializeInternal}} loops through each 
> column, and for each column it calls
> {noformat}
> requestedSchema.getColumns().get(i)
> {noformat}
> However, {{MessageType.getColumns}} will build the entire column list from 
> getPaths(0).
> {noformat}
>   public List<ColumnDescriptor> getColumns() {
>     List<String[]> paths = this.getPaths(0);
>     List<ColumnDescriptor> columns = new ArrayList<ColumnDescriptor>(paths.size());
>     for (String[] path : paths) {
>       // TODO: optimize this
>       PrimitiveType primitiveType = getType(path).asPrimitiveType();
>       columns.add(new ColumnDescriptor(
>           path,
>           primitiveType,
>           getMaxRepetitionLevel(path),
>           getMaxDefinitionLevel(path)));
>     }
>     return columns;
>   }
> {noformat}
> This means that for each parquet file, this routine indirectly iterates 
> colCount*colCount times.
> This is actually not particularly noticeable unless you have:
>  - many parquet files
>  - many columns
> To verify that this is an issue, I created a 1 million record parquet table 
> with 6000 columns of type double and 67 files (so initializeInternal is 
> called 67 times). I ran the following query:
> {noformat}
> sql("select * from 6000_1m_double where id1 = 1").collect
> {noformat}
> I used Spark from the master branch. I had 8 executor threads. The filter 
> returns only a few thousand records. The query ran (on average) for 6.4 
> minutes.
> Then I cached the column list at the top of {{initializeInternal}} as follows:
> {noformat}
> List<ColumnDescriptor> columnCache = requestedSchema.getColumns();
> {noformat}
> Then I changed {{initializeInternal}} to use {{columnCache}} rather than 
> {{requestedSchema.getColumns()}}.
> With the column cache variable, the same query runs in 5 minutes. So with my 
> simple query, you save 22% of the time by not rebuilding the column list for each 
> column.
> You get additional savings with a paths cache variable, now saving 34% in 
> total on the above query.






[jira] [Created] (SPARK-25425) Extra options must overwrite session options

2018-09-13 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25425:
--

 Summary: Extra options must overwrite session options
 Key: SPARK-25425
 URL: https://issues.apache.org/jira/browse/SPARK-25425
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


In the load() and save() methods of DataSource V2, extra options are overwritten by 
session options:
* 
https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L244-L245
* 
https://github.com/apache/spark/blob/c9cb393dc414ae98093c1541d09fa3c8663ce276/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L205

but the implementation should be the opposite: the more specific extra options set via 
*.option(...)* must overwrite the more general session options.
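A minimal sketch (editor's addition, plain Scala rather than the actual DataFrameReader/Writer code; keys and values are made up) of the intended precedence: with Scala's {{Map ++}}, the right-hand operand wins on duplicate keys, so the session defaults should come first and the per-call extra options last.

{code:scala}
// Hedged sketch of the desired merge order for duplicate option keys.
val sessionOptions = Map("compression" -> "snappy", "header" -> "true")  // session-level defaults
val extraOptions   = Map("compression" -> "gzip")                        // set via .option(...)

val sessionWins = extraOptions ++ sessionOptions   // what the description reports today: "snappy"
val extraWins   = sessionOptions ++ extraOptions   // the desired behaviour: "gzip"

assert(extraWins("compression") == "gzip")
{code}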







[jira] [Commented] (SPARK-21291) R bucketBy partitionBy API

2018-09-13 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613994#comment-16613994
 ] 

Felix Cheung commented on SPARK-21291:
--

No, you wouldn't return a writer in R. I will reply with more details in a few 
days.
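For context, a minimal sketch (editor's addition; {{df}}, the table name, and the column names are made up) of the Scala {{DataFrameWriter}} calls that the requested SparkR API would mirror:

{code:scala}
// Hedged sketch: the Scala-side bucketing/partitioning writer API that SparkR currently lacks.
df.write
  .partitionBy("year")        // directory-level partitioning
  .bucketBy(8, "user_id")     // hash the rows into 8 buckets by user_id
  .sortBy("user_id")          // sort rows within each bucket
  .saveAsTable("events_bucketed")
{code}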




> R bucketBy partitionBy API
> --
>
> Key: SPARK-21291
> URL: https://issues.apache.org/jira/browse/SPARK-21291
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Priority: Major
>
> partitionBy exists, but it is for WindowSpec only.






[jira] [Assigned] (SPARK-25400) Increase timeouts in schedulerIntegrationSuite

2018-09-13 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25400:
-

Assignee: Imran Rashid

> Increase timeouts in schedulerIntegrationSuite
> --
>
> Key: SPARK-25400
> URL: https://issues.apache.org/jira/browse/SPARK-25400
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> I just took a look at a flaky failure in {{SchedulerIntegrationSuite}} 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95887
>  it seems the timeout really is too short:
> {noformat}
> 18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task 
> 5.0 in stage 1.0 (TID 8, localhost, executor driver, partition 5, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:07.821 task-result-getter-2 INFO TaskSetManager: Finished task 
> 3.0 in stage 1.0 (TID 6) in 1 ms on localhost (executor driver) (4/10)
> 18/09/10 11:14:07.821 task-result-getter-0 INFO TaskSetManager: Finished task 
> 4.0 in stage 1.0 (TID 7) in 1 ms on localhost (executor driver) (5/10)
> 18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task 
> 6.0 in stage 1.0 (TID 9, localhost, executor driver, partition 6, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:07.821 task-result-getter-1 INFO TaskSetManager: Finished task 
> 5.0 in stage 1.0 (TID 8) in 0 ms on localhost (executor driver) (6/10)
> 18/09/10 11:14:09.481 mock backend thread INFO TaskSetManager: Starting task 
> 7.0 in stage 1.0 (TID 10, localhost, executor driver, partition 7, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:09.482 dispatcher-event-loop-14 INFO BlockManagerInfo: Removed 
> broadcast_0_piece0 on amp-jenkins-worker-05.amp:36913 in memory (size: 1260.0 
> B, free: 1638.6 MB)
> {noformat}
> you'll see that the "mock backend thread" does keep making progress, but for 
> whatever reason there is a delay of over one second in the middle.  That's 
> already going over the existing timeouts.
> It's possible there is something else going on here, but for now just 
> increasing the timeouts seems like the best next step.






[jira] [Resolved] (SPARK-25400) Increase timeouts in schedulerIntegrationSuite

2018-09-13 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25400.
---
   Resolution: Fixed
Fix Version/s: 2.3.2
   2.4.0

Issue resolved by pull request 22385
[https://github.com/apache/spark/pull/22385]

> Increase timeouts in schedulerIntegrationSuite
> --
>
> Key: SPARK-25400
> URL: https://issues.apache.org/jira/browse/SPARK-25400
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0, 2.3.2
>
>
> I just took a look at a flaky failure in {{SchedulerIntegrationSuite}} 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95887
>  it seems the timeout really is too short:
> {noformat}
> 18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task 
> 5.0 in stage 1.0 (TID 8, localhost, executor driver, partition 5, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:07.821 task-result-getter-2 INFO TaskSetManager: Finished task 
> 3.0 in stage 1.0 (TID 6) in 1 ms on localhost (executor driver) (4/10)
> 18/09/10 11:14:07.821 task-result-getter-0 INFO TaskSetManager: Finished task 
> 4.0 in stage 1.0 (TID 7) in 1 ms on localhost (executor driver) (5/10)
> 18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task 
> 6.0 in stage 1.0 (TID 9, localhost, executor driver, partition 6, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:07.821 task-result-getter-1 INFO TaskSetManager: Finished task 
> 5.0 in stage 1.0 (TID 8) in 0 ms on localhost (executor driver) (6/10)
> 18/09/10 11:14:09.481 mock backend thread INFO TaskSetManager: Starting task 
> 7.0 in stage 1.0 (TID 10, localhost, executor driver, partition 7, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:09.482 dispatcher-event-loop-14 INFO BlockManagerInfo: Removed 
> broadcast_0_piece0 on amp-jenkins-worker-05.amp:36913 in memory (size: 1260.0 
> B, free: 1638.6 MB)
> {noformat}
> you'll see that the "mock backend thread" does keep making progress, but for 
> whatever reason there is a delay of over one second in the middle.  That's 
> already going over the existing timeouts.
> It's possible there is something else going on here, but for now just 
> increasing the timeouts seems like the best next step.






[jira] [Resolved] (SPARK-25338) Several tests miss calling super.afterAll() in their afterAll() method

2018-09-13 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25338.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22337
[https://github.com/apache/spark/pull/22337]

> Several tests miss calling super.afterAll() in their afterAll() method
> --
>
> Key: SPARK-25338
> URL: https://issues.apache.org/jira/browse/SPARK-25338
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 3.0.0
>
>
> The following tests under {{external}} may not call {{super.afterAll()}} in 
> their {{afterAll()}} method.
> {code}
> external/flume/src/test/scala/org/apache/spark/streaming/flume/FlumePollingStreamSuite.scala
> external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaRelationSuite.scala
> external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSinkSuite.scala
> external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala
> external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala
> external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala
> external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/KafkaClusterSuite.scala
> external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala
> external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/ReliableKafkaStreamSuite.scala
> external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisInputDStreamBuilderSuite.scala
> external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisStreamSuite.scala
> {code}
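For reference, a minimal sketch (editor's addition, assuming ScalaTest's {{BeforeAndAfterAll}}) of the kind of teardown pattern such suites typically use, so the parent suite's cleanup runs even if the suite's own teardown throws:

{code:scala}
// Hedged sketch of an afterAll() that always delegates to the parent suite's cleanup.
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class ExampleStreamSuite extends FunSuite with BeforeAndAfterAll {
  override def afterAll(): Unit = {
    try {
      // suite-specific cleanup (stop brokers, delete temp dirs, ...) would go here
    } finally {
      super.afterAll()  // never skip the parent's cleanup
    }
  }
}
{code}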






[jira] [Assigned] (SPARK-25338) Several tests miss calling super.afterAll() in their afterAll() method

2018-09-13 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25338:
-

Assignee: Kazuaki Ishizaki

> Several tests miss calling super.afterAll() in their afterAll() method
> --
>
> Key: SPARK-25338
> URL: https://issues.apache.org/jira/browse/SPARK-25338
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 3.0.0
>
>
> The following tests under {{external}} may not call {{super.afterAll()}} in 
> their {{afterAll()}} method.
> {code}
> external/flume/src/test/scala/org/apache/spark/streaming/flume/FlumePollingStreamSuite.scala
> external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaRelationSuite.scala
> external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSinkSuite.scala
> external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala
> external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala
> external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala
> external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/KafkaClusterSuite.scala
> external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala
> external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/ReliableKafkaStreamSuite.scala
> external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisInputDStreamBuilderSuite.scala
> external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisStreamSuite.scala
> {code}






[jira] [Updated] (SPARK-25424) Window duration and slide duration with negative values should fail fast

2018-09-13 Thread Raghav Kumar Gautam (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raghav Kumar Gautam updated SPARK-25424:

Fix Version/s: (was: 2.3.2)
   2.4.0

> Window duration and slide duration with negative values should fail fast
> 
>
> Key: SPARK-25424
> URL: https://issues.apache.org/jira/browse/SPARK-25424
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Raghav Kumar Gautam
>Priority: Major
> Fix For: 2.4.0
>
>
> In the TimeWindow class, the window duration and slide duration should not be 
> allowed to take negative values.
> Currently this behaviour is enforced by Catalyst. It could instead be enforced by 
> the constructor of TimeWindow, allowing it to fail fast.
> For example, the code below throws the following error. Note that the error is 
> produced at the time of the count() call instead of the window() call.
> {code:java}
> val df = spark.readStream
>   .format("rate")
>   .option("numPartitions", "2")
>   .option("rowsPerSecond", "10")
>   .load()
>   .filter("value % 20 == 0")
>   .withWatermark("timestamp", "10 seconds")
>   .groupBy(window($"timestamp", "-10 seconds", "5 seconds"))
>   .count()
> {code}
> Error:
> {code:java}
> cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data 
> type mismatch: The window duration (-1000) must be greater than 0.;;
> 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
> [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
> count(1) AS count#57L]
> +- AnalysisBarrier
>   +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
>  +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
> +- StreamingRelationV2 
> org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
> Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
> StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
>  -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]
> org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, 
> -1000, 500, 0)' due to data type mismatch: The window duration 
> (-1000) must be greater than 0.;;
> 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
> [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
> count(1) AS count#57L]
> +- AnalysisBarrier
>   +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
>  +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
> +- StreamingRelationV2 
> org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
> Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
> StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
>  -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118)
>   at 
> 

[jira] [Updated] (SPARK-25424) Window duration and slide duration with negative values should fail fast

2018-09-13 Thread Raghav Kumar Gautam (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raghav Kumar Gautam updated SPARK-25424:

Description: 
In the TimeWindow class, window duration and slide duration should not be allowed 
to take negative values.

Currently this behaviour is enforced by Catalyst. It could instead be enforced by 
the constructor of TimeWindow, allowing it to fail fast.
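
A self-contained sketch of the fail-fast idea (a hypothetical {{WindowSpecSketch}} 
class for illustration; not the actual TimeWindow implementation or patch):
{code:java}
// Validating the durations in the constructor surfaces the mistake as soon as
// the window spec is built, instead of at analysis time.
final case class WindowSpecSketch(windowMs: Long, slideMs: Long) {
  require(windowMs > 0, s"The window duration ($windowMs) must be greater than 0.")
  require(slideMs > 0, s"The slide duration ($slideMs) must be greater than 0.")
}

object WindowSpecSketchDemo {
  def main(args: Array[String]): Unit = {
    // Throws IllegalArgumentException immediately.
    WindowSpecSketch(windowMs = -10000L, slideMs = 5000L)
  }
}
{code}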

For example, the code below throws the following error. Note that the error is 
produced at the time of the count() call instead of the window() call.
{code:java}
val df = spark.readStream
  .format("rate")
  .option("numPartitions", "2")
  .option("rowsPerSecond", "10")
  .load()
  .filter("value % 20 == 0")
  .withWatermark("timestamp", "10 seconds")
  .groupBy(window($"timestamp", "-10 seconds", "5 seconds"))
  .count()
{code}
Error:
{code:java}
cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data type 
mismatch: The window duration (-1000) must be greater than 0.;;
'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
[timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
count(1) AS count#57L]
+- AnalysisBarrier
  +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
 +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
+- StreamingRelationV2 
org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
StreamingRelation 
DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
 -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]

org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, 
-1000, 500, 0)' due to data type mismatch: The window duration 
(-1000) must be greater than 0.;;
'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
[timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
count(1) AS count#57L]
+- AnalysisBarrier
  +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
 +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
+- StreamingRelationV2 
org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
StreamingRelation 
DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
 -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]

at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:122)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 

[jira] [Commented] (SPARK-21291) R bucketBy partitionBy API

2018-09-13 Thread Huaxin Gao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613832#comment-16613832
 ] 

Huaxin Gao commented on SPARK-21291:


[~felixcheung]

I am working on this, but I am not sure whether my approach is correct. I am 
thinking of having the following code:
{code:java}
setMethod("write.partitionBy",
          signature(x = "SparkDataFrame"),
          function(x, ...) {
            jcols <- lapply(list(...), function(arg) {
              stopifnot(class(arg) == "character")
              arg
            })
            write <- callJMethod(x@sdf, "write")
            invisible(handledCallJMethod(write, "partitionBy", jcols))
          })
{code}
The method returns a DataFrameWriter, but it seems that a DataFrameWriter can't 
be used directly in R. The DataFrameWriter methods, for example text(path: String), 
are implemented in R as write.text in DataFrame.R, so I am not sure whether it is 
correct for me to return a DataFrameWriter from partitionBy.
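
For reference, the Scala API being wrapped chains calls on a single DataFrameWriter, 
which is what makes a writer-returning function awkward in the SparkR write.* style. 
A minimal Scala usage sketch (the output path is a placeholder):
{code:java}
import org.apache.spark.sql.SparkSession

object PartitionByExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("partitionBy-example").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "2018-09-13", "a"), (2, "2018-09-14", "b")).toDF("id", "date", "value")

    // partitionBy returns the same DataFrameWriter, so calls are chained and the
    // terminal call (parquet/json/saveAsTable/...) performs the actual write.
    df.write.partitionBy("date").parquet("/tmp/partitioned_output")

    spark.stop()
  }
}
{code}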

> R bucketBy partitionBy API
> --
>
> Key: SPARK-21291
> URL: https://issues.apache.org/jira/browse/SPARK-21291
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Priority: Major
>
> partitionBy exists but it's for windowspec only



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25424) Window duration and slide duration with negative values should fail fast

2018-09-13 Thread Raghav Kumar Gautam (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613830#comment-16613830
 ] 

Raghav Kumar Gautam commented on SPARK-25424:
-

I have a patch for this. Can someone assign this issue to me?

> Window duration and slide duration with negative values should fail fast
> 
>
> Key: SPARK-25424
> URL: https://issues.apache.org/jira/browse/SPARK-25424
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Raghav Kumar Gautam
>Priority: Major
> Fix For: 2.3.2
>
>
> In the TimeWindow class, window duration and slide duration should not be 
> allowed to take negative values.
> Currently this is enforced by Catalyst. It could instead be enforced by the 
> constructor of TimeWindow, allowing it to fail fast.
> For example, the code below throws the following error. Note that the error is 
> produced at the time of the count() call instead of the window() call.
> {code}
> val df = spark.readStream
>   .format("rate")
>   .option("numPartitions", "2")
>   .option("rowsPerSecond", "10")
>   .load()
>   .filter("value % 20 == 0")
>   .withWatermark("timestamp", "10 seconds")
>   .groupBy(window($"timestamp", "-10 seconds", "5 seconds"))
>   .count()
> {code}
> Error:
> {code}
> cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data 
> type mismatch: The window duration (-1000) must be greater than 0.;;
> 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
> [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
> count(1) AS count#57L]
> +- AnalysisBarrier
>   +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
>  +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
> +- StreamingRelationV2 
> org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
> Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
> StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
>  -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]
> org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, 
> -1000, 500, 0)' due to data type mismatch: The window duration 
> (-1000) must be greater than 0.;;
> 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
> [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
> count(1) AS count#57L]
> +- AnalysisBarrier
>   +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
>  +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
> +- StreamingRelationV2 
> org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
> Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
> StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
>  -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118)
>   at 
> 

[jira] [Updated] (SPARK-25424) Window duration and slide duration with negative values should fail fast

2018-09-13 Thread Raghav Kumar Gautam (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raghav Kumar Gautam updated SPARK-25424:

Target Version/s: 2.4.0  (was: 2.3.2)

> Window duration and slide duration with negative values should fail fast
> 
>
> Key: SPARK-25424
> URL: https://issues.apache.org/jira/browse/SPARK-25424
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Raghav Kumar Gautam
>Priority: Major
> Fix For: 2.3.2
>
>
> In the TimeWindow class, window duration and slide duration should not be 
> allowed to take negative values.
> Currently this is enforced by Catalyst. It could instead be enforced by the 
> constructor of TimeWindow, allowing it to fail fast.
> For example, the code below throws the following error. Note that the error is 
> produced at the time of the count() call instead of the window() call.
> {code}
> val df = spark.readStream
>   .format("rate")
>   .option("numPartitions", "2")
>   .option("rowsPerSecond", "10")
>   .load()
>   .filter("value % 20 == 0")
>   .withWatermark("timestamp", "10 seconds")
>   .groupBy(window($"timestamp", "-10 seconds", "5 seconds"))
>   .count()
> {code}
> Error:
> {code}
> cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data 
> type mismatch: The window duration (-1000) must be greater than 0.;;
> 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
> [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
> count(1) AS count#57L]
> +- AnalysisBarrier
>   +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
>  +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
> +- StreamingRelationV2 
> org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
> Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
> StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
>  -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]
> org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, 
> -1000, 500, 0)' due to data type mismatch: The window duration 
> (-1000) must be greater than 0.;;
> 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
> [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
> count(1) AS count#57L]
> +- AnalysisBarrier
>   +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
>  +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
> +- StreamingRelationV2 
> org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
> Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
> StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
>  -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:122)
>   at 
> 

[jira] [Created] (SPARK-25424) Window duration and slide duration with negative values should fail fast

2018-09-13 Thread Raghav Kumar Gautam (JIRA)
Raghav Kumar Gautam created SPARK-25424:
---

 Summary: Window duration and slide duration with negative values 
should fail fast
 Key: SPARK-25424
 URL: https://issues.apache.org/jira/browse/SPARK-25424
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Raghav Kumar Gautam
 Fix For: 2.3.2


In the TimeWindow class, window duration and slide duration should not be allowed 
to take negative values.

Currently this is enforced by Catalyst. It could instead be enforced by the 
constructor of TimeWindow, allowing it to fail fast.

For example, the code below throws the following error. Note that the error is 
produced at the time of the count() call instead of the window() call.
{code}
val df = spark.readStream
  .format("rate")
  .option("numPartitions", "2")
  .option("rowsPerSecond", "10")
  .load()
  .filter("value % 20 == 0")
  .withWatermark("timestamp", "10 seconds")
  .groupBy(window($"timestamp", "-10 seconds", "5 seconds"))
  .count()
{code}

Error:
{code}
cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data type 
mismatch: The window duration (-1000) must be greater than 0.;;
'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
[timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
count(1) AS count#57L]
+- AnalysisBarrier
  +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
 +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
+- StreamingRelationV2 
org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
StreamingRelation 
DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
 -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]

org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, 
-1000, 500, 0)' due to data type mismatch: The window duration 
(-1000) must be greater than 0.;;
'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
[timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
count(1) AS count#57L]
+- AnalysisBarrier
  +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
 +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
+- StreamingRelationV2 
org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
StreamingRelation 
DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
 -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]

at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:122)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 

[jira] [Updated] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata

2018-09-13 Thread Maryann Xue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated SPARK-25423:

Summary: Output "dataFilters" in DataSourceScanExec.metadata  (was: Output 
"dataFilters" in DataSourceScanExec.toString)

> Output "dataFilters" in DataSourceScanExec.metadata
> ---
>
> Key: SPARK-25423
> URL: https://issues.apache.org/jira/browse/SPARK-25423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25406) Incorrect usage of withSQLConf method in Parquet schema pruning test suite masks failing tests

2018-09-13 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-25406.
-
   Resolution: Fixed
Fix Version/s: 2.4.0
   3.0.0

Issue resolved by pull request 22394
[https://github.com/apache/spark/pull/22394]

> Incorrect usage of withSQLConf method in Parquet schema pruning test suite 
> masks failing tests
> --
>
> Key: SPARK-25406
> URL: https://issues.apache.org/jira/browse/SPARK-25406
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Michael Allman
>Assignee: Michael Allman
>Priority: Major
> Fix For: 3.0.0, 2.4.0
>
>
> In {{ParquetSchemaPruning.scala}}, we use the helper method {{withSQLConf}} 
> to set configuration settings within the scope of a test. However, the way we 
> use that method is incorrect, and as a result the desired configuration 
> settings are not propagated to the test cases. This masks some test case 
> failures.
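
For context, the pitfall is that the configuration is applied while the test is 
being *registered*, not while its body later runs. A self-contained sketch of that 
scoping problem, using a simplified stand-in for {{withSQLConf}} (suite and helper 
names here are illustrative, not the actual suite code):
{code:java}
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class ConfScopingSketch extends FunSuite {
  private lazy val spark =
    SparkSession.builder().master("local[1]").appName("conf-scoping-sketch").getOrCreate()

  // Simplified stand-in for withSQLConf: set a conf, run the body, restore the old value.
  private def withConf(key: String, value: String)(body: => Unit): Unit = {
    val old = spark.conf.getOption(key)
    spark.conf.set(key, value)
    try body finally old.fold(spark.conf.unset(key))(spark.conf.set(key, _))
  }

  // Incorrect: withConf wraps test registration, so the setting is already
  // restored by the time the test body actually executes.
  withConf("spark.sql.parquet.enableVectorizedReader", "false") {
    test("conf is NOT in effect at run time") {
      assert(spark.conf.get("spark.sql.parquet.enableVectorizedReader") == "false") // fails
    }
  }

  // Correct: scope the setting inside the test body.
  test("conf IS in effect at run time") {
    withConf("spark.sql.parquet.enableVectorizedReader", "false") {
      assert(spark.conf.get("spark.sql.parquet.enableVectorizedReader") == "false") // passes
    }
  }
}
{code}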



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25406) Incorrect usage of withSQLConf method in Parquet schema pruning test suite masks failing tests

2018-09-13 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-25406:
---

Assignee: Michael Allman

> Incorrect usage of withSQLConf method in Parquet schema pruning test suite 
> masks failing tests
> --
>
> Key: SPARK-25406
> URL: https://issues.apache.org/jira/browse/SPARK-25406
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Michael Allman
>Assignee: Michael Allman
>Priority: Major
> Fix For: 2.4.0, 3.0.0
>
>
> In {{ParquetSchemaPruning.scala}}, we use the helper method {{withSQLConf}} 
> to set configuration settings within the scope of a test. However, the way we 
> use that method is incorrect, and as a result the desired configuration 
> settings are not propagated to the test cases. This masks some test case 
> failures.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.toString

2018-09-13 Thread Maryann Xue (JIRA)
Maryann Xue created SPARK-25423:
---

 Summary: Output "dataFilters" in DataSourceScanExec.toString
 Key: SPARK-25423
 URL: https://issues.apache.org/jira/browse/SPARK-25423
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maryann Xue






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25170) Add Task Metrics description to the documentation

2018-09-13 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25170:
-

Assignee: Luca Canali

> Add Task Metrics description to the documentation
> -
>
> Key: SPARK-25170
> URL: https://issues.apache.org/jira/browse/SPARK-25170
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
>  Labels: documentation
> Fix For: 2.4.0
>
>
> The REST API, as well as other instrumentation and tools based on the Spark 
> ListenerBus expose the values of the Executor Task Metrics, which are quite 
> useful for workload/ performance troubleshooting. I propose to add the list 
> of the available Task Metrics with a short description to the monitoring 
> documentation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25170) Add Task Metrics description to the documentation

2018-09-13 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25170.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22397
[https://github.com/apache/spark/pull/22397]

> Add Task Metrics description to the documentation
> -
>
> Key: SPARK-25170
> URL: https://issues.apache.org/jira/browse/SPARK-25170
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
>  Labels: documentation
> Fix For: 2.4.0
>
>
> The REST API, as well as other instrumentation and tools based on the Spark 
> ListenerBus expose the values of the Executor Task Metrics, which are quite 
> useful for workload/ performance troubleshooting. I propose to add the list 
> of the available Task Metrics with a short description to the monitoring 
> documentation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25407) Spark throws a `ParquetDecodingException` when attempting to read a field from a complex type in certain cases of schema merging

2018-09-13 Thread Michael Allman (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-25407:
---
Description: 
Spark supports merging schemata across table partitions in which one partition 
is missing a subfield that's present in another. However, attempting to select 
that missing field with a query that includes a partition pruning predicate 
that filters out the partitions that include that field results in a 
`ParquetDecodingException` when attempting to get the query results.

This bug is specifically exercised by the failing (but ignored) test case 
[https://github.com/apache/spark/blob/f2d35427eedeacceb6edb8a51974a7e8bbb94bc2/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala#L125-L131].

  was:
Spark supports merging schemata across table partitions in which one partition 
is missing a subfield that's present in another. However, attempting to select 
that missing field with a query that includes a partition pruning predicate the 
filters out the partitions that include that field results in a 
`ParquetDecodingException` when attempting to get the query results.

This bug is specifically exercised by the failing (but ignored) test case 
https://github.com/apache/spark/blob/f2d35427eedeacceb6edb8a51974a7e8bbb94bc2/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala#L125-L131.


> Spark throws a `ParquetDecodingException` when attempting to read a field 
> from a complex type in certain cases of schema merging
> 
>
> Key: SPARK-25407
> URL: https://issues.apache.org/jira/browse/SPARK-25407
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Michael Allman
>Priority: Major
>
> Spark supports merging schemata across table partitions in which one 
> partition is missing a subfield that's present in another. However, 
> attempting to select that missing field with a query that includes a 
> partition pruning predicate that filters out the partitions that include that 
> field results in a `ParquetDecodingException` when attempting to get the 
> query results.
> This bug is specifically exercised by the failing (but ignored) test case 
> [https://github.com/apache/spark/blob/f2d35427eedeacceb6edb8a51974a7e8bbb94bc2/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala#L125-L131].
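
A rough reproduction sketch of the scenario described above (the path, partition 
layout, and field names are illustrative, not taken from the test case):
{code:java}
import org.apache.spark.sql.SparkSession

object SchemaMergePruningRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("schema-merge-repro").getOrCreate()
    import spark.implicits._

    // Partition p=1 has struct subfield s.f2; partition p=2 lacks it.
    val base = "/tmp/schema_merge_repro"
    Seq((1, 10, "a")).toDF("id", "f1", "f2")
      .selectExpr("id", "named_struct('f1', f1, 'f2', f2) AS s")
      .write.mode("overwrite").parquet(s"$base/p=1")
    Seq((2, 20)).toDF("id", "f1")
      .selectExpr("id", "named_struct('f1', f1) AS s")
      .write.mode("overwrite").parquet(s"$base/p=2")

    val merged = spark.read.option("mergeSchema", "true").parquet(base)
    // Selecting the missing subfield while pruning down to the partition that
    // lacks it is the case that reportedly fails with ParquetDecodingException.
    merged.where("p = 2").select("s.f2").show()
    spark.stop()
  }
}
{code}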



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25422) flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated (encryption = on) (with replication as stream)

2018-09-13 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613575#comment-16613575
 ] 

Wenchen Fan commented on SPARK-25422:
-

cc [~squito] is it related to the 2GB limitation change?

> flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated 
> (encryption = on) (with replication as stream)
> 
>
> Key: SPARK-25422
> URL: https://issues.apache.org/jira/browse/SPARK-25422
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> stacktrace
> {code}
>  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 7, localhost, executor 1): java.io.IOException: 
> org.apache.spark.SparkException: corrupt remote block broadcast_0_piece0 of 
> broadcast_0: 1651574976 != 1165629262
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1320)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>   at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:84)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:367)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1347)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:373)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: corrupt remote block 
> broadcast_0_piece0 of broadcast_0: 1651574976 != 1165629262
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:167)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:151)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:231)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1313)
>   ... 13 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25422) flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated (encryption = on) (with replication as stream)

2018-09-13 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-25422:
---

 Summary: flaky test: org.apache.spark.DistributedSuite.caching on 
disk, replicated (encryption = on) (with replication as stream)
 Key: SPARK-25422
 URL: https://issues.apache.org/jira/browse/SPARK-25422
 Project: Spark
  Issue Type: Test
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Wenchen Fan


stacktrace
{code}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 
7, localhost, executor 1): java.io.IOException: 
org.apache.spark.SparkException: corrupt remote block broadcast_0_piece0 of 
broadcast_0: 1651574976 != 1165629262
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1320)
at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
at 
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
at 
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:84)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:367)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1347)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:373)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: corrupt remote block 
broadcast_0_piece0 of broadcast_0: 1651574976 != 1165629262
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:167)
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
at scala.collection.immutable.List.foreach(List.scala:392)
at 
org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:151)
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:231)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1313)
... 13 more

{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25402) Null handling in BooleanSimplification

2018-09-13 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25402:
--
Fix Version/s: 2.2.3

> Null handling in BooleanSimplification
> --
>
> Key: SPARK-25402
> URL: https://issues.apache.org/jira/browse/SPARK-25402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.2.3, 2.3.2, 2.4.0
>
>
> SPARK-20350 introduced a bug in BooleanSimplification's null handling. For 
> example, the following case returns a wrong answer.
> {code}
> val schema = StructType.fromDDL("a boolean, b int")
> val rows = Seq(Row(null, 1))
> val rdd = sparkContext.parallelize(rows)
> val df = spark.createDataFrame(rdd, schema)
> checkAnswer(df.where("(NOT a) OR a"), Seq.empty)
> {code}
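
For reference, under SQL three-valued logic {{NOT a}} is NULL when {{a}} is NULL, 
and {{NULL OR NULL}} is NULL, so the row should be filtered out; the wrong answer 
means the simplified plan returns it. A quick check of the expected value 
(illustrative snippet, not part of the fix):
{code:java}
import org.apache.spark.sql.SparkSession

object ThreeValuedLogicCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("three-valued-logic").getOrCreate()
    // NOT NULL is NULL and NULL OR NULL is NULL, so a WHERE clause on this
    // predicate must not return the row.
    spark.sql("SELECT (NOT CAST(NULL AS BOOLEAN)) OR CAST(NULL AS BOOLEAN) AS v").show()
    // +----+
    // |   v|
    // +----+
    // |null|
    // +----+
    spark.stop()
  }
}
{code}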



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25420) Dataset.count() every time is different.

2018-09-13 Thread huanghuai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huanghuai updated SPARK-25420:
--
Priority: Trivial  (was: Major)

> Dataset.count()  every time is different.
> -
>
> Key: SPARK-25420
> URL: https://issues.apache.org/jira/browse/SPARK-25420
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: spark2.3
> standalone
>Reporter: huanghuai
>Priority: Trivial
>  Labels: SQL
>
> Dataset dataset = sparkSession.read().format("csv").option("sep", 
> ",").option("inferSchema", "true")
>  .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true")
>  .option("encoding", "UTF-8")
>  .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count="+dataset.count());
> Dataset dropDuplicates = dataset.dropDuplicates(new 
> String[]\{"DATE","TIME","VEL","COMPANY"});
> System.out.println("dropDuplicates count1="+dropDuplicates.count());
> System.out.println("dropDuplicates count2="+dropDuplicates.count());
> Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 
> and (status = 0 or status = 1)");
> System.out.println("filter count1="+filter.count());
> System.out.println("filter count2="+filter.count());
> System.out.println("filter count3="+filter.count());
> System.out.println("filter count4="+filter.count());
> System.out.println("filter count5="+filter.count());
>  
>  
> --The above is code 
> ---
>  
>  
> console output:
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>  
> question:
>  
> Why is filter.count() different every time?
> if I remove dropDuplicates() everything will be ok!!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25420) Dataset.count() every time is different.

2018-09-13 Thread huanghuai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huanghuai updated SPARK-25420:
--
Priority: Major  (was: Trivial)

> Dataset.count()  every time is different.
> -
>
> Key: SPARK-25420
> URL: https://issues.apache.org/jira/browse/SPARK-25420
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: spark2.3
> standalone
>Reporter: huanghuai
>Priority: Major
>  Labels: SQL
>
> Dataset dataset = sparkSession.read().format("csv").option("sep", 
> ",").option("inferSchema", "true")
>  .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true")
>  .option("encoding", "UTF-8")
>  .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count="+dataset.count());
> Dataset dropDuplicates = dataset.dropDuplicates(new 
> String[]\{"DATE","TIME","VEL","COMPANY"});
> System.out.println("dropDuplicates count1="+dropDuplicates.count());
> System.out.println("dropDuplicates count2="+dropDuplicates.count());
> Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 
> and (status = 0 or status = 1)");
> System.out.println("filter count1="+filter.count());
> System.out.println("filter count2="+filter.count());
> System.out.println("filter count3="+filter.count());
> System.out.println("filter count4="+filter.count());
> System.out.println("filter count5="+filter.count());
>  
>  
> --The above is code 
> ---
>  
>  
> console output:
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>  
> question:
>  
> Why is filter.count() different every time?
> if I remove dropDuplicates() everything will be ok!!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25420) Dataset.count() every time is different.

2018-09-13 Thread huanghuai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huanghuai updated SPARK-25420:
--
Issue Type: Question  (was: Bug)

> Dataset.count()  every time is different.
> -
>
> Key: SPARK-25420
> URL: https://issues.apache.org/jira/browse/SPARK-25420
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: spark2.3
> standalone
>Reporter: huanghuai
>Priority: Major
>  Labels: SQL
>
> Dataset dataset = sparkSession.read().format("csv").option("sep", 
> ",").option("inferSchema", "true")
>  .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true")
>  .option("encoding", "UTF-8")
>  .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count="+dataset.count());
> Dataset dropDuplicates = dataset.dropDuplicates(new 
> String[]\{"DATE","TIME","VEL","COMPANY"});
> System.out.println("dropDuplicates count1="+dropDuplicates.count());
> System.out.println("dropDuplicates count2="+dropDuplicates.count());
> Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 
> and (status = 0 or status = 1)");
> System.out.println("filter count1="+filter.count());
> System.out.println("filter count2="+filter.count());
> System.out.println("filter count3="+filter.count());
> System.out.println("filter count4="+filter.count());
> System.out.println("filter count5="+filter.count());
>  
>  
> --The above is code 
> ---
>  
>  
> console output:
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>  
> question:
>  
> Why is filter.count() different every time?
> if I remove dropDuplicates() everything will be ok!!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25404) Staging path may not be in the expected place when table path contains the stagingDir string

2018-09-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613369#comment-16613369
 ] 

Apache Spark commented on SPARK-25404:
--

User 'fjh100456' has created a pull request for this issue:
https://github.com/apache/spark/pull/22412

> Staging path may not be in the expected place when table path contains the 
> stagingDir string
> -
>
> Key: SPARK-25404
> URL: https://issues.apache.org/jira/browse/SPARK-25404
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jinhua Fu
>Priority: Minor
>
> Consider the following scenario:
>  
> {code:java}
> SET hive.exec.stagingdir=temp;
> CREATE TABLE tempTableA(key int)  location '/spark/temp/tempTableA';
> INSERT OVERWRITE TABLE tempTableA SELECT 1;
> {code}
> We expect the staging path to be under the table path, such as 
> '/spark/temp/tempTableA/.hive-stagingXXX' (SPARK-20594), but actually it is 
> '/spark/tempXXX'.
> I'm not quite sure why the 'if ... else ...' is used when computing the 
> stagingDir, but it may be the cause of this bug.
>  
> {code:java}
> // SaveAsHiveFile.scala
> private def getStagingDir(
> inputPath: Path,
> hadoopConf: Configuration,
> stagingDir: String): Path = {
>   ..
>   var stagingPathName: String =
>   if (inputPathName.indexOf(stagingDir) == -1) {
> new Path(inputPathName, stagingDir).toString
>   } else {
> // The 'indexOf' may not get expected position, and this may be the cause 
> of this bug.
> inputPathName.substring(0, inputPathName.indexOf(stagingDir) + 
> stagingDir.length)
>   }
>   ..
> }
> {code}
>  
>  
>  
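
To see why the {{indexOf}} branch misbehaves in the example above, here is a 
standalone sketch of the same substring logic (values copied from the scenario; 
this is not Spark code):
{code:java}
object StagingDirIndexOfSketch {
  def main(args: Array[String]): Unit = {
    val stagingDir = "temp"
    val inputPathName = "/spark/temp/tempTableA"

    // Mirrors the else-branch of getStagingDir: indexOf matches the "temp" in
    // "/spark/temp", so the staging prefix ends up outside the table directory.
    val stagingPathName =
      if (inputPathName.indexOf(stagingDir) == -1) {
        s"$inputPathName/$stagingDir"
      } else {
        inputPathName.substring(0, inputPathName.indexOf(stagingDir) + stagingDir.length)
      }

    println(stagingPathName)  // prints /spark/temp
  }
}
{code}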



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25404) Staging path may not be in the expected place when table path contains the stagingDir string

2018-09-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25404:


Assignee: (was: Apache Spark)

> Staging path may not be in the expected place when table path contains the 
> stagingDir string
> -
>
> Key: SPARK-25404
> URL: https://issues.apache.org/jira/browse/SPARK-25404
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jinhua Fu
>Priority: Minor
>
> Consider the following scenario:
>  
> {code:java}
> SET hive.exec.stagingdir=temp;
> CREATE TABLE tempTableA(key int)  location '/spark/temp/tempTableA';
> INSERT OVERWRITE TABLE tempTableA SELECT 1;
> {code}
> We expect the staging path to be under the table path, such as 
> '/spark/temp/tempTableA/.hive-stagingXXX' (SPARK-20594), but actually it is 
> '/spark/tempXXX'.
> I'm not quite sure why the 'if ... else ...' is used when computing the 
> stagingDir, but it may be the cause of this bug.
>  
> {code:java}
> // SaveAsHiveFile.scala
> private def getStagingDir(
> inputPath: Path,
> hadoopConf: Configuration,
> stagingDir: String): Path = {
>   ..
>   var stagingPathName: String =
>   if (inputPathName.indexOf(stagingDir) == -1) {
> new Path(inputPathName, stagingDir).toString
>   } else {
> // The 'indexOf' may not get expected position, and this may be the cause 
> of this bug.
> inputPathName.substring(0, inputPathName.indexOf(stagingDir) + 
> stagingDir.length)
>   }
>   ..
> }
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25404) Staging path may not be in the expected place when table path contains the stagingDir string

2018-09-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25404:


Assignee: Apache Spark

> Staging path may not be in the expected place when table path contains the 
> stagingDir string
> -
>
> Key: SPARK-25404
> URL: https://issues.apache.org/jira/browse/SPARK-25404
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jinhua Fu
>Assignee: Apache Spark
>Priority: Minor
>
> Consider the following scenario:
>  
> {code:java}
> SET hive.exec.stagingdir=temp;
> CREATE TABLE tempTableA(key int)  location '/spark/temp/tempTableA';
> INSERT OVERWRITE TABLE tempTableA SELECT 1;
> {code}
> We expect the staging path to be under the table path, such as 
> '/spark/temp/tempTableA/.hive-stagingXXX' (SPARK-20594), but actually it is 
> '/spark/tempXXX'.
> I'm not quite sure why the 'if ... else ...' is used when computing the 
> stagingDir, but it may be the cause of this bug.
>  
> {code:java}
> // SaveAsHiveFile.scala
> private def getStagingDir(
> inputPath: Path,
> hadoopConf: Configuration,
> stagingDir: String): Path = {
>   ..
>   var stagingPathName: String =
>   if (inputPathName.indexOf(stagingDir) == -1) {
> new Path(inputPathName, stagingDir).toString
>   } else {
> // The 'indexOf' may not get expected position, and this may be the cause 
> of this bug.
> inputPathName.substring(0, inputPathName.indexOf(stagingDir) + 
> stagingDir.length)
>   }
>   ..
> }
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25404) Staging path may not be in the expected place when table path contains the stagingDir string

2018-09-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613368#comment-16613368
 ] 

Apache Spark commented on SPARK-25404:
--

User 'fjh100456' has created a pull request for this issue:
https://github.com/apache/spark/pull/22412

> Staging path may not be in the expected place when table path contains the 
> stagingDir string
> -
>
> Key: SPARK-25404
> URL: https://issues.apache.org/jira/browse/SPARK-25404
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jinhua Fu
>Priority: Minor
>
> Consider the following scenario:
>  
> {code:java}
> SET hive.exec.stagingdir=temp;
> CREATE TABLE tempTableA(key int)  location '/spark/temp/tempTableA';
> INSERT OVERWRITE TABLE tempTableA SELECT 1;
> {code}
> We expect the staging path to be under the table path, such as 
> '/spark/temp/tempTableA/.hive-stagingXXX' (SPARK-20594), but actually it is 
> '/spark/tempXXX'.
> I'm not quite sure why the 'if ... else ...' is used when computing the 
> stagingDir, but it may be the cause of this bug.
>  
> {code:java}
> // SaveAsHiveFile.scala
> private def getStagingDir(
> inputPath: Path,
> hadoopConf: Configuration,
> stagingDir: String): Path = {
>   ..
>   var stagingPathName: String =
>   if (inputPathName.indexOf(stagingDir) == -1) {
> new Path(inputPathName, stagingDir).toString
>   } else {
> // The 'indexOf' may not get expected position, and this may be the cause 
> of this bug.
> inputPathName.substring(0, inputPathName.indexOf(stagingDir) + 
> stagingDir.length)
>   }
>   ..
> }
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25421) Abstract an output path field in trait DataWritingCommand

2018-09-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25421:


Assignee: Apache Spark

> Abstract an output path field in trait DataWritingCommand
> -
>
> Key: SPARK-25421
> URL: https://issues.apache.org/jira/browse/SPARK-25421
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Lantao Jin
>Assignee: Apache Spark
>Priority: Major
>
> [SPARK-25357|https://issues.apache.org/jira/browse/SPARK-25357] introduced a 
> metadata field in {{SparkPlanInfo}} that can expose the input location for 
> reads. Correspondingly, we need to add a field in {{DataWritingCommand}} for 
> the output path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25421) Abstract an output path field in trait DataWritingCommand

2018-09-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613327#comment-16613327
 ] 

Apache Spark commented on SPARK-25421:
--

User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22411

> Abstract an output path field in trait DataWritingCommand
> -
>
> Key: SPARK-25421
> URL: https://issues.apache.org/jira/browse/SPARK-25421
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Lantao Jin
>Priority: Major
>
> [SPARK-25357|https://issues.apache.org/jira/browse/SPARK-25357] introduced a 
> metadata field in {{SparkPlanInfo}} that can expose the input location for 
> reads. Correspondingly, we need to add a field in {{DataWritingCommand}} for 
> the output path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25421) Abstract an output path field in trait DataWritingCommand

2018-09-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613328#comment-16613328
 ] 

Apache Spark commented on SPARK-25421:
--

User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22411

> Abstract an output path field in trait DataWritingCommand
> -
>
> Key: SPARK-25421
> URL: https://issues.apache.org/jira/browse/SPARK-25421
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Lantao Jin
>Priority: Major
>
> [SPARK-25357|https://issues.apache.org/jira/browse/SPARK-25357] introduced a 
> metadata field in {{SparkPlanInfo}} that can expose the input location for 
> reads. Correspondingly, we need to add a field in {{DataWritingCommand}} for 
> the output path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25421) Abstract an output path field in trait DataWritingCommand

2018-09-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25421:


Assignee: (was: Apache Spark)

> Abstract an output path field in trait DataWritingCommand
> -
>
> Key: SPARK-25421
> URL: https://issues.apache.org/jira/browse/SPARK-25421
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Lantao Jin
>Priority: Major
>
> [SPARK-25357|https://issues.apache.org/jira/browse/SPARK-25357] introduced a 
> metadata field in {{SparkPlanInfo}} that can expose the input location for 
> reads. Correspondingly, we need to add a field in {{DataWritingCommand}} for 
> the output path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25421) Abstract an output path field in trait DataWritingCommand

2018-09-13 Thread Lantao Jin (JIRA)
Lantao Jin created SPARK-25421:
--

 Summary: Abstract an output path field in trait DataWritingCommand
 Key: SPARK-25421
 URL: https://issues.apache.org/jira/browse/SPARK-25421
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Lantao Jin


[SPARK-25357|https://issues.apache.org/jira/browse/SPARK-25357] introduced a 
metadata field in {{SparkPlanInfo}} so it can dump the input location for reads. 
Correspondingly, we need to add a field in {{DataWritingCommand}} for the output 
path.
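
As a rough illustration of what this ticket proposes, here is a hypothetical 
Scala sketch; the member name and its type are assumptions for discussion, not 
the actual Spark API.

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

trait DataWritingCommand {
  // Existing members elided; the plan being written is already part of the trait.
  def query: LogicalPlan

  // Proposed (hypothetical) field: expose the write location, mirroring the
  // read-side metadata that SPARK-25357 added to SparkPlanInfo.
  def outputPath: Option[String]
}
{code}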



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25420) Dataset.count() every time is different.

2018-09-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613217#comment-16613217
 ] 

Marco Gaido commented on SPARK-25420:
-

I think the reason here is that, since we don't enforce any sorting on the 
incoming Dataset and we keep the first row among those with the same 
deduplication columns, the row chosen for each group by dropDuplicates is 
arbitrary. So this doesn't really seem to be a bug in itself; the output of 
dropDuplicates is simply non-deterministic unless a specific ordering is 
enforced on the input.
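
A minimal Scala sketch of one way to make the deduplication deterministic is 
below; the column names come from the report, while the window ordering and the 
use of Scala instead of the reporter's Java are assumptions for illustration.

{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// `dataset` stands for the DataFrame read from the CSV in the report.
// Keep one well-defined row per (DATE, TIME, VEL, COMPANY) group instead of the
// arbitrary first row that dropDuplicates retains.
val w = Window.partitionBy("DATE", "TIME", "VEL", "COMPANY")
  .orderBy(col("jd"), col("wd"))
val deduped = dataset
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")
// deduped.count() is now stable across repeated actions because the kept row is fixed.
{code}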

> Dataset.count()  every time is different.
> -
>
> Key: SPARK-25420
> URL: https://issues.apache.org/jira/browse/SPARK-25420
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: spark2.3
> standalone
>Reporter: huanghuai
>Priority: Major
>  Labels: SQL
>
> Dataset dataset = sparkSession.read().format("csv").option("sep", 
> ",").option("inferSchema", "true")
>  .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true")
>  .option("encoding", "UTF-8")
>  .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count="+dataset.count());
> Dataset dropDuplicates = dataset.dropDuplicates(new 
> String[]\{"DATE","TIME","VEL","COMPANY"});
> System.out.println("dropDuplicates count1="+dropDuplicates.count());
> System.out.println("dropDuplicates count2="+dropDuplicates.count());
> Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 
> and (status = 0 or status = 1)");
> System.out.println("filter count1="+filter.count());
> System.out.println("filter count2="+filter.count());
> System.out.println("filter count3="+filter.count());
> System.out.println("filter count4="+filter.count());
> System.out.println("filter count5="+filter.count());
>  
>  
> --The above is code 
> ---
>  
>  
> console output:
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>  
> question:
>  
> Why is filter.count() different every time?
> If I remove dropDuplicates(), everything will be OK!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25420) Dataset.count() every time is different.

2018-09-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613209#comment-16613209
 ] 

Marco Gaido edited comment on SPARK-25420 at 9/13/18 8:51 AM:
--

Please do not use Critical/Blocker as they are reserved for committers.


was (Author: mgaido):
Please do not use Critical/Blocker as they are reserved for committers. Anyway, 
this seems to be a correctness issue, so I'd agree with raising the priority.

> Dataset.count()  every time is different.
> -
>
> Key: SPARK-25420
> URL: https://issues.apache.org/jira/browse/SPARK-25420
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: spark2.3
> standalone
>Reporter: huanghuai
>Priority: Major
>  Labels: SQL
>
> Dataset dataset = sparkSession.read().format("csv").option("sep", 
> ",").option("inferSchema", "true")
>  .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true")
>  .option("encoding", "UTF-8")
>  .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count="+dataset.count());
> Dataset dropDuplicates = dataset.dropDuplicates(new 
> String[]\{"DATE","TIME","VEL","COMPANY"});
> System.out.println("dropDuplicates count1="+dropDuplicates.count());
> System.out.println("dropDuplicates count2="+dropDuplicates.count());
> Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 
> and (status = 0 or status = 1)");
> System.out.println("filter count1="+filter.count());
> System.out.println("filter count2="+filter.count());
> System.out.println("filter count3="+filter.count());
> System.out.println("filter count4="+filter.count());
> System.out.println("filter count5="+filter.count());
>  
>  
> --The above is code 
> ---
>  
>  
> console output:
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>  
> question:
>  
> Why is filter.count() different every time?
> If I remove dropDuplicates(), everything will be OK!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25420) Dataset.count() every time is different.

2018-09-13 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-25420:

Labels: SQL  (was: SQL correctness)

> Dataset.count()  every time is different.
> -
>
> Key: SPARK-25420
> URL: https://issues.apache.org/jira/browse/SPARK-25420
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: spark2.3
> standalone
>Reporter: huanghuai
>Priority: Major
>  Labels: SQL
>
> Dataset dataset = sparkSession.read().format("csv").option("sep", 
> ",").option("inferSchema", "true")
>  .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true")
>  .option("encoding", "UTF-8")
>  .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count="+dataset.count());
> Dataset dropDuplicates = dataset.dropDuplicates(new 
> String[]\{"DATE","TIME","VEL","COMPANY"});
> System.out.println("dropDuplicates count1="+dropDuplicates.count());
> System.out.println("dropDuplicates count2="+dropDuplicates.count());
> Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 
> and (status = 0 or status = 1)");
> System.out.println("filter count1="+filter.count());
> System.out.println("filter count2="+filter.count());
> System.out.println("filter count3="+filter.count());
> System.out.println("filter count4="+filter.count());
> System.out.println("filter count5="+filter.count());
>  
>  
> --The above is code 
> ---
>  
>  
> console output:
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>  
> question:
>  
> Why is filter.count() different every time?
> If I remove dropDuplicates(), everything will be OK!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25420) Dataset.count() every time is different.

2018-09-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613209#comment-16613209
 ] 

Marco Gaido commented on SPARK-25420:
-

Please do not use Critical/Blocker as they are reserved for committers. Anyway, 
this seems to be a correctness issue, so I'd agree with raising the priority.

> Dataset.count()  every time is different.
> -
>
> Key: SPARK-25420
> URL: https://issues.apache.org/jira/browse/SPARK-25420
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: spark2.3
> standalone
>Reporter: huanghuai
>Priority: Major
>  Labels: SQL, correctness
>
> Dataset dataset = sparkSession.read().format("csv").option("sep", 
> ",").option("inferSchema", "true")
>  .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true")
>  .option("encoding", "UTF-8")
>  .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count="+dataset.count());
> Dataset dropDuplicates = dataset.dropDuplicates(new 
> String[]\{"DATE","TIME","VEL","COMPANY"});
> System.out.println("dropDuplicates count1="+dropDuplicates.count());
> System.out.println("dropDuplicates count2="+dropDuplicates.count());
> Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 
> and (status = 0 or status = 1)");
> System.out.println("filter count1="+filter.count());
> System.out.println("filter count2="+filter.count());
> System.out.println("filter count3="+filter.count());
> System.out.println("filter count4="+filter.count());
> System.out.println("filter count5="+filter.count());
>  
>  
> --The above is code 
> ---
>  
>  
> console output:
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>  
> question:
>  
> Why is filter.count() different every time?
> If I remove dropDuplicates(), everything will be OK!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25420) Dataset.count() every time is different.

2018-09-13 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-25420:

Labels: SQL correctness  (was: )

> Dataset.count()  every time is different.
> -
>
> Key: SPARK-25420
> URL: https://issues.apache.org/jira/browse/SPARK-25420
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: spark2.3
> standalone
>Reporter: huanghuai
>Priority: Major
>  Labels: SQL, correctness
>
> Dataset dataset = sparkSession.read().format("csv").option("sep", 
> ",").option("inferSchema", "true")
>  .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true")
>  .option("encoding", "UTF-8")
>  .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count="+dataset.count());
> Dataset dropDuplicates = dataset.dropDuplicates(new 
> String[]\{"DATE","TIME","VEL","COMPANY"});
> System.out.println("dropDuplicates count1="+dropDuplicates.count());
> System.out.println("dropDuplicates count2="+dropDuplicates.count());
> Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 
> and (status = 0 or status = 1)");
> System.out.println("filter count1="+filter.count());
> System.out.println("filter count2="+filter.count());
> System.out.println("filter count3="+filter.count());
> System.out.println("filter count4="+filter.count());
> System.out.println("filter count5="+filter.count());
>  
>  
> --The above is code 
> ---
>  
>  
> console output:
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>  
> question:
>  
> Why is filter.count() different every time?
> If I remove dropDuplicates(), everything will be OK!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25420) Dataset.count() every time is different.

2018-09-13 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-25420:

Priority: Major  (was: Critical)

> Dataset.count()  every time is different.
> -
>
> Key: SPARK-25420
> URL: https://issues.apache.org/jira/browse/SPARK-25420
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: spark2.3
> standalone
>Reporter: huanghuai
>Priority: Major
>
> Dataset dataset = sparkSession.read().format("csv").option("sep", 
> ",").option("inferSchema", "true")
>  .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true")
>  .option("encoding", "UTF-8")
>  .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count="+dataset.count());
> Dataset dropDuplicates = dataset.dropDuplicates(new 
> String[]\{"DATE","TIME","VEL","COMPANY"});
> System.out.println("dropDuplicates count1="+dropDuplicates.count());
> System.out.println("dropDuplicates count2="+dropDuplicates.count());
> Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 
> and (status = 0 or status = 1)");
> System.out.println("filter count1="+filter.count());
> System.out.println("filter count2="+filter.count());
> System.out.println("filter count3="+filter.count());
> System.out.println("filter count4="+filter.count());
> System.out.println("filter count5="+filter.count());
>  
>  
> --The above is code 
> ---
>  
>  
> console output:
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>  
> question:
>  
> Why is filter.count() different every time?
> If I remove dropDuplicates(), everything will be OK!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25420) Dataset.count() every time is different.

2018-09-13 Thread huanghuai (JIRA)
huanghuai created SPARK-25420:
-

 Summary: Dataset.count()  every time is different.
 Key: SPARK-25420
 URL: https://issues.apache.org/jira/browse/SPARK-25420
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
 Environment: spark2.3

standalone
Reporter: huanghuai


Dataset<Row> dataset = sparkSession.read().format("csv").option("sep", 
",").option("inferSchema", "true")
 .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true")
 .option("encoding", "UTF-8")
 .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");

System.out.println("source count="+dataset.count());


Dataset<Row> dropDuplicates = dataset.dropDuplicates(new 
String[]{"DATE","TIME","VEL","COMPANY"});
System.out.println("dropDuplicates count1="+dropDuplicates.count());
System.out.println("dropDuplicates count2="+dropDuplicates.count());


Dataset<Row> filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 and 
(status = 0 or status = 1)");

System.out.println("filter count1="+filter.count());
System.out.println("filter count2="+filter.count());
System.out.println("filter count3="+filter.count());
System.out.println("filter count4="+filter.count());
System.out.println("filter count5="+filter.count());

 

 

--The above is code 
---

 
 
console output:

source count=459275
dropDuplicates count1=453987
dropDuplicates count2=453987
filter count1=445798
filter count2=445797
filter count3=445797
filter count4=445798
filter count5=445799

 

question:
 
Why is filter.count() different every time?

If I remove dropDuplicates(), everything will be OK!

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4

2018-09-13 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613157#comment-16613157
 ] 

Liang-Chi Hsieh edited comment on SPARK-25378 at 9/13/18 8:33 AM:
--

I think a quick fix is to use the general `get` method just for `StringType` in 
`InternalRow.getAccessor`. This preserves the backward-compatible behavior for 
`StringType` when calling `toArray`.

And we may consider correcting it back to `getUTF8String` by 3.0.

WDYT? [~mengxr], [~cloud_fan], [~hyukjin.kwon].


was (Author: viirya):
I think a quick fix is to use general `get` method for just `StringType` in 
`InternalRow.getAccessor`. This can allow the backward-compatible behavior for 
`StringType` when calling `toArray`.

And we may consider to correct to `getUTF8String` by 3.0.

WDYT? [~mengxr], [~cloud_fan], [~hyukjin.kwon].

> ArrayData.toArray(StringType) assume UTF8String in 2.4
> --
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]
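
For illustration only, a minimal Scala sketch of the accessor change proposed in 
the comment above; the exact shape of {{InternalRow.getAccessor}} is inferred 
from the stack trace rather than checked against the 2.4 source, so treat this 
as a sketch of the idea, not the actual patch.

{code:scala}
import org.apache.spark.sql.catalyst.expressions.SpecializedGetters
import org.apache.spark.sql.types._

// Hypothetical accessor: fall back to the untyped `get` for StringType so a
// GenericArrayData holding java.lang.String still works via toArray[String](StringType).
def getAccessor(dataType: DataType): (SpecializedGetters, Int) => Any = dataType match {
  case StringType  => (input, ordinal) => input.get(ordinal, StringType)  // generic get, not getUTF8String
  case IntegerType => (input, ordinal) => input.getInt(ordinal)
  case _           => (input, ordinal) => input.get(ordinal, dataType)
}
{code}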



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25412) FeatureHasher would change the value of output feature

2018-09-13 Thread Vincent (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613181#comment-16613181
 ] 

Vincent commented on SPARK-25412:
-

Thanks, Nick, for the reply.

So the trade-off is between a highly sparse vector (from increasing the 
numFeatures size) and the risk of losing certain features to conflicting hash 
values (since changing the value/meaning of those features effectively makes 
them useless), correct?

> FeatureHasher would change the value of output feature
> --
>
> Key: SPARK-25412
> URL: https://issues.apache.org/jira/browse/SPARK-25412
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Vincent
>Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo on 
> the hashed value is used to determine the vector index, and it is suggested to 
> use a large integer value as the numFeatures parameter.
> We found several issues with the current implementation: 
>  # The feature name cannot be recovered from its index after the FeatureHasher 
> transform, for example when getting feature importances from decision tree 
> training followed by a FeatureHasher.
>  # When indices conflict, which is very likely to happen when 'numFeatures' is 
> relatively small, the feature value is changed to a new value (the sum of the 
> current and old values).
>  # To avoid collisions, 'numFeatures' must be set to a large number, but a 
> highly sparse vector increases the computation complexity of model training.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-24538) ByteArrayDecimalType support push down to parquet data sources

2018-09-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-24538:
--

> ByteArrayDecimalType support push down to parquet data sources
> --
>
> Key: SPARK-24538
> URL: https://issues.apache.org/jira/browse/SPARK-24538
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> The latest Parquet supports decimal type statistics, so we can push decimal 
> filters down to the data sources:
> {noformat}
> LM-SHC-16502798:parquet-mr yumwang$ java -jar 
> ./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta 
> /tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> file:         
> file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> creator:      parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:        org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"d1","type":"decimal(9,0)","nullable":true,"metadata":{}},{"name":"d2","type":"decimal(9,2)","nullable":true,"metadata":{}},{"name":"d3","type":"decimal(18,0)","nullable":true,"metadata":{}},{"name":"d4","type":"decimal(18,4)","nullable":true,"metadata":{}},{"name":"d5","type":"decimal(38,0)","nullable":true,"metadata":{}},{"name":"d6","type":"decimal(38,18)","nullable":true,"metadata":{}}]}
> file schema:  spark_schema
> 
> id:           REQUIRED INT64 R:0 D:0
> d1:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d2:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d3:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d4:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d5:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> d6:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1:  RC:241867 TS:15480513 OFFSET:4
> 
> id:            INT64 SNAPPY DO:0 FPO:4 SZ:968154/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:968158 SZ:967555/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:1935713 SZ:967558/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0.00, max: 241866.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:2903271 SZ:968866/1935047/2.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:3872137 SZ:1247007/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0., max: 241866., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5119144 
> SZ:1266850/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 
> 241866, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:6385994 
> SZ:2198910/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0E-18, 
> max: 241866.00, num_nulls: 0]
> row group 2:  RC:241867 TS:15480513 OFFSET:8584904
> 
> id:            INT64 SNAPPY DO:0 FPO:8584904 SZ:968131/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:9553035 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:10520598 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867.00, max: 483733.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:11488161 SZ:968110/1935047/2.00 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:12456271 SZ:1247071/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867., max: 483733., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:13703342 
> SZ:1270587/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, 
> max: 483733, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:14973929 
> SZ:2197306/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 
> 241867.00, max: 483733.00, num_nulls: 
> 0]{noformat}
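
To make the intent concrete, here is a hedged Scala sketch (the path and column 
name are assumptions) of how one could check whether a decimal predicate reaches 
the Parquet reader once the statistics are available:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Write a small decimal column, then look for it in PushedFilters in the plan.
spark.range(100)
  .select($"id".cast("decimal(9,2)").as("d2"))
  .write.mode("overwrite").parquet("/tmp/spark/parquet/decimal-demo")

spark.read.parquet("/tmp/spark/parquet/decimal-demo")
  .filter($"d2" > 10)
  .explain()
// Expect something like PushedFilters: [IsNotNull(d2), GreaterThan(d2,10.00)]
// when decimal push-down is supported; otherwise the predicate is only applied in Spark.
{code}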



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-24538) ByteArrayDecimalType support push down to parquet data sources

2018-09-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24538.
--
Resolution: Duplicate

> ByteArrayDecimalType support push down to parquet data sources
> --
>
> Key: SPARK-24538
> URL: https://issues.apache.org/jira/browse/SPARK-24538
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> The latest Parquet supports decimal type statistics, so we can push decimal 
> filters down to the data sources:
> {noformat}
> LM-SHC-16502798:parquet-mr yumwang$ java -jar 
> ./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta 
> /tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> file:         
> file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> creator:      parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:        org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"d1","type":"decimal(9,0)","nullable":true,"metadata":{}},{"name":"d2","type":"decimal(9,2)","nullable":true,"metadata":{}},{"name":"d3","type":"decimal(18,0)","nullable":true,"metadata":{}},{"name":"d4","type":"decimal(18,4)","nullable":true,"metadata":{}},{"name":"d5","type":"decimal(38,0)","nullable":true,"metadata":{}},{"name":"d6","type":"decimal(38,18)","nullable":true,"metadata":{}}]}
> file schema:  spark_schema
> 
> id:           REQUIRED INT64 R:0 D:0
> d1:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d2:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d3:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d4:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d5:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> d6:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1:  RC:241867 TS:15480513 OFFSET:4
> 
> id:            INT64 SNAPPY DO:0 FPO:4 SZ:968154/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:968158 SZ:967555/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:1935713 SZ:967558/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0.00, max: 241866.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:2903271 SZ:968866/1935047/2.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:3872137 SZ:1247007/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0., max: 241866., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5119144 
> SZ:1266850/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 
> 241866, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:6385994 
> SZ:2198910/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0E-18, 
> max: 241866.00, num_nulls: 0]
> row group 2:  RC:241867 TS:15480513 OFFSET:8584904
> 
> id:            INT64 SNAPPY DO:0 FPO:8584904 SZ:968131/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:9553035 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:10520598 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867.00, max: 483733.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:11488161 SZ:968110/1935047/2.00 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:12456271 SZ:1247071/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867., max: 483733., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:13703342 
> SZ:1270587/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, 
> max: 483733, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:14973929 
> SZ:2197306/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 
> 241867.00, max: 483733.00, num_nulls: 
> 0]{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-24549) Support DecimalType push down to the parquet data sources

2018-09-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24549.
--
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/21556

> Support DecimalType push down to the parquet data sources
> -
>
> Key: SPARK-24549
> URL: https://issues.apache.org/jira/browse/SPARK-24549
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-24549) Support DecimalType push down to the parquet data sources

2018-09-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-24549:
--

> Support DecimalType push down to the parquet data sources
> -
>
> Key: SPARK-24549
> URL: https://issues.apache.org/jira/browse/SPARK-24549
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24538) ByteArrayDecimalType support push down to parquet data sources

2018-09-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24538:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-25419

> ByteArrayDecimalType support push down to parquet data sources
> --
>
> Key: SPARK-24538
> URL: https://issues.apache.org/jira/browse/SPARK-24538
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> The latest Parquet supports decimal type statistics, so we can push decimal 
> filters down to the data sources:
> {noformat}
> LM-SHC-16502798:parquet-mr yumwang$ java -jar 
> ./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta 
> /tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> file:         
> file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> creator:      parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:        org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"d1","type":"decimal(9,0)","nullable":true,"metadata":{}},{"name":"d2","type":"decimal(9,2)","nullable":true,"metadata":{}},{"name":"d3","type":"decimal(18,0)","nullable":true,"metadata":{}},{"name":"d4","type":"decimal(18,4)","nullable":true,"metadata":{}},{"name":"d5","type":"decimal(38,0)","nullable":true,"metadata":{}},{"name":"d6","type":"decimal(38,18)","nullable":true,"metadata":{}}]}
> file schema:  spark_schema
> 
> id:           REQUIRED INT64 R:0 D:0
> d1:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d2:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d3:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d4:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d5:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> d6:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1:  RC:241867 TS:15480513 OFFSET:4
> 
> id:            INT64 SNAPPY DO:0 FPO:4 SZ:968154/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:968158 SZ:967555/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:1935713 SZ:967558/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0.00, max: 241866.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:2903271 SZ:968866/1935047/2.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:3872137 SZ:1247007/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0., max: 241866., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5119144 
> SZ:1266850/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 
> 241866, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:6385994 
> SZ:2198910/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0E-18, 
> max: 241866.00, num_nulls: 0]
> row group 2:  RC:241867 TS:15480513 OFFSET:8584904
> 
> id:            INT64 SNAPPY DO:0 FPO:8584904 SZ:968131/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:9553035 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:10520598 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867.00, max: 483733.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:11488161 SZ:968110/1935047/2.00 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:12456271 SZ:1247071/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867., max: 483733., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:13703342 
> SZ:1270587/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, 
> max: 483733, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:14973929 
> SZ:2197306/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 
> 241867.00, max: 483733.00, num_nulls: 
> 0]{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24538) ByteArrayDecimalType support push down to parquet data sources

2018-09-13 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613170#comment-16613170
 ] 

Yuming Wang commented on SPARK-24538:
-

[~cloud_fan] Could you please update this ticket to *Duplicate* and update 
SPARK-24549 to *Fixed*.

> ByteArrayDecimalType support push down to parquet data sources
> --
>
> Key: SPARK-24538
> URL: https://issues.apache.org/jira/browse/SPARK-24538
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> The latest Parquet supports decimal type statistics, so we can push decimal 
> filters down to the data sources:
> {noformat}
> LM-SHC-16502798:parquet-mr yumwang$ java -jar 
> ./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta 
> /tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> file:         
> file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> creator:      parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:        org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"d1","type":"decimal(9,0)","nullable":true,"metadata":{}},{"name":"d2","type":"decimal(9,2)","nullable":true,"metadata":{}},{"name":"d3","type":"decimal(18,0)","nullable":true,"metadata":{}},{"name":"d4","type":"decimal(18,4)","nullable":true,"metadata":{}},{"name":"d5","type":"decimal(38,0)","nullable":true,"metadata":{}},{"name":"d6","type":"decimal(38,18)","nullable":true,"metadata":{}}]}
> file schema:  spark_schema
> 
> id:           REQUIRED INT64 R:0 D:0
> d1:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d2:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d3:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d4:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d5:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> d6:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1:  RC:241867 TS:15480513 OFFSET:4
> 
> id:            INT64 SNAPPY DO:0 FPO:4 SZ:968154/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:968158 SZ:967555/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:1935713 SZ:967558/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0.00, max: 241866.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:2903271 SZ:968866/1935047/2.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:3872137 SZ:1247007/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0., max: 241866., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5119144 
> SZ:1266850/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 
> 241866, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:6385994 
> SZ:2198910/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0E-18, 
> max: 241866.00, num_nulls: 0]
> row group 2:  RC:241867 TS:15480513 OFFSET:8584904
> 
> id:            INT64 SNAPPY DO:0 FPO:8584904 SZ:968131/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:9553035 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:10520598 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867.00, max: 483733.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:11488161 SZ:968110/1935047/2.00 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:12456271 SZ:1247071/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867., max: 483733., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:13703342 
> SZ:1270587/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, 
> max: 483733, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:14973929 
> SZ:2197306/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 
> 241867.00, max: 483733.00, num_nulls: 
> 0]{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SPARK-24538) ByteArrayDecimalType support push down to parquet data sources

2018-09-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24538:

Issue Type: Improvement  (was: Sub-task)
Parent: (was: SPARK-25419)

> ByteArrayDecimalType support push down to parquet data sources
> --
>
> Key: SPARK-24538
> URL: https://issues.apache.org/jira/browse/SPARK-24538
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> The latest Parquet supports decimal type statistics, so we can push decimal 
> filters down to the data sources:
> {noformat}
> LM-SHC-16502798:parquet-mr yumwang$ java -jar 
> ./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta 
> /tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> file:         
> file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> creator:      parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:        org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"d1","type":"decimal(9,0)","nullable":true,"metadata":{}},{"name":"d2","type":"decimal(9,2)","nullable":true,"metadata":{}},{"name":"d3","type":"decimal(18,0)","nullable":true,"metadata":{}},{"name":"d4","type":"decimal(18,4)","nullable":true,"metadata":{}},{"name":"d5","type":"decimal(38,0)","nullable":true,"metadata":{}},{"name":"d6","type":"decimal(38,18)","nullable":true,"metadata":{}}]}
> file schema:  spark_schema
> 
> id:           REQUIRED INT64 R:0 D:0
> d1:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d2:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d3:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d4:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d5:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> d6:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1:  RC:241867 TS:15480513 OFFSET:4
> 
> id:            INT64 SNAPPY DO:0 FPO:4 SZ:968154/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:968158 SZ:967555/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:1935713 SZ:967558/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0.00, max: 241866.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:2903271 SZ:968866/1935047/2.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:3872137 SZ:1247007/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0., max: 241866., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5119144 
> SZ:1266850/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 
> 241866, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:6385994 
> SZ:2198910/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0E-18, 
> max: 241866.00, num_nulls: 0]
> row group 2:  RC:241867 TS:15480513 OFFSET:8584904
> 
> id:            INT64 SNAPPY DO:0 FPO:8584904 SZ:968131/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:9553035 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:10520598 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867.00, max: 483733.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:11488161 SZ:968110/1935047/2.00 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:12456271 SZ:1247071/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867., max: 483733., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:13703342 
> SZ:1270587/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, 
> max: 483733, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:14973929 
> SZ:2197306/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 
> 241867.00, max: 483733.00, num_nulls: 
> 0]{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4

2018-09-13 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613157#comment-16613157
 ] 

Liang-Chi Hsieh edited comment on SPARK-25378 at 9/13/18 7:59 AM:
--

I think a quick fix is to use the general `get` method just for `StringType` in 
`InternalRow.getAccessor`. This preserves the backward-compatible behavior for 
`StringType` when calling `toArray`.

And we may consider correcting to `getUTF8String` by 3.0.

WDYT? [~mengxr], [~cloud_fan], [~hyukjin.kwon].


was (Author: viirya):
I think a quick fix is to use general `get` method for just `StringType` in 
`InternalRow.getAccessor`. This can allow the backward-compatible behavior for 
`StringType` when calling `toArray`.

And we may consider to correct to `getUTF8String` by 3.0.

WDYT? [~mengxr][~cloud_fan][~hyukjin.kwon]

> ArrayData.toArray(StringType) assume UTF8String in 2.4
> --
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25412) FeatureHasher would change the value of output feature

2018-09-13 Thread Nick Pentreath (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-25412.

Resolution: Not A Bug

> FeatureHasher would change the value of output feature
> --
>
> Key: SPARK-25412
> URL: https://issues.apache.org/jira/browse/SPARK-25412
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Vincent
>Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo on 
> the hashed value is used to determine the vector index, and it is suggested to 
> use a large integer value as the numFeatures parameter.
> We found several issues with the current implementation: 
>  # The feature name cannot be recovered from its index after the FeatureHasher 
> transform, for example when getting feature importances from decision tree 
> training followed by a FeatureHasher.
>  # When indices conflict, which is very likely to happen when 'numFeatures' is 
> relatively small, the feature value is changed to a new value (the sum of the 
> current and old values).
>  # To avoid collisions, 'numFeatures' must be set to a large number, but a 
> highly sparse vector increases the computation complexity of model training.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25412) FeatureHasher would change the value of output feature

2018-09-13 Thread Nick Pentreath (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613160#comment-16613160
 ] 

Nick Pentreath commented on SPARK-25412:


(1) is by design. Feature hashing does not store the exact mapping from feature 
values to vector indices and so is a one way transform. Hashing gives you speed 
and requires almost no memory, but you give up the reverse mapping and you have 
the potential for hash collisions.

(2) is again by design for now. There are ways to have the sign of the feature 
value be determined also as part of a hash function, and in expectation the 
collisions zero each other out. This may be added in future work.

The impact of hash collisions can be reduced by increasing the {{numFeatures}} 
parameter. The default is probably reasonable for small to medium feature 
dimensions but should probably be increased when working with very 
high-cardinality features.

 

I don't think this can be classed as a bug, as these are all design decisions 
and trade-offs of using feature hashing.
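
A short Scala sketch of the knob discussed above (the input column names are 
hypothetical): a larger hash space reduces the chance of index collisions at the 
cost of a sparser, higher-dimensional vector.

{code:scala}
import org.apache.spark.ml.feature.FeatureHasher

// `df` is assumed to be an existing DataFrame with columns "f1" ... "f4".
val hasher = new FeatureHasher()
  .setInputCols("f1", "f2", "f3", "f4")
  .setOutputCol("features")
  .setNumFeatures(1 << 20)  // default is 2^18; raise it for very high-cardinality inputs
val hashed = hasher.transform(df)
{code}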

> FeatureHasher would change the value of output feature
> --
>
> Key: SPARK-25412
> URL: https://issues.apache.org/jira/browse/SPARK-25412
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Vincent
>Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo on 
> the hashed value is used to determine the vector index, and it is suggested to 
> use a large integer value as the numFeatures parameter.
> We found several issues with the current implementation: 
>  # The feature name cannot be recovered from its index after the FeatureHasher 
> transform, for example when getting feature importances from decision tree 
> training followed by a FeatureHasher.
>  # When indices conflict, which is very likely to happen when 'numFeatures' is 
> relatively small, the feature value is changed to a new value (the sum of the 
> current and old values).
>  # To avoid collisions, 'numFeatures' must be set to a large number, but a 
> highly sparse vector increases the computation complexity of model training.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24549) Support DecimalType push down to the parquet data sources

2018-09-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24549:

Fix Version/s: 2.4.0

> Support DecimalType push down to the parquet data sources
> -
>
> Key: SPARK-24549
> URL: https://issues.apache.org/jira/browse/SPARK-24549
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4

2018-09-13 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613157#comment-16613157
 ] 

Liang-Chi Hsieh commented on SPARK-25378:
-

I think a quick fix is to use the general `get` method just for `StringType` in 
`InternalRow.getAccessor`. This preserves the backward-compatible behavior for 
`StringType` when calling `toArray`.

And we may consider correcting to `getUTF8String` by 3.0.

WDYT? [~mengxr][~cloud_fan][~hyukjin.kwon]

> ArrayData.toArray(StringType) assume UTF8String in 2.4
> --
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24538) ByteArrayDecimalType support push down to parquet data sources

2018-09-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24538:

Fix Version/s: (was: 2.4.0)

> ByteArrayDecimalType support push down to parquet data sources
> --
>
> Key: SPARK-24538
> URL: https://issues.apache.org/jira/browse/SPARK-24538
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> The latest Parquet supports decimal type statistics, so we can push decimal 
> filters down to the data sources:
> {noformat}
> LM-SHC-16502798:parquet-mr yumwang$ java -jar 
> ./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta 
> /tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> file:         
> file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> creator:      parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:        org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"d1","type":"decimal(9,0)","nullable":true,"metadata":{}},{"name":"d2","type":"decimal(9,2)","nullable":true,"metadata":{}},{"name":"d3","type":"decimal(18,0)","nullable":true,"metadata":{}},{"name":"d4","type":"decimal(18,4)","nullable":true,"metadata":{}},{"name":"d5","type":"decimal(38,0)","nullable":true,"metadata":{}},{"name":"d6","type":"decimal(38,18)","nullable":true,"metadata":{}}]}
> file schema:  spark_schema
> 
> id:           REQUIRED INT64 R:0 D:0
> d1:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d2:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d3:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d4:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d5:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> d6:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1:  RC:241867 TS:15480513 OFFSET:4
> 
> id:            INT64 SNAPPY DO:0 FPO:4 SZ:968154/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:968158 SZ:967555/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:1935713 SZ:967558/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0.00, max: 241866.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:2903271 SZ:968866/1935047/2.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:3872137 SZ:1247007/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0., max: 241866., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5119144 
> SZ:1266850/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 
> 241866, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:6385994 
> SZ:2198910/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0E-18, 
> max: 241866.00, num_nulls: 0]
> row group 2:  RC:241867 TS:15480513 OFFSET:8584904
> 
> id:            INT64 SNAPPY DO:0 FPO:8584904 SZ:968131/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:9553035 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:10520598 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867.00, max: 483733.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:11488161 SZ:968110/1935047/2.00 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:12456271 SZ:1247071/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867., max: 483733., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:13703342 
> SZ:1270587/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, 
> max: 483733, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:14973929 
> SZ:2197306/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 
> 241867.00, max: 483733.00, num_nulls: 
> 0]{noformat}
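> For reference, a minimal sketch (not Spark's actual implementation) of how a 
> decimal literal could be encoded as a Parquet {{Binary}} so it can be compared 
> against the FIXED_LEN_BYTE_ARRAY statistics above. The helper name 
> {{decimalToFixedLenBinary}} and the 16-byte length for decimal(38,0) are 
> assumptions for illustration:
> {code:java}
> import java.math.BigDecimal
> import org.apache.parquet.filter2.predicate.FilterApi
> import org.apache.parquet.io.api.Binary
> 
> // Sign-extend the two's-complement unscaled value to the fixed byte length
> // declared in the Parquet schema, then wrap it as a Binary constant.
> def decimalToFixedLenBinary(d: BigDecimal, numBytes: Int): Binary = {
>   val unscaled = d.unscaledValue().toByteArray  // big-endian two's complement
>   val bytes = new Array[Byte](numBytes)
>   val signByte: Byte = if (unscaled.head < 0) -1 else 0
>   java.util.Arrays.fill(bytes, 0, numBytes - unscaled.length, signByte)
>   System.arraycopy(unscaled, 0, bytes, numBytes - unscaled.length, unscaled.length)
>   Binary.fromConstantByteArray(bytes)
> }
> 
> // e.g. d5 above is decimal(38,0), stored as a 16-byte FIXED_LEN_BYTE_ARRAY
> val predicate = FilterApi.eq(
>   FilterApi.binaryColumn("d5"),
>   decimalToFixedLenBinary(new BigDecimal("241866"), 16))
> {code}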



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-25207) Case-insensitive field resolution for filter pushdown when reading Parquet

2018-09-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-25207:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-25419

> Case-insensitive field resolution for filter pushdown when reading Parquet
> -
>
> Key: SPARK-25207
> URL: https://issues.apache.org/jira/browse/SPARK-25207
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: yucai
>Assignee: yucai
>Priority: Major
>  Labels: Parquet
> Fix For: 2.4.0
>
> Attachments: image.png
>
>
> Currently, filter pushdown does not work if the Parquet schema and the Hive 
> metastore schema use different letter cases, even when spark.sql.caseSensitive 
> is false. Consider the following case:
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> sql("select * from t where id > 0").show{code}
> -No filter will be pushed down.-
> {code}
> scala> sql("select * from t where id > 0").explain   // Filters are pushed 
> with `ID`
> == Physical Plan ==
> *(1) Project [ID#90L]
> +- *(1) Filter (isnotnull(id#90L) && (id#90L > 0))
>+- *(1) FileScan parquet default.t[ID#90L] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/tmp/data], PartitionFilters: [], 
> PushedFilters: [IsNotNull(ID), GreaterThan(ID,0)], ReadSchema: 
> struct
> scala> sql("select * from t").show// Parquet returns NULL for `ID` 
> because it has `id`.
> ++
> |  ID|
> ++
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> ++
> scala> sql("select * from t where id > 0").show   // `NULL > 0` is `false`.
> +---+
> | ID|
> +---+
> +---+
> {code}
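> One possible direction for a fix, sketched here only as an illustration (this 
> is not the actual patch): when spark.sql.caseSensitive is false, resolve the 
> Catalyst field name against the Parquet file schema case-insensitively before 
> building the pushed-down filter. {{parquetSchema}} and {{catalystName}} are 
> hypothetical parameter names:
> {code:java}
> import java.util.Locale
> import scala.collection.JavaConverters._
> import org.apache.parquet.schema.MessageType
> 
> // Look up the Parquet-side spelling of a Catalyst field name, ignoring case.
> // A real implementation would also have to reject names that are ambiguous
> // when compared case-insensitively.
> def resolveFieldName(parquetSchema: MessageType, catalystName: String): Option[String] = {
>   val byLowerCase = parquetSchema.getFields.asScala
>     .map(f => f.getName.toLowerCase(Locale.ROOT) -> f.getName)
>     .toMap
>   byLowerCase.get(catalystName.toLowerCase(Locale.ROOT))
> }
> {code}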



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17091) Convert IN predicate to equivalent Parquet filter

2018-09-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-17091:

Affects Version/s: 2.4.0
  Component/s: SQL
   Issue Type: Sub-task  (was: Bug)
   Parent: SPARK-25419

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Andrew Duffy
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.
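> A minimal sketch of that rewrite, assuming an integer column; the names below 
> are illustrative, and since Parquet's {{FilterApi.or}} takes two operands the 
> equality predicates are folded pairwise:
> {code:java}
> import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}
> 
> // IN (v1, v2, ..., vn) on column `name` becomes
> // eq(name, v1) OR eq(name, v2) OR ... OR eq(name, vn)
> // (assumes a non-empty value list)
> def inToOrPredicate(name: String, values: Seq[java.lang.Integer]): FilterPredicate = {
>   values
>     .map(v => FilterApi.eq(FilterApi.intColumn(name), v): FilterPredicate)
>     .reduceLeft((a, b) => FilterApi.or(a, b))
> }
> {code}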



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25419) Parquet predicate pushdown improvement

2018-09-13 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25419.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> Parquet predicate pushdown improvement
> --
>
> Key: SPARK-25419
> URL: https://issues.apache.org/jira/browse/SPARK-25419
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> Add Parquet predicate pushdown support for ByteType, ShortType, DecimalType, 
> DateType, and TimestampType.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24716) Refactor ParquetFilters

2018-09-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24716:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-25419

> Refactor ParquetFilters
> ---
>
> Key: SPARK-24716
> URL: https://issues.apache.org/jira/browse/SPARK-24716
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24718) Timestamp support pushdown to parquet data source

2018-09-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24718:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-25419

> Timestamp support pushdown to parquet data source
> -
>
> Key: SPARK-24718
> URL: https://issues.apache.org/jira/browse/SPARK-24718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> Something like this:
> {code:java}
> case ParquetSchemaType(TIMESTAMP_MICROS, INT64, null)
>   if pushDownTimestamp =>
>   (n: String, v: Any) => FilterApi.eq(
> longColumn(n),
> Option(v).map(t => (t.asInstanceOf[java.sql.Timestamp].getTime * 1000)
>   .asInstanceOf[java.lang.Long]).orNull)
> case ParquetSchemaType(TIMESTAMP_MILLIS, INT64, null)
>   if pushDownTimestamp =>
>   (n: String, v: Any) => FilterApi.eq(
> longColumn(n),
> Option(v).map(_.asInstanceOf[java.sql.Timestamp].getTime
>   .asInstanceOf[java.lang.Long]).orNull)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24549) Support DecimalType push down to the parquet data sources

2018-09-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24549:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-25419

> Support DecimalType push down to the parquet data sources
> -
>
> Key: SPARK-24549
> URL: https://issues.apache.org/jira/browse/SPARK-24549
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24638) StringStartsWith support push down

2018-09-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24638:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-25419

> StringStartsWith support push down
> --
>
> Key: SPARK-24638
> URL: https://issues.apache.org/jira/browse/SPARK-24638
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24706) Support ByteType and ShortType pushdown to parquet

2018-09-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24706:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-25419

> Support ByteType and ShortType pushdown to parquet
> --
>
> Key: SPARK-24706
> URL: https://issues.apache.org/jira/browse/SPARK-24706
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23727) Support DATE predicate push down in parquet

2018-09-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-23727:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-25419

> Support DATE predicate push down in parquet
> -
>
> Key: SPARK-23727
> URL: https://issues.apache.org/jira/browse/SPARK-23727
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Assignee: yucai
>Priority: Major
> Fix For: 2.4.0
>
>
> DATE predicate push down is missing and should be supported.
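> A rough sketch of what the pushdown case could look like, mirroring the 
> TIMESTAMP sketch in SPARK-24718; the {{ParquetSchemaType(DATE, INT32, null)}} 
> pattern and the {{pushDownDate}} flag are assumptions for illustration, not 
> the actual patch:
> {code:java}
> case ParquetSchemaType(DATE, INT32, null) if pushDownDate =>
>   (n: String, v: Any) => FilterApi.eq(
>     intColumn(n),
>     Option(v).map(d => DateTimeUtils.fromJavaDate(d.asInstanceOf[java.sql.Date])
>       .asInstanceOf[java.lang.Integer]).orNull)
> {code}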



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24538) ByteArrayDecimalType support push down to parquet data sources

2018-09-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24538:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-25419

> ByteArrayDecimalType support push down to parquet data sources
> --
>
> Key: SPARK-24538
> URL: https://issues.apache.org/jira/browse/SPARK-24538
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> The latest Parquet release supports decimal-type statistics, so we can push 
> these filters down to the data source:
> {noformat}
> LM-SHC-16502798:parquet-mr yumwang$ java -jar 
> ./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta 
> /tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> file:         
> file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> creator:      parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:        org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"d1","type":"decimal(9,0)","nullable":true,"metadata":{}},{"name":"d2","type":"decimal(9,2)","nullable":true,"metadata":{}},{"name":"d3","type":"decimal(18,0)","nullable":true,"metadata":{}},{"name":"d4","type":"decimal(18,4)","nullable":true,"metadata":{}},{"name":"d5","type":"decimal(38,0)","nullable":true,"metadata":{}},{"name":"d6","type":"decimal(38,18)","nullable":true,"metadata":{}}]}
> file schema:  spark_schema
> 
> id:           REQUIRED INT64 R:0 D:0
> d1:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d2:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d3:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d4:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d5:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> d6:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1:  RC:241867 TS:15480513 OFFSET:4
> 
> id:            INT64 SNAPPY DO:0 FPO:4 SZ:968154/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:968158 SZ:967555/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:1935713 SZ:967558/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0.00, max: 241866.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:2903271 SZ:968866/1935047/2.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:3872137 SZ:1247007/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0., max: 241866., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5119144 
> SZ:1266850/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 
> 241866, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:6385994 
> SZ:2198910/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0E-18, 
> max: 241866.00, num_nulls: 0]
> row group 2:  RC:241867 TS:15480513 OFFSET:8584904
> 
> id:            INT64 SNAPPY DO:0 FPO:8584904 SZ:968131/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:9553035 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:10520598 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867.00, max: 483733.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:11488161 SZ:968110/1935047/2.00 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:12456271 SZ:1247071/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867., max: 483733., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:13703342 
> SZ:1270587/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, 
> max: 483733, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:14973929 
> SZ:2197306/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 
> 241867.00, max: 483733.00, num_nulls: 
> 0]{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (SPARK-25419) Parquet predicate pushdown improvement

2018-09-13 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-25419:
---

 Summary: Parquet predicate pushdown improvement
 Key: SPARK-25419
 URL: https://issues.apache.org/jira/browse/SPARK-25419
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 2.4.0
Reporter: Yuming Wang


Add Parquet predicate pushdown support for ByteType, ShortType, DecimalType, 
DateType, and TimestampType.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24538) ByteArrayDecimalType support push down to parquet data sources

2018-09-13 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613135#comment-16613135
 ] 

Yuming Wang commented on SPARK-24538:
-

[~cloud_fan] OK, I will do it.

> ByteArrayDecimalType support push down to parquet data sources
> --
>
> Key: SPARK-24538
> URL: https://issues.apache.org/jira/browse/SPARK-24538
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> The latest Parquet release supports decimal-type statistics, so we can push 
> these filters down to the data source:
> {noformat}
> LM-SHC-16502798:parquet-mr yumwang$ java -jar 
> ./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta 
> /tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> file:         
> file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> creator:      parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:        org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"d1","type":"decimal(9,0)","nullable":true,"metadata":{}},{"name":"d2","type":"decimal(9,2)","nullable":true,"metadata":{}},{"name":"d3","type":"decimal(18,0)","nullable":true,"metadata":{}},{"name":"d4","type":"decimal(18,4)","nullable":true,"metadata":{}},{"name":"d5","type":"decimal(38,0)","nullable":true,"metadata":{}},{"name":"d6","type":"decimal(38,18)","nullable":true,"metadata":{}}]}
> file schema:  spark_schema
> 
> id:           REQUIRED INT64 R:0 D:0
> d1:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d2:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d3:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d4:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d5:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> d6:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1:  RC:241867 TS:15480513 OFFSET:4
> 
> id:            INT64 SNAPPY DO:0 FPO:4 SZ:968154/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:968158 SZ:967555/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:1935713 SZ:967558/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0.00, max: 241866.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:2903271 SZ:968866/1935047/2.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:3872137 SZ:1247007/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0., max: 241866., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5119144 
> SZ:1266850/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 
> 241866, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:6385994 
> SZ:2198910/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0E-18, 
> max: 241866.00, num_nulls: 0]
> row group 2:  RC:241867 TS:15480513 OFFSET:8584904
> 
> id:            INT64 SNAPPY DO:0 FPO:8584904 SZ:968131/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:9553035 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:10520598 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867.00, max: 483733.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:11488161 SZ:968110/1935047/2.00 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:12456271 SZ:1247071/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867., max: 483733., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:13703342 
> SZ:1270587/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, 
> max: 483733, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:14973929 
> SZ:2197306/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 
> 241867.00, max: 483733.00, num_nulls: 
> 0]{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-20937) Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide

2018-09-13 Thread Sergei (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613113#comment-16613113
 ] 

Sergei commented on SPARK-20937:


Do you remember what you ended up doing about this?

> Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, 
> DataFrames and Datasets Guide
> -
>
> Key: SPARK-20937
> URL: https://issues.apache.org/jira/browse/SPARK-20937
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> As a follow-up to SPARK-20297 (and SPARK-10400), in which the 
> {{spark.sql.parquet.writeLegacyFormat}} property was recommended for Impala 
> and Hive, the Spark SQL docs for [Parquet 
> Files|https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration]
>  should document it.
> p.s. It was asked about in [Why can't Impala read parquet files after Spark 
> SQL's write?|https://stackoverflow.com/q/44279870/1305344] on StackOverflow 
> today.
> p.s. It's also covered in [~holden.ka...@gmail.com]'s "High Performance 
> Spark: Best Practices for Scaling and Optimizing Apache Spark" book (in Table 
> 3-10. Parquet data source options) that gives the option some wider publicity.
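> For anyone landing here, a minimal example of the setting in question (the 
> DataFrame and the output path are placeholders):
> {code:java}
> // Write decimals and other affected types in the legacy Parquet layout that
> // older Hive/Impala readers expect.
> spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
> df.write.parquet("/tmp/example/parquet")
> {code}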



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24538) ByteArrayDecimalType support push down to parquet data sources

2018-09-13 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613105#comment-16613105
 ] 

Wenchen Fan commented on SPARK-24538:
-

[~yumwang] can you create an umbrella JIRA ticket for all these parquet 
predicate pushdown improvement tickets? Then it will be easier to refer to it in 
the release notes. Thanks!

> ByteArrayDecimalType support push down to parquet data sources
> --
>
> Key: SPARK-24538
> URL: https://issues.apache.org/jira/browse/SPARK-24538
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> The latest Parquet release supports decimal-type statistics, so we can push 
> these filters down to the data source:
> {noformat}
> LM-SHC-16502798:parquet-mr yumwang$ java -jar 
> ./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta 
> /tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> file:         
> file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> creator:      parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:        org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"d1","type":"decimal(9,0)","nullable":true,"metadata":{}},{"name":"d2","type":"decimal(9,2)","nullable":true,"metadata":{}},{"name":"d3","type":"decimal(18,0)","nullable":true,"metadata":{}},{"name":"d4","type":"decimal(18,4)","nullable":true,"metadata":{}},{"name":"d5","type":"decimal(38,0)","nullable":true,"metadata":{}},{"name":"d6","type":"decimal(38,18)","nullable":true,"metadata":{}}]}
> file schema:  spark_schema
> 
> id:           REQUIRED INT64 R:0 D:0
> d1:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d2:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d3:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d4:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d5:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> d6:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1:  RC:241867 TS:15480513 OFFSET:4
> 
> id:            INT64 SNAPPY DO:0 FPO:4 SZ:968154/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:968158 SZ:967555/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:1935713 SZ:967558/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0.00, max: 241866.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:2903271 SZ:968866/1935047/2.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:3872137 SZ:1247007/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0., max: 241866., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5119144 
> SZ:1266850/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 
> 241866, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:6385994 
> SZ:2198910/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0E-18, 
> max: 241866.00, num_nulls: 0]
> row group 2:  RC:241867 TS:15480513 OFFSET:8584904
> 
> id:            INT64 SNAPPY DO:0 FPO:8584904 SZ:968131/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:9553035 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:10520598 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867.00, max: 483733.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:11488161 SZ:968110/1935047/2.00 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:12456271 SZ:1247071/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867., max: 483733., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:13703342 
> SZ:1270587/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, 
> max: 483733, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:14973929 
> SZ:2197306/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 
> 241867.00, max: 483733.00,