[jira] [Assigned] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian

2019-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26985:


Assignee: (was: Apache Spark)

> Test "access only some column of the all of columns " fails on big endian
> -
>
> Key: SPARK-26985
> URL: https://issues.apache.org/jira/browse/SPARK-26985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Linux Ubuntu 16.04 
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed 
> References 20190205_218 (JIT enabled, AOT enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>  
>Reporter: Anuja Jakhade
>Priority: Major
>  Labels: BigEndian
> Attachments: DataFrameTungstenSuite.txt, 
> InMemoryColumnarQuerySuite.txt, access only some column of the all of 
> columns.txt
>
>
> While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am
> observing test failures in two suites of the SQL project:
>  1. InMemoryColumnarQuerySuite
>  2. DataFrameTungstenSuite
>  In both cases the test "access only some column of the all of columns" fails
> due to a mismatch in the final assert.
> The data obtained after df.cache() appears to be causing the error. Please
> find attached the log with the details.
> cache() works perfectly fine if double and float values are not involved.
> Inside the test: "access only some column of the all of columns" *** FAILED ***
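
For illustration only, a minimal sketch (not the actual suite code; the column names and sizes are made up) of the cache-then-project pattern the failing test exercises:

{code:java}
// Hedged reproduction sketch: cache a DataFrame containing float/double
// columns, then access only a subset of the columns and compare against the
// uncached result. The mismatch is reported only on big-endian JVMs.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("be-cache-check").master("local[2]").getOrCreate()
import spark.implicits._

val df = (1 to 100).map(i => (i, i.toLong, i.toFloat, i.toDouble)).toDF("i", "l", "f", "d")
val expected = df.select("f", "d").collect()

df.cache()
df.count()                                  // materialize the in-memory columnar cache
val actual = df.select("f", "d").collect()  // access only some of the columns

assert(expected.sameElements(actual))       // reported to fail on big endian
{code}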






[jira] [Commented] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian

2019-06-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858318#comment-16858318
 ] 

Apache Spark commented on SPARK-26985:
--

User 'ketank-new' has created a pull request for this issue:
https://github.com/apache/spark/pull/24788

> Test "access only some column of the all of columns " fails on big endian
> -
>
> Key: SPARK-26985
> URL: https://issues.apache.org/jira/browse/SPARK-26985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Linux Ubuntu 16.04 
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed 
> References 20190205_218 (JIT enabled, AOT enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>  
>Reporter: Anuja Jakhade
>Priority: Major
>  Labels: BigEndian
> Attachments: DataFrameTungstenSuite.txt, 
> InMemoryColumnarQuerySuite.txt, access only some column of the all of 
> columns.txt
>
>
> While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am
> observing test failures in two suites of the SQL project:
>  1. InMemoryColumnarQuerySuite
>  2. DataFrameTungstenSuite
>  In both cases the test "access only some column of the all of columns" fails
> due to a mismatch in the final assert.
> The data obtained after df.cache() appears to be causing the error. Please
> find attached the log with the details.
> cache() works perfectly fine if double and float values are not involved.
> Inside the test: "access only some column of the all of columns" *** FAILED ***






[jira] [Assigned] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian

2019-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26985:


Assignee: Apache Spark

> Test "access only some column of the all of columns " fails on big endian
> -
>
> Key: SPARK-26985
> URL: https://issues.apache.org/jira/browse/SPARK-26985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Linux Ubuntu 16.04 
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed 
> References 20190205_218 (JIT enabled, AOT enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>  
>Reporter: Anuja Jakhade
>Assignee: Apache Spark
>Priority: Major
>  Labels: BigEndian
> Attachments: DataFrameTungstenSuite.txt, 
> InMemoryColumnarQuerySuite.txt, access only some column of the all of 
> columns.txt
>
>
> While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am
> observing test failures in two suites of the SQL project:
>  1. InMemoryColumnarQuerySuite
>  2. DataFrameTungstenSuite
>  In both cases the test "access only some column of the all of columns" fails
> due to a mismatch in the final assert.
> The data obtained after df.cache() appears to be causing the error. Please
> find attached the log with the details.
> cache() works perfectly fine if double and float values are not involved.
> Inside the test: "access only some column of the all of columns" *** FAILED ***






[jira] [Assigned] (SPARK-27973) Streaming sample DirectKafkaWordCount should mention GroupId in usage

2019-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27973:


Assignee: (was: Apache Spark)

> Streaming sample DirectKafkaWordCount should mention GroupId in usage
> -
>
> Key: SPARK-27973
> URL: https://issues.apache.org/jira/browse/SPARK-27973
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.4.3
>Reporter: Yuexin Zhang
>Priority: Minor
>
> The DirectKafkaWordCount sample has been updated to take a consumer group id as
> one of the input arguments, but we missed it in the sample usage:
>   System.err.println(s"""
> |Usage: DirectKafkaWordCount <brokers> <groupId> <topics>
> |  <brokers> is a list of one or more Kafka brokers
> |  <groupId> is a consumer group name to consume from topics
> |  <topics> is a list of one or more kafka topics to consume from
> |
> """.stripMargin)
> Usage should be: DirectKafkaWordCount <brokers> <groupId> <topics>






[jira] [Assigned] (SPARK-27973) Streaming sample DirectKafkaWordCount should mention GroupId in usage

2019-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27973:


Assignee: Apache Spark

> Streaming sample DirectKafkaWordCount should mention GroupId in usage
> -
>
> Key: SPARK-27973
> URL: https://issues.apache.org/jira/browse/SPARK-27973
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.4.3
>Reporter: Yuexin Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> The DirectKafkaWordCount sample has been updated to take a consumer group id as
> one of the input arguments, but we missed it in the sample usage:
>   System.err.println(s"""
> |Usage: DirectKafkaWordCount <brokers> <groupId> <topics>
> |  <brokers> is a list of one or more Kafka brokers
> |  <groupId> is a consumer group name to consume from topics
> |  <topics> is a list of one or more kafka topics to consume from
> |
> """.stripMargin)
> Usage should be: DirectKafkaWordCount <brokers> <groupId> <topics>






[jira] [Created] (SPARK-27973) Streaming sample DirectKafkaWordCount should mention GroupId in usage

2019-06-06 Thread Yuexin Zhang (JIRA)
Yuexin Zhang created SPARK-27973:


 Summary: Streaming sample DirectKafkaWordCount should mention 
GroupId in usage
 Key: SPARK-27973
 URL: https://issues.apache.org/jira/browse/SPARK-27973
 Project: Spark
  Issue Type: Improvement
  Components: Examples
Affects Versions: 2.4.3
Reporter: Yuexin Zhang


The DirectKafkaWordCount sample has been updated to take a consumer group id as
one of the input arguments, but we missed it in the sample usage:

  System.err.println(s"""
    |Usage: DirectKafkaWordCount <brokers> <groupId> <topics>
    |  <brokers> is a list of one or more Kafka brokers
    |  <groupId> is a consumer group name to consume from topics
    |  <topics> is a list of one or more kafka topics to consume from
    |
  """.stripMargin)

Usage should be: DirectKafkaWordCount <brokers> <groupId> <topics>
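
For context, a hedged sketch (simplified from the example, not a verbatim excerpt) of how the three arguments, including the group id, feed the Kafka consumer configuration:

{code:java}
// Simplified sketch: the second CLI argument becomes the Kafka "group.id",
// which is why it belongs in the usage message above.
import org.apache.kafka.common.serialization.StringDeserializer

val Array(brokers, groupId, topics) = args
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> brokers,
  "group.id"           -> groupId,
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer])
{code}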






[jira] [Updated] (SPARK-27928) hadoop

2019-06-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27928:
-
Target Version/s:   (was: 2.4.0)

> hadoop
> --
>
> Key: SPARK-27928
> URL: https://issues.apache.org/jira/browse/SPARK-27928
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.3
>Reporter: Rajesh
>Priority: Major
> Fix For: 2.4.4
>
>







[jira] [Resolved] (SPARK-27928) hadoop

2019-06-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27928.
--
Resolution: Invalid

> hadoop
> --
>
> Key: SPARK-27928
> URL: https://issues.apache.org/jira/browse/SPARK-27928
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.3
>Reporter: Rajesh
>Priority: Major
> Fix For: 2.4.4
>
>







[jira] [Updated] (SPARK-27928) hadoop

2019-06-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27928:
-
Fix Version/s: (was: 2.4.4)

> hadoop
> --
>
> Key: SPARK-27928
> URL: https://issues.apache.org/jira/browse/SPARK-27928
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.3
>Reporter: Rajesh
>Priority: Major
>







[jira] [Resolved] (SPARK-27956) Allow subqueries as partition filter

2019-06-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27956.
--
Resolution: Duplicate

> Allow subqueries as partition filter
> 
>
> Key: SPARK-27956
> URL: https://issues.apache.org/jira/browse/SPARK-27956
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Johannes Mayer
>Priority: Major
>
> Subqueries are not pushed down as partition filters. See following example
>  
> {code:java}
> create table user_mayerjoh.tab (c1 string)
> partitioned by (c2 string)
> stored as parquet;
> {code}
>  
>  
> {code:java}
> explain select * from user_mayerjoh.tab where c2 < 1;{code}
>  
>   == Physical Plan ==
> *(1) FileScan parquet user_mayerjoh.tab[c1#22,c2#23] Batched: true, Format: 
> Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, 
> *PartitionFilters: [isnotnull(c2#23), (cast(c2#23 as int) < 1)]*, 
> PushedFilters: [], ReadSchema: struct
>  
>  
> {code:java}
> explain select * from user_mayerjoh.tab where c2 < (select 1);{code}
>  
> == Physical Plan ==
>  
> +- *(1) FileScan parquet user_mayerjoh.tab[c1#30,c2#31] Batched: true, 
> Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, 
> *PartitionFilters: [isnotnull(c2#31)]*, PushedFilters: [], ReadSchema: 
> struct
>  
> Is it possible to first execute the subquery and use the result as partition 
> filter?
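
A possible workaround until subqueries participate in partition pruning, sketched here under the assumption that the subquery is a cheap scalar query that can be evaluated eagerly on the driver:

{code:java}
// Hedged workaround sketch, not a fix: run the scalar subquery first and
// substitute the resulting literal, so the partition filter can be pushed down.
import org.apache.spark.sql.functions.col

val threshold = spark.sql("SELECT 1").first().getInt(0)
spark.table("user_mayerjoh.tab")
  .where(col("c2") < threshold)
  .explain()
{code}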






[jira] [Commented] (SPARK-27966) input_file_name empty when listing files in parallel

2019-06-06 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858291#comment-16858291
 ] 

Hyukjin Kwon commented on SPARK-27966:
--

Do you know the steps to reproduce this? In my environment it always shows the
output properly, so I have no choice but to resolve this JIRA.

> input_file_name empty when listing files in parallel
> 
>
> Key: SPARK-27966
> URL: https://issues.apache.org/jira/browse/SPARK-27966
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
> Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
>  
> Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Workers: 3
> Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
>Reporter: Christian Homberg
>Priority: Minor
> Attachments: input_file_name_bug
>
>
> I ran into an issue similar to, and probably related to, SPARK-26128. The
> _org.apache.spark.sql.functions.input_file_name_ is sometimes empty.
>  
> {code:java}
> df.select(input_file_name()).show(5,false)
> {code}
>  
> {code:java}
> +-----------------+
> |input_file_name()|
> +-----------------+
> |                 |
> |                 |
> |                 |
> |                 |
> |                 |
> +-----------------+
> {code}
> My environment is Databricks, and debugging the Log4j output showed me that
> the issue occurs when the files are listed in parallel, e.g. when
> {code:java}
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 127; threshold: 32
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under:{code}
>  
> Everything's fine as long as
> {code:java}
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 6; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> {code}
>  
> Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  
> resolves the issue for me.
>  
> *edit: the problem is not exclusively linked to listing files in parallel. 
> I've setup a larger cluster for which after parallel file listing the 
> input_file_name did return the correct filename. After inspecting the log4j 
> again, I assume that it's linked to some kind of MetaStore being full. I've 
> attached a section of the log4j output that I think should indicate why it's 
> failing. If you need more, please let me know.*
>
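
For illustration, a hedged sketch of the workaround mentioned above (the concrete threshold value used by the reporter did not survive in the text; any value larger than the number of input paths disables the parallel listing; the path below is a placeholder):

{code:java}
// Raise the parallel-listing threshold so InMemoryFileIndex lists files on the
// driver, then check input_file_name() again.
import org.apache.spark.sql.functions.input_file_name

spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "10000")
val df = spark.read.parquet("/path/to/data")
df.select(input_file_name()).show(5, false)
{code}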






[jira] [Assigned] (SPARK-27970) Support Hive 3.0 metastore

2019-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27970:


Assignee: (was: Apache Spark)

> Support Hive 3.0 metastore
> --
>
> Key: SPARK-27970
> URL: https://issues.apache.org/jira/browse/SPARK-27970
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: screenshot-1.png
>
>
> It seems that some users are using Hive 3.0.0, at least HDP 3.0.0:
> !https://camo.githubusercontent.com/736d8a9f04d3960e0cdc3a8ee09aa199ce103b51/68747470733a2f2f32786262686a786336776b3376323170363274386e3464342d7770656e67696e652e6e6574646e612d73736c2e636f6d2f77702d636f6e74656e742f75706c6f6164732f323031382f31322f6864702d332e312e312d4173706172616775732e706e67!
>  






[jira] [Commented] (SPARK-27970) Support Hive 3.0 metastore

2019-06-06 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858288#comment-16858288
 ] 

Apache Spark commented on SPARK-27970:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/24688

> Support Hive 3.0 metastore
> --
>
> Key: SPARK-27970
> URL: https://issues.apache.org/jira/browse/SPARK-27970
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: screenshot-1.png
>
>
> It seems that some users are using Hive 3.0.0, at least HDP 3.0.0:
> !https://camo.githubusercontent.com/736d8a9f04d3960e0cdc3a8ee09aa199ce103b51/68747470733a2f2f32786262686a786336776b3376323170363274386e3464342d7770656e67696e652e6e6574646e612d73736c2e636f6d2f77702d636f6e74656e742f75706c6f6164732f323031382f31322f6864702d332e312e312d4173706172616775732e706e67!
>  






[jira] [Assigned] (SPARK-27970) Support Hive 3.0 metastore

2019-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27970:


Assignee: Apache Spark

> Support Hive 3.0 metastore
> --
>
> Key: SPARK-27970
> URL: https://issues.apache.org/jira/browse/SPARK-27970
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
> Attachments: screenshot-1.png
>
>
> It seems that some users are using Hive 3.0.0, at least HDP 3.0.0:
> !https://camo.githubusercontent.com/736d8a9f04d3960e0cdc3a8ee09aa199ce103b51/68747470733a2f2f32786262686a786336776b3376323170363274386e3464342d7770656e67696e652e6e6574646e612d73736c2e636f6d2f77702d636f6e74656e742f75706c6f6164732f323031382f31322f6864702d332e312e312d4173706172616775732e706e67!
>  






[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubernetes

2019-06-06 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858285#comment-16858285
 ] 

Hyukjin Kwon commented on SPARK-27927:
--

Does this happen in non-Kubernetes environments?

> driver pod hangs with pyspark 2.4.3 and master on kubernetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
>
> When we run a simple PySpark job on Spark 2.4.3 or 3.0.0, the driver pod hangs
> and never calls the shutdown hook.
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on Kubernetes, the driver and executor just hang. We
> see the output of this Python script.
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world>
> parallelism=2 python version=3.6{noformat}
> What works:
>  * a simple Python script with a print works fine on 2.4.3 and 3.0.0
>  * the same setup on 2.4.0
>  * 2.4.3 spark-submit with the above PySpark script
>  
>  
>  






[jira] [Updated] (SPARK-27962) Propagate subprocess stdout when subprocess exits with nonzero status in deploy.RRunner

2019-06-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27962:
-
Component/s: SparkR

> Propagate subprocess stdout when subprocess exits with nonzero status in 
> deploy.RRunner
> ---
>
> Key: SPARK-27962
> URL: https://issues.apache.org/jira/browse/SPARK-27962
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core, SparkR
>Affects Versions: 2.4.3
>Reporter: Jeremy Liu
>Priority: Minor
>
> When the R process launched in org.apache.spark.deploy.RRunner terminates
> with a nonzero status code, only the status code is passed on in the
> SparkUserAppException.
> Although the subprocess' stdout is continually piped to System.out, it would
> be useful for users without access to the JVM's stdout to also capture the
> last few lines of the R process's output and pass them along in the exception
> message.
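
For illustration, a minimal sketch of the general technique being proposed (plain scala.sys.process, not the actual RRunner code): keep a bounded tail of subprocess output and surface it when the exit status is nonzero.

{code:java}
// Hedged sketch: forward the child's stdout as usual, but keep the last N
// lines so they can be attached to the failure message.
import scala.collection.mutable
import scala.sys.process._

def runAndCapture(cmd: Seq[String], keepLines: Int = 20): Int = {
  val tail = mutable.Queue.empty[String]
  val logger = ProcessLogger { line =>
    println(line)                        // still pipe to our own stdout
    tail.enqueue(line)
    if (tail.size > keepLines) tail.dequeue()
  }
  val status = Process(cmd).!(logger)
  if (status != 0) {
    throw new RuntimeException(
      s"Subprocess exited with code $status. Last output:\n${tail.mkString("\n")}")
  }
  status
}
{code}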






[jira] [Assigned] (SPARK-27938) Turn on LEGACY_PASS_PARTITION_BY_AS_OPTIONS by default

2019-06-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27938:


Assignee: Liwen Sun

> Turn on LEGACY_PASS_PARTITION_BY_AS_OPTIONS by default
> --
>
> Key: SPARK-27938
> URL: https://issues.apache.org/jira/browse/SPARK-27938
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liwen Sun
>Assignee: Liwen Sun
>Priority: Major
>
> In SPARK-27453, we added a config {{LEGACY_PASS_PARTITION_BY_AS_OPTIONS}} for 
> patch release 2.4.3, where we keep this config default as false so it's not 
> intrusive. We can turn it on by default for Spark 3.0.






[jira] [Resolved] (SPARK-27938) Turn on LEGACY_PASS_PARTITION_BY_AS_OPTIONS by default

2019-06-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27938.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24784
[https://github.com/apache/spark/pull/24784]

> Turn on LEGACY_PASS_PARTITION_BY_AS_OPTIONS by default
> --
>
> Key: SPARK-27938
> URL: https://issues.apache.org/jira/browse/SPARK-27938
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liwen Sun
>Assignee: Liwen Sun
>Priority: Major
> Fix For: 3.0.0
>
>
> In SPARK-27453, we added a config {{LEGACY_PASS_PARTITION_BY_AS_OPTIONS}} for 
> patch release 2.4.3, where we keep this config default as false so it's not 
> intrusive. We can turn it on by default for Spark 3.0.






[jira] [Assigned] (SPARK-27971) MapPartitionsInRWithArrowExec.evaluate shouldn't eagerly read the first batch

2019-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27971:


Assignee: Apache Spark

> MapPartitionsInRWithArrowExec.evaluate shouldn't eagerly read the first batch
> -
>
> Key: SPARK-27971
> URL: https://issues.apache.org/jira/browse/SPARK-27971
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Same as SPARK-27968 but R's dapply one.
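
For illustration of the general pattern being asked for (a plain Scala sketch, not the planner code): wrap the iterator so nothing is pulled from it until the consumer actually asks, instead of reading the first batch at construction time.

{code:java}
// Hedged sketch: defer consumption of the underlying iterator until the first
// hasNext/next call from downstream.
def lazily[T](make: => Iterator[T]): Iterator[T] = new Iterator[T] {
  private lazy val underlying = make   // not evaluated at construction time
  override def hasNext: Boolean = underlying.hasNext
  override def next(): T = underlying.next()
}
{code}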






[jira] [Assigned] (SPARK-27971) MapPartitionsInRWithArrowExec.evaluate shouldn't eagerly read the first batch

2019-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27971:


Assignee: (was: Apache Spark)

> MapPartitionsInRWithArrowExec.evaluate shouldn't eagerly read the first batch
> -
>
> Key: SPARK-27971
> URL: https://issues.apache.org/jira/browse/SPARK-27971
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Same as SPARK-27968 but R's dapply one.






[jira] [Updated] (SPARK-27971) MapPartitionsInRWithArrowExec.evaluate shouldn't eagerly read the first batch

2019-06-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27971:
-
Summary: MapPartitionsInRWithArrowExec.evaluate shouldn't eagerly read the 
first batch  (was: MapPartitionsInRWithArrowExec.evaluate shouldn't eagerly 
read the first row)

> MapPartitionsInRWithArrowExec.evaluate shouldn't eagerly read the first batch
> -
>
> Key: SPARK-27971
> URL: https://issues.apache.org/jira/browse/SPARK-27971
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Same as SPARK-27968 but R's dapply one.






[jira] [Updated] (SPARK-27968) ArrowEvalPythonExec.evaluate shouldn't eagerly read the first batch

2019-06-06 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27968:
-
Summary: ArrowEvalPythonExec.evaluate shouldn't eagerly read the first 
batch  (was: ArrowEvalPythonExec.evaluate shouldn't eagerly read the first row)

> ArrowEvalPythonExec.evaluate shouldn't eagerly read the first batch
> ---
>
> Key: SPARK-27968
> URL: https://issues.apache.org/jira/browse/SPARK-27968
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
> Fix For: 3.0.0
>
>
> An issue mentioned here: 
> https://github.com/apache/spark/pull/24734/files#r288377915, could be 
> decoupled from that PR.






[jira] [Created] (SPARK-27972) Move SQL migration guide to the top level

2019-06-06 Thread Xiao Li (JIRA)
Xiao Li created SPARK-27972:
---

 Summary: Move SQL migration guide to the top level
 Key: SPARK-27972
 URL: https://issues.apache.org/jira/browse/SPARK-27972
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Xiao Li


Currently, only SQL and MLlib have dedicated sections for documenting
behavior changes and breaking changes. We have found that these guides simplify
the upgrade experience for end users.

[https://spark.apache.org/docs/latest/sql-migration-guide.html]

[https://spark.apache.org/docs/2.4.3/ml-guide.html#migration-guide]

The other components can do similar things in the same doc. Here, we propose to
combine the migration guides and move them to the top level. All components can
then document their behavior changes in the same PR that introduces the changes.






[jira] [Created] (SPARK-27971) MapPartitionsInRWithArrowExec.evaluate shouldn't eagerly read the first row

2019-06-06 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-27971:


 Summary: MapPartitionsInRWithArrowExec.evaluate shouldn't eagerly 
read the first row
 Key: SPARK-27971
 URL: https://issues.apache.org/jira/browse/SPARK-27971
 Project: Spark
  Issue Type: Improvement
  Components: SparkR, SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


Same as SPARK-27968 but R's dapply one.






[jira] [Commented] (SPARK-27969) Non-deterministic expressions in filters or projects can unnecessarily prevent all scan-time column pruning, harming performance

2019-06-06 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858208#comment-16858208
 ] 

Josh Rosen commented on SPARK-27969:


It looks like this issue has been reported twice in the past:
 * SPARK-14172 described a case where non-deterministic filters could prevent 
Hive partition pruning from occurring (which can have a huge performance 
impact!)
 * SPARK-21520 is a near-exact duplicate, showing how non-deterministic 
projections prevent column pruning in HiveTableScan.

It looks like [~jiangxb1987] tried to fix this back in 2016 in 
[https://github.com/apache/spark/pull/13893] and [~heary-cao] attempted a 
different fix in 2017 in [https://github.com/apache/spark/pull/18969]

> Non-deterministic expressions in filters or projects can unnecessarily 
> prevent all scan-time column pruning, harming performance
> 
>
> Key: SPARK-27969
> URL: https://issues.apache.org/jira/browse/SPARK-27969
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> If a scan operator is followed by a projection or filter and those operators 
> contain _any_ non-deterministic expressions then scan column pruning 
> optimizations are completely skipped, harming query performance.
> Here's an example of the problem:
> {code:java}
> import org.apache.spark.sql.functions._
> val df = spark.createDataset(Seq(
>   (1, 2, 3, 4, 5),
>   (1, 2, 3, 4, 5)
> ))
> val tmpPath = 
> java.nio.file.Files.createTempDirectory("column-pruning-bug").toString()
> df.write.parquet(tmpPath)
> val fromParquet = spark.read.parquet(tmpPath){code}
> If all expressions are deterministic then, as expected, column pruning is 
> pushed into the scan
> {code:java}
> fromParquet.select("_1").explain
> == Physical Plan == *(1) FileScan parquet [_1#68] Batched: true, DataFilters: 
> [], Format: Parquet, Location: 
> InMemoryFileIndex[dbfs:/local_disk0/tmp/column-pruning-bug7865798834978814548],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:int>{code}
> However, if we add a non-deterministic filter then no column pruning is 
> performed (even though pruning would be safe!):
> {code:java}
> fromParquet.select("_1").filter(rand() =!= 0).explain
> == Physical Plan ==
> *(1) Project [_1#127]
> +- *(1) Filter NOT (rand(-1515289268025792238) = 0.0)
> +- *(1) FileScan parquet [_1#127,_2#128,_3#129,_4#130,_5#131] Batched: true, 
> DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[dbfs:/local_disk0/tmp/column-pruning-bug4043817424882943496],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<_1:int,_2:int,_3:int,_4:int,_5:int>{code}
> Similarly, a non-deterministic expression in a second projection can end up 
> being collapsed such that it prevents column pruning:
> {code:java}
> fromParquet.select("_1").select($"_1", rand()).explain
> == Physical Plan ==
> *(1) Project [_1#127, rand(1267140591146561563) AS 
> rand(1267140591146561563)#141]
> +- *(1) FileScan parquet [_1#127,_2#128,_3#129,_4#130,_5#131] Batched: true, 
> DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[dbfs:/local_disk0/tmp/column-pruning-bug4043817424882943496],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<_1:int,_2:int,_3:int,_4:int,_5:int>
> {code}
> I believe that this is caused by SPARK-10316: the Parquet column pruning code 
> relies on the [{{PhysicalOperation}} unapply 
> method|https://github.com/apache/spark/blob/v2.4.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L43]
>  for extracting projects and filters and this helper purposely fails to match 
> if _any_ projection or filter is non-deterministic.
> It looks like this conservative behavior may have originally been added to 
> avoid pushdown / re-ordering of non-deterministic filter expressions. 
> However, in this case I feel that it's _too_ conservative: even though we 
> can't push down non-deterministic filters we should still be able to perform 
> column pruning. 
> /cc [~cloud_fan] and [~marmbrus] (it looks like you [discussed collapsing of 
> non-deterministic 
> projects|https://github.com/apache/spark/pull/8486#issuecomment-136036533] in 
> the SPARK-10316 PR, which is related to why the third example above did not 
> prune).






[jira] [Updated] (SPARK-27970) Support Hive 3.0 metastore

2019-06-06 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27970:

Description: 
It seems that some users are using Hive 3.0.0, at least HDP 3.0.0:
!https://camo.githubusercontent.com/736d8a9f04d3960e0cdc3a8ee09aa199ce103b51/68747470733a2f2f32786262686a786336776b3376323170363274386e3464342d7770656e67696e652e6e6574646e612d73736c2e636f6d2f77702d636f6e74656e742f75706c6f6164732f323031382f31322f6864702d332e312e312d4173706172616775732e706e67!
 



  was:
It seems that some users are using Hive 3.0.0, at least HDP 3.0.0:



https://camo.githubusercontent.com/736d8a9f04d3960e0cdc3a8ee09aa199ce103b51/68747470733a2f2f32786262686a786336776b3376323170363274386e3464342d7770656e67696e652e6e6574646e612d73736c2e636f6d2f77702d636f6e74656e742f75706c6f6164732f323031382f31322f6864702d332e312e312d4173706172616775732e706e67


> Support Hive 3.0 metastore
> --
>
> Key: SPARK-27970
> URL: https://issues.apache.org/jira/browse/SPARK-27970
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: screenshot-1.png
>
>
> It seems that some users are using Hive 3.0.0, at least HDP 3.0.0:
> !https://camo.githubusercontent.com/736d8a9f04d3960e0cdc3a8ee09aa199ce103b51/68747470733a2f2f32786262686a786336776b3376323170363274386e3464342d7770656e67696e652e6e6574646e612d73736c2e636f6d2f77702d636f6e74656e742f75706c6f6164732f323031382f31322f6864702d332e312e312d4173706172616775732e706e67!
>  






[jira] [Updated] (SPARK-27970) Support Hive 3.0 metastore

2019-06-06 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27970:

Attachment: screenshot-1.png

> Support Hive 3.0 metastore
> --
>
> Key: SPARK-27970
> URL: https://issues.apache.org/jira/browse/SPARK-27970
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: screenshot-1.png
>
>
> It seems that some users are using Hive 3.0.0, at least HDP 3.0.0:
> https://camo.githubusercontent.com/736d8a9f04d3960e0cdc3a8ee09aa199ce103b51/68747470733a2f2f32786262686a786336776b3376323170363274386e3464342d7770656e67696e652e6e6574646e612d73736c2e636f6d2f77702d636f6e74656e742f75706c6f6164732f323031382f31322f6864702d332e312e312d4173706172616775732e706e67






[jira] [Created] (SPARK-27970) Support Hive 3.0 metastore

2019-06-06 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-27970:
---

 Summary: Support Hive 3.0 metastore
 Key: SPARK-27970
 URL: https://issues.apache.org/jira/browse/SPARK-27970
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


It seems that some users are using Hive 3.0.0, at least HDP 3.0.0:



https://camo.githubusercontent.com/736d8a9f04d3960e0cdc3a8ee09aa199ce103b51/68747470733a2f2f32786262686a786336776b3376323170363274386e3464342d7770656e67696e652e6e6574646e612d73736c2e636f6d2f77702d636f6e74656e742f75706c6f6164732f323031382f31322f6864702d332e312e312d4173706172616775732e706e67






[jira] [Updated] (SPARK-27969) Non-deterministic expressions in filters or projects can unnecessarily prevent all scan-time column pruning, harming performance

2019-06-06 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-27969:
---
Component/s: Optimizer

> Non-deterministic expressions in filters or projects can unnecessarily 
> prevent all scan-time column pruning, harming performance
> 
>
> Key: SPARK-27969
> URL: https://issues.apache.org/jira/browse/SPARK-27969
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> If a scan operator is followed by a projection or filter and those operators 
> contain _any_ non-deterministic expressions then scan column pruning 
> optimizations are completely skipped, harming query performance.
> Here's an example of the problem:
> {code:java}
> import org.apache.spark.sql.functions._
> val df = spark.createDataset(Seq(
>   (1, 2, 3, 4, 5),
>   (1, 2, 3, 4, 5)
> ))
> val tmpPath = 
> java.nio.file.Files.createTempDirectory("column-pruning-bug").toString()
> df.write.parquet(tmpPath)
> val fromParquet = spark.read.parquet(tmpPath){code}
> If all expressions are deterministic then, as expected, column pruning is 
> pushed into the scan
> {code:java}
> fromParquet.select("_1").explain
> == Physical Plan == *(1) FileScan parquet [_1#68] Batched: true, DataFilters: 
> [], Format: Parquet, Location: 
> InMemoryFileIndex[dbfs:/local_disk0/tmp/column-pruning-bug7865798834978814548],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:int>{code}
> However, if we add a non-deterministic filter then no column pruning is 
> performed (even though pruning would be safe!):
> {code:java}
> fromParquet.select("_1").filter(rand() =!= 0).explain
> == Physical Plan ==
> *(1) Project [_1#127]
> +- *(1) Filter NOT (rand(-1515289268025792238) = 0.0)
> +- *(1) FileScan parquet [_1#127,_2#128,_3#129,_4#130,_5#131] Batched: true, 
> DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[dbfs:/local_disk0/tmp/column-pruning-bug4043817424882943496],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<_1:int,_2:int,_3:int,_4:int,_5:int>{code}
> Similarly, a non-deterministic expression in a second projection can end up 
> being collapsed such that it prevents column pruning:
> {code:java}
> fromParquet.select("_1").select($"_1", rand()).explain
> == Physical Plan ==
> *(1) Project [_1#127, rand(1267140591146561563) AS 
> rand(1267140591146561563)#141]
> +- *(1) FileScan parquet [_1#127,_2#128,_3#129,_4#130,_5#131] Batched: true, 
> DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[dbfs:/local_disk0/tmp/column-pruning-bug4043817424882943496],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<_1:int,_2:int,_3:int,_4:int,_5:int>
> {code}
> I believe that this is caused by SPARK-10316: the Parquet column pruning code 
> relies on the [{{PhysicalOperation}} unapply 
> method|https://github.com/apache/spark/blob/v2.4.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L43]
>  for extracting projects and filters and this helper purposely fails to match 
> if _any_ projection or filter is non-deterministic.
> It looks like this conservative behavior may have originally been added to 
> avoid pushdown / re-ordering of non-deterministic filter expressions. 
> However, in this case I feel that it's _too_ conservative: even though we 
> can't push down non-deterministic filters we should still be able to perform 
> column pruning. 
> /cc [~cloud_fan] and [~marmbrus] (it looks like you [discussed collapsing of 
> non-deterministic 
> projects|https://github.com/apache/spark/pull/8486#issuecomment-136036533] in 
> the SPARK-10316 PR, which is related to why the third example above did not 
> prune).






[jira] [Commented] (SPARK-27761) Make UDF nondeterministic by default(?)

2019-06-06 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858155#comment-16858155
 ] 

Josh Rosen commented on SPARK-27761:


FYI, I'm marking SPARK-27969 as a blocker to this because non-deterministic 
expressions can unnecessarily prevent scan-time column pruning: as a result, a 
change of default could lead to massive performance regressions when users 
upgrade to 3.0.

> Make UDF nondeterministic by default(?)
> ---
>
> Key: SPARK-27761
> URL: https://issues.apache.org/jira/browse/SPARK-27761
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Sunitha Kambhampati
>Priority: Minor
>
> Opening this issue as a followup from a discussion/question on this PR for an 
> optimization involving deterministic udf: 
> https://github.com/apache/spark/pull/24593#pullrequestreview-237361795  
> "We even should discuss whether all UDFs must be deterministic or 
> non-deterministic by default."
> Basically, today in Spark 2.4, Scala UDFs are implicitly marked deterministic by
> default. To mark a UDF as non-deterministic, users need to call the method
> asNondeterministic().
> The concerns expressed are that users are not aware of this property and its
> implications.
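
For reference, a short sketch of the current Spark 2.4 behavior described above (the UDF names and bodies are made up):

{code:java}
import org.apache.spark.sql.functions.udf

val plusOne = udf((x: Int) => x + 1)  // deterministic by default, implicitly
val noise   = udf(() => scala.util.Random.nextDouble).asNondeterministic()  // explicit opt-out
{code}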






[jira] [Created] (SPARK-27969) Non-deterministic expressions in filters or projects can unnecessarily prevent all scan-time column pruning, harming performance

2019-06-06 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-27969:
--

 Summary: Non-deterministic expressions in filters or projects can 
unnecessarily prevent all scan-time column pruning, harming performance
 Key: SPARK-27969
 URL: https://issues.apache.org/jira/browse/SPARK-27969
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Josh Rosen


If a scan operator is followed by a projection or filter and those operators 
contain _any_ non-deterministic expressions then scan column pruning 
optimizations are completely skipped, harming query performance.

Here's an example of the problem:
{code:java}
import org.apache.spark.sql.functions._
val df = spark.createDataset(Seq(
  (1, 2, 3, 4, 5),
  (1, 2, 3, 4, 5)
))
val tmpPath = 
java.nio.file.Files.createTempDirectory("column-pruning-bug").toString()
df.write.parquet(tmpPath)
val fromParquet = spark.read.parquet(tmpPath){code}
If all expressions are deterministic then, as expected, column pruning is 
pushed into the scan
{code:java}
fromParquet.select("_1").explain

== Physical Plan == *(1) FileScan parquet [_1#68] Batched: true, DataFilters: 
[], Format: Parquet, Location: 
InMemoryFileIndex[dbfs:/local_disk0/tmp/column-pruning-bug7865798834978814548], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:int>{code}
However, if we add a non-deterministic filter then no column pruning is 
performed (even though pruning would be safe!):
{code:java}
fromParquet.select("_1").filter(rand() =!= 0).explain

== Physical Plan ==
*(1) Project [_1#127]
+- *(1) Filter NOT (rand(-1515289268025792238) = 0.0)
+- *(1) FileScan parquet [_1#127,_2#128,_3#129,_4#130,_5#131] Batched: true, 
DataFilters: [], Format: Parquet, Location: 
InMemoryFileIndex[dbfs:/local_disk0/tmp/column-pruning-bug4043817424882943496], 
PartitionFilters: [], PushedFilters: [], ReadSchema: 
struct<_1:int,_2:int,_3:int,_4:int,_5:int>{code}
Similarly, a non-deterministic expression in a second projection can end up 
being collapsed such that it prevents column pruning:
{code:java}
fromParquet.select("_1").select($"_1", rand()).explain

== Physical Plan ==
*(1) Project [_1#127, rand(1267140591146561563) AS 
rand(1267140591146561563)#141]
+- *(1) FileScan parquet [_1#127,_2#128,_3#129,_4#130,_5#131] Batched: true, 
DataFilters: [], Format: Parquet, Location: 
InMemoryFileIndex[dbfs:/local_disk0/tmp/column-pruning-bug4043817424882943496], 
PartitionFilters: [], PushedFilters: [], ReadSchema: 
struct<_1:int,_2:int,_3:int,_4:int,_5:int>
{code}
I believe that this is caused by SPARK-10316: the Parquet column pruning code 
relies on the [{{PhysicalOperation}} unapply 
method|https://github.com/apache/spark/blob/v2.4.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L43]
 for extracting projects and filters and this helper purposely fails to match 
if _any_ projection or filter is non-deterministic.

It looks like this conservative behavior may have originally been added to 
avoid pushdown / re-ordering of non-deterministic filter expressions. However, 
in this case I feel that it's _too_ conservative: even though we can't push 
down non-deterministic filters we should still be able to perform column 
pruning. 

/cc [~cloud_fan] and [~marmbrus] (it looks like you [discussed collapsing of 
non-deterministic 
projects|https://github.com/apache/spark/pull/8486#issuecomment-136036533] in 
the SPARK-10316 PR, which is related to why the third example above did not 
prune).
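
For completeness, a hedged user-side mitigation sketch (not a fix for the extractor): cut the lineage after the pruning select so the scan only reads the needed column, at the cost of materializing the pruned data, then apply the non-deterministic filter on top.

{code:java}
import org.apache.spark.sql.functions.rand

val pruned = fromParquet.select("_1").localCheckpoint()  // scan now reads only _1
pruned.filter(rand() =!= 0).explain()
{code}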






[jira] [Resolved] (SPARK-27968) ArrowEvalPythonExec.evaluate shouldn't eagerly read the first row

2019-06-06 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-27968.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24816
[https://github.com/apache/spark/pull/24816]

> ArrowEvalPythonExec.evaluate shouldn't eagerly read the first row
> -
>
> Key: SPARK-27968
> URL: https://issues.apache.org/jira/browse/SPARK-27968
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
> Fix For: 3.0.0
>
>
> An issue mentioned here: 
> https://github.com/apache/spark/pull/24734/files#r288377915, could be 
> decoupled from that PR.






[jira] [Updated] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames

2019-06-06 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-26412:
--
Description: 
Pandas UDF is the ideal connection between PySpark and DL model inference 
workload. However, user needs to load the model file first to make predictions. 
It is common to see models of size ~100MB or bigger. If the Pandas UDF 
execution is limited to each batch, user needs to repeatedly load the same 
model for every batch in the same python worker process, which is inefficient.

We can provide users the iterator of batches in pd.DataFrame and let user code 
handle it:

{code}
@pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
def predict(batch_iter):
  model = ... # load model
  for batch in batch_iter:
yield model.predict(batch)
{code}

The type of each batch is:
* a pd.Series if UDF is called with a single non-struct-type column
* a tuple of pd.Series if UDF is called with more than one Spark DF columns
* a pd.DataFrame if UDF is called with a single StructType column

Examples:

{code}
@pandas_udf(...)
def evaluate(batch_iter):
  model = ... # load model
  for features, label in batch_iter:
pred = model.predict(features)
yield (pred - label).abs()

df.select(evaluate(col("features"), col("label")).alias("err"))
{code}

{code}
@pandas_udf(...)
def evaluate(pdf_iter):
  model = ... # load model
  for pdf in pdf_iter:
pred = model.predict(pdf['x'])
yield (pred - pdf['y']).abs()

df.select(evaluate(struct(col("features"), col("label"))).alias("err"))
{code}

If the UDF doesn't return the same number of records for the entire partition, 
user should see an error. We don't restrict that every yield should match the 
input batch size.

Another benefit is with iterator interface and asyncio from Python, it is 
flexible for users to implement data pipelining.

cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]

  was:
Pandas UDF is the ideal connection between PySpark and DL model inference 
workload. However, user needs to load the model file first to make predictions. 
It is common to see models of size ~100MB or bigger. If the Pandas UDF 
execution is limited to each batch, user needs to repeatedly load the same 
model for every batch in the same python worker process, which is inefficient.

We can provide users the iterator of batches in pd.DataFrame and let user code 
handle it:

{code}
@pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
def predict(batch_iter):
  model = ... # load model
  for batch in batch_iter:
yield model.predict(batch)
{code}

The type of each batch is:
* a pd.Series if UDF is called with a single non-struct-type column
* a tuple of pd.Series if predict is called with more than one Spark DF columns
* a pd.DataFrame if predict is called with a single StructType column

{code}
@pandas_udf(...)
def evaluate(batch_iter):
  model = ... # load model
  for features, label in batch_iter:
pred = model.predict(features)
yield (pred - label).abs()

df.select(evaluate(col("features"), col("label")).alias("err"))
{code}

{code}
@pandas_udf(...)
def evaluate(pdf_iter):
  model = ... # load model
  for pdf in pdf_iter:
pred = model.predict(pdf['x'])
yield (pred - pdf['y']).abs()

df.select(evaluate(struct(col("features"), col("label"))).alias("err"))
{code}

Another benefit is with iterator interface and asyncio from Python, it is 
flexible for users to implement data pipelining.

cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]


> Allow Pandas UDF to take an iterator of pd.DataFrames
> -
>
> Key: SPARK-26412
> URL: https://issues.apache.org/jira/browse/SPARK-26412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
>
> Pandas UDF is the ideal connection between PySpark and DL model inference 
> workload. However, user needs to load the model file first to make 
> predictions. It is common to see models of size ~100MB or bigger. If the 
> Pandas UDF execution is limited to each batch, user needs to repeatedly load 
> the same model for every batch in the same python worker process, which is 
> inefficient.
> We can provide users the iterator of batches in pd.DataFrame and let user 
> code handle it:
> {code}
> @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
> def predict(batch_iter):
>   model = ... # load model
>   for batch in batch_iter:
> yield model.predict(batch)
> {code}
> The type of each batch is:
> * a pd.Series if UDF is called with a single non-struct-type column
> * a tuple of pd.Series if UDF is called with more than one Spark DF columns
> * a pd.DataFrame if UDF is called with a single StructType column
> 

[jira] [Updated] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames

2019-06-06 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-26412:
--
Description: 
Pandas UDF is the ideal connection between PySpark and DL model inference 
workload. However, user needs to load the model file first to make predictions. 
It is common to see models of size ~100MB or bigger. If the Pandas UDF 
execution is limited to each batch, user needs to repeatedly load the same 
model for every batch in the same python worker process, which is inefficient.

We can provide users the iterator of batches in pd.DataFrame and let user code 
handle it:

{code}
@pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
def predict(batch_iter):
  model = ... # load model
  for batch in batch_iter:
yield model.predict(batch)
{code}

The type of each batch is:
* a pd.Series if UDF is called with a single non-struct-type column
* a tuple of pd.Series if predict is called with more than one Spark DF columns
* a pd.DataFrame if predict is called with a single StructType column

{code}
@pandas_udf(...)
def evaluate(batch_iter):
  model = ... # load model
  for features, label in batch_iter:
pred = model.predict(features)
yield (pred - label).abs()

df.select(evaluate(col("features"), col("label")).alias("err"))
{code}

{code}
@pandas_udf(...)
def evaluate(pdf_iter):
  model = ... # load model
  for pdf in pdf_iter:
    pred = model.predict(pdf['x'])
    yield (pred - pdf['y']).abs()

df.select(evaluate(struct(col("features"), col("label"))).alias("err"))
{code}

Another benefit is that, with the iterator interface and asyncio from Python, 
users have the flexibility to implement data pipelining.

cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]

  was:
Pandas UDF is the ideal connection between PySpark and DL model inference 
workload. However, user needs to load the model file first to make predictions. 
It is common to see models of size ~100MB or bigger. If the Pandas UDF 
execution is limited to each batch, user needs to repeatedly load the same 
model for every batch in the same python worker process, which is inefficient.

We can provide users the iterator of batches in pd.DataFrame and let user code 
handle it:

{code}
@pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
def predict(batch_iter):
  model = ... # load model
  for batch in batch_iter:
    yield model.predict(batch)
{code}

The type of each batch is:
* pd.Series if UDF is called with a single non-struct-type column
* a tuple of pd.Series if predict is called with more than one Spark DF columns
* a pd.DataFrame if predict is called with a single StructType column

{code}
@pandas_udf(...)
def evaluate(batch_iter):
  model = ... # load model
  for features, label in batch_iter:
    pred = model.predict(features)
    yield (pred - label).abs()

df.select(evaluate(col("features"), col("label")).alias("err"))
{code}

{code}
@pandas_udf(...)
def evaluate(pdf_iter):
  model = ... # load model
  for pdf in pdf_iter:
    pred = model.predict(pdf['x'])
    yield (pred - pdf['y']).abs()

df.select(evaluate(struct(col("features"), col("label"))).alias("err"))
{code}

Another benefit is with iterator interface and asyncio from Python, it is 
flexible for users to implement data pipelining.

cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]


> Allow Pandas UDF to take an iterator of pd.DataFrames
> -
>
> Key: SPARK-26412
> URL: https://issues.apache.org/jira/browse/SPARK-26412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
>
> Pandas UDF is the ideal connection between PySpark and DL model inference 
> workload. However, the user needs to load the model file first to make 
> predictions. It is common to see models of size ~100MB or bigger. If the 
> Pandas UDF execution is limited to one batch at a time, the user needs to 
> repeatedly load the same model for every batch in the same Python worker 
> process, which is inefficient.
> We can provide users an iterator of batches in pd.DataFrame and let user 
> code handle it:
> {code}
> @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
> def predict(batch_iter):
>   model = ... # load model
>   for batch in batch_iter:
>     yield model.predict(batch)
> {code}
> The type of each batch is:
> * a pd.Series if UDF is called with a single non-struct-type column
> * a tuple of pd.Series if predict is called with more than one Spark DF 
> column
> * a pd.DataFrame if predict is called with a single StructType column
> {code}
> @pandas_udf(...)
> def evaluate(batch_iter):
>   model = ... # load model
>   for features, label in batch_iter:
>     pred = model.predict(features)
> yield 

[jira] [Updated] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames

2019-06-06 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-26412:
--
Description: 
Pandas UDF is the ideal connection between PySpark and DL model inference 
workload. However, the user needs to load the model file first to make predictions. 
It is common to see models of size ~100MB or bigger. If the Pandas UDF 
execution is limited to one batch at a time, the user needs to repeatedly load the 
same model for every batch in the same Python worker process, which is inefficient.

We can provide users an iterator of batches in pd.DataFrame and let user code 
handle it:

{code}
@pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
def predict(batch_iter):
  model = ... # load model
  for batch in batch_iter:
    yield model.predict(batch)
{code}

The type of each batch is:
* pd.Series if UDF is called with a single non-struct-type column
* a tuple of pd.Series if predict is called with more than one Spark DF column
* a pd.DataFrame if predict is called with a single StructType column

{code}
@pandas_udf(...)
def evaluate(batch_iter):
  model = ... # load model
  for features, label in batch_iter:
    pred = model.predict(features)
    yield (pred - label).abs()

df.select(evaluate(col("features"), col("label")).alias("err"))
{code}

{code}
@pandas_udf(...)
def evaluate(pdf_iter):
  model = ... # load model
  for pdf in pdf_iter:
    pred = model.predict(pdf['x'])
    yield (pred - pdf['y']).abs()

df.select(evaluate(struct(col("features"), col("label"))).alias("err"))
{code}

Another benefit is that, with the iterator interface and asyncio from Python, 
users have the flexibility to implement data pipelining.

cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]

  was:
Pandas UDF is the ideal connection between PySpark and DL model inference 
workload. However, user needs to load the model file first to make predictions. 
It is common to see models of size ~100MB or bigger. If the Pandas UDF 
execution is limited to each batch, user needs to repeatedly load the same 
model for every batch in the same python worker process, which is inefficient.

We can provide users the iterator of batches in pd.DataFrame and let user code 
handle it:

{code}
@pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITERATOR)
def predict(batch_iter):
  model = ... # load model
  for batch in batch_iter:
    yield model.predict(batch)
{code}

We might add a contract that each yield must match the corresponding batch size.

Another benefit is with iterator interface and asyncio from Python, it is 
flexible for users to implement data pipelining.

cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]


> Allow Pandas UDF to take an iterator of pd.DataFrames
> -
>
> Key: SPARK-26412
> URL: https://issues.apache.org/jira/browse/SPARK-26412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
>
> Pandas UDF is the ideal connection between PySpark and DL model inference 
> workload. However, the user needs to load the model file first to make 
> predictions. It is common to see models of size ~100MB or bigger. If the 
> Pandas UDF execution is limited to one batch at a time, the user needs to 
> repeatedly load the same model for every batch in the same Python worker 
> process, which is inefficient.
> We can provide users an iterator of batches in pd.DataFrame and let user 
> code handle it:
> {code}
> @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
> def predict(batch_iter):
>   model = ... # load model
>   for batch in batch_iter:
>     yield model.predict(batch)
> {code}
> The type of each batch is:
> * pd.Series if UDF is called with a single non-struct-type column
> * a tuple of pd.Series if predict is called with more than one Spark DF 
> column
> * a pd.DataFrame if predict is called with a single StructType column
> {code}
> @pandas_udf(...)
> def evaluate(batch_iter):
>   model = ... # load model
>   for features, label in batch_iter:
>     pred = model.predict(features)
>     yield (pred - label).abs()
> df.select(evaluate(col("features"), col("label")).alias("err"))
> {code}
> {code}
> @pandas_udf(...)
> def evaluate(pdf_iter):
>   model = ... # load model
>   for pdf in pdf_iter:
>     pred = model.predict(pdf['x'])
>     yield (pred - pdf['y']).abs()
> df.select(evaluate(struct(col("features"), col("label"))).alias("err"))
> {code}
> Another benefit is that, with the iterator interface and asyncio from Python, 
> users have the flexibility to implement data pipelining.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (SPARK-27963) Allow dynamic allocation without an external shuffle service

2019-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27963:


Assignee: Apache Spark

> Allow dynamic allocation without an external shuffle service
> 
>
> Key: SPARK-27963
> URL: https://issues.apache.org/jira/browse/SPARK-27963
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Major
>
> It would be useful for users to be able to enable dynamic allocation without 
> the need to provision an external shuffle service. One immediate use case is 
> the ability to use dynamic allocation on Kubernetes, which doesn't yet have 
> that service.
> This has been suggested before (e.g. 
> https://github.com/apache/spark/pull/24083, which was attached to the 
> k8s-specific SPARK-24432), and can actually be done without affecting the 
> internals of the Spark scheduler (aside from the dynamic allocation code). 
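As a rough sketch of how this could surface to users from PySpark, the snippet below enables dynamic allocation with the external shuffle service turned off; the shuffle-tracking config name is an assumption for the sketch, not something specified in this ticket:

{code}
# Illustrative only: dynamic allocation without an external shuffle service.
from pyspark.sql import SparkSession

spark = (
  SparkSession.builder
  .appName("dynalloc-without-ess")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "false")                   # no external shuffle service
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")  # assumed config name
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .getOrCreate()
)
{code}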



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27963) Allow dynamic allocation without an external shuffle service

2019-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27963:


Assignee: (was: Apache Spark)

> Allow dynamic allocation without an external shuffle service
> 
>
> Key: SPARK-27963
> URL: https://issues.apache.org/jira/browse/SPARK-27963
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> It would be useful for users to be able to enable dynamic allocation without 
> the need to provision an external shuffle service. One immediate use case is 
> the ability to use dynamic allocation on Kubernetes, which doesn't yet have 
> that service.
> This has been suggested before (e.g. 
> https://github.com/apache/spark/pull/24083, which was attached to the 
> k8s-specific SPARK-24432), and can actually be done without affecting the 
> internals of the Spark scheduler (aside from the dynamic allocation code). 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27760) Spark resources - user configs change .count to be .amount

2019-06-06 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-27760.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

> Spark resources - user configs change .count to be .amount
> --
>
> Key: SPARK-27760
> URL: https://issues.apache.org/jira/browse/SPARK-27760
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.0.0
>
>
> For the Spark resources, we created the config
> spark.\{driver/executor}.resource.\{resourceName}.count
> I think we should change .count to be .amount. That more easily allows users 
> to specify things with a suffix, like memory, in a single config, since they can 
> combine the value and unit. Without this they would have to specify 2 
> separate configs (like .count and .unit), which seems more of a hassle for the 
> user.
> Note the yarn configs for resources use an amount:  
> spark.yarn.\{executor/driver/am}.resource=<amount>, where the <amount> is the value 
> and unit together. I think that makes a lot of sense. Filed a 
> separate Jira to add .amount to the yarn configs as well.
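A hedged sketch of the proposed naming as it could look from PySpark; the {{gpu}} resource name is only an example, not part of this ticket:

{code}
# Illustrative only: requesting resources with the proposed .amount suffix
# (previously .count). The "gpu" resource name is an example.
from pyspark.sql import SparkSession

spark = (
  SparkSession.builder
  .appName("resource-amount-example")
  .config("spark.driver.resource.gpu.amount", "1")
  .config("spark.executor.resource.gpu.amount", "2")
  .getOrCreate()
)
{code}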



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24523) InterruptedException when closing SparkContext

2019-06-06 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857998#comment-16857998
 ] 

Steve Loughran commented on SPARK-24523:


In HDFS:
{code}
if (!file.isUnderConstruction()) {
  throw new LeaseExpiredException("File is not open for writing: "
  + leaseExceptionString(src, fileId, holder));
}
{code}

To get there the file must exist, but the Namenode doesn't think the file is 
still being written to: the caller doesn't have a lease on it.

> InterruptedException when closing SparkContext
> --
>
> Key: SPARK-24523
> URL: https://issues.apache.org/jira/browse/SPARK-24523
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0, 2.3.1
> Environment: EMR 5.14.0, S3/HDFS inputs and outputs; EMR 5.17
>  
>  
>  
>Reporter: Umayr Hassan
>Priority: Major
> Attachments: spark-stop-jstack.log.1, spark-stop-jstack.log.2, 
> spark-stop-jstack.log.3, thread-dump.log
>
>
> I'm running a Scala application in EMR with the following properties:
> {{--master yarn --deploy-mode cluster --driver-memory 13g --executor-memory 
> 30g --executor-cores 5 --conf spark.default.parallelism=400 --conf 
> spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.maxExecutors=20 --conf 
> spark.eventLog.dir=hdfs:///var/log/spark/apps --conf 
> spark.eventLog.enabled=true --conf 
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf 
> spark.scheduler.listenerbus.eventqueue.capacity=2 --conf 
> spark.shuffle.service.enabled=true --conf spark.sql.shuffle.partitions=400 
> --conf spark.yarn.maxAppAttempts=1}}
> The application runs fine till SparkContext is (automatically) closed, at 
> which point the SparkContext object throws. 
> {{18/06/10 10:44:43 ERROR Utils: Uncaught exception in thread pool-4-thread-1 
> java.lang.InterruptedException at java.lang.Object.wait(Native Method) at 
> java.lang.Thread.join(Thread.java:1252) at 
> java.lang.Thread.join(Thread.java:1326) at 
> org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:133) at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>  at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:54) at 
> org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219) at 
> org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1915)
>  at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) at 
> org.apache.spark.SparkContext.stop(SparkContext.scala:1914) at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572) 
> at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) 
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988) at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>  at scala.util.Try$.apply(Try.scala:192) at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)}}
>  
> I've not seen this behavior in Spark 2.0.2 and Spark 2.2.0 (for the same 
> application), so I'm not sure which change is causing Spark 2.3 to throw. Any 
> ideas?
> best,
> Umayr



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To 

[jira] [Commented] (SPARK-24523) InterruptedException when closing SparkContext

2019-06-06 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857995#comment-16857995
 ] 

Steve Loughran commented on SPARK-24523:


The HDFS failure looks more serious; the timeout is a side effect: the flush 
wasn't taking place, so the shutdown logic (correctly) gave up.

> InterruptedException when closing SparkContext
> --
>
> Key: SPARK-24523
> URL: https://issues.apache.org/jira/browse/SPARK-24523
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0, 2.3.1
> Environment: EMR 5.14.0, S3/HDFS inputs and outputs; EMR 5.17
>  
>  
>  
>Reporter: Umayr Hassan
>Priority: Major
> Attachments: spark-stop-jstack.log.1, spark-stop-jstack.log.2, 
> spark-stop-jstack.log.3, thread-dump.log
>
>
> I'm running a Scala application in EMR with the following properties:
> {{--master yarn --deploy-mode cluster --driver-memory 13g --executor-memory 
> 30g --executor-cores 5 --conf spark.default.parallelism=400 --conf 
> spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.maxExecutors=20 --conf 
> spark.eventLog.dir=hdfs:///var/log/spark/apps --conf 
> spark.eventLog.enabled=true --conf 
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf 
> spark.scheduler.listenerbus.eventqueue.capacity=2 --conf 
> spark.shuffle.service.enabled=true --conf spark.sql.shuffle.partitions=400 
> --conf spark.yarn.maxAppAttempts=1}}
> The application runs fine till SparkContext is (automatically) closed, at 
> which point the SparkContext object throws. 
> {{18/06/10 10:44:43 ERROR Utils: Uncaught exception in thread pool-4-thread-1 
> java.lang.InterruptedException at java.lang.Object.wait(Native Method) at 
> java.lang.Thread.join(Thread.java:1252) at 
> java.lang.Thread.join(Thread.java:1326) at 
> org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:133) at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>  at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:54) at 
> org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219) at 
> org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1915)
>  at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) at 
> org.apache.spark.SparkContext.stop(SparkContext.scala:1914) at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572) 
> at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) 
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988) at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>  at scala.util.Try$.apply(Try.scala:192) at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)}}
>  
> I've not seen this behavior in Spark 2.0.2 and Spark 2.2.0 (for the same 
> application), so I'm not sure which change is causing Spark 2.3 to throw. Any 
> ideas?
> best,
> Umayr



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27968) ArrowEvalPythonExec.evaluate shouldn't eagerly read the first row

2019-06-06 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-27968:
-

 Summary: ArrowEvalPythonExec.evaluate shouldn't eagerly read the 
first row
 Key: SPARK-27968
 URL: https://issues.apache.org/jira/browse/SPARK-27968
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


An issue mentioned here: 
https://github.com/apache/spark/pull/24734/files#r288377915, could be decoupled 
from that PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24523) InterruptedException when closing SparkContext

2019-06-06 Thread Martin Studer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857979#comment-16857979
 ] 

Martin Studer commented on SPARK-24523:
---

We're observing the same issue on Hortonworks HDP 2.6.5 with Spark 2.3.0 and 
Hadoop 2.7.3. Specifically note the DFSClient exceptions.
{noformat}
19/06/06 18:00:36 INFO SparkContext: Invoking stop() from shutdown hook
19/06/06 18:00:36 INFO SparkUI: Stopped Spark web UI at http://:4040
19/06/06 18:00:46 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, 
java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
at java.util.concurrent.FutureTask.get(FutureTask.java:205)
at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
19/06/06 18:00:46 ERROR Utils: Uncaught exception in thread pool-1-thread-1
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
at java.lang.Thread.join(Thread.java:1326)
at 
org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:133)
at 
org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
at 
org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at 
org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219)
at 
org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1922)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1921)
at 
org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:573)
at 
org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/06/06 18:00:46 WARN DFSClient: Unable to persist blocks in hflush for 
/spark2-history/application_1559195453904_0022.inprogress
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on /spark2-history/application_1559195453904_0022.inprogress (inode 
7393123): File is not open for writing. Holder 
DFSClient_NONMAPREDUCE_128063675_20 does not have any open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3712)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.fsync(FSNamesystem.java:4324)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.fsync(NameNodeRpcServer.java:1354)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.fsync(ClientNamenodeProtocolServerSideTranslatorPB.java:926)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
at 

[jira] [Assigned] (SPARK-27961) DataSourceV2Relation should not have refresh method

2019-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27961:


Assignee: (was: Apache Spark)

> DataSourceV2Relation should not have refresh method
> ---
>
> Key: SPARK-27961
> URL: https://issues.apache.org/jira/browse/SPARK-27961
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> The newly added `Refresh` method in [PR 
> #24401|https://github.com/apache/spark/pull/24401] prevented me from moving 
> DataSourceV2Relation into catalyst. It calls `case table: FileTable => 
> table.fileIndex.refresh()` while `FileTable` belongs to sql/core.
> More importantly, [~rdblue] pointed out that DataSourceV2Relation is immutable by 
> design, so it should not have a refresh method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27961) DataSourceV2Relation should not have refresh method

2019-06-06 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27961:


Assignee: Apache Spark

> DataSourceV2Relation should not have refresh method
> ---
>
> Key: SPARK-27961
> URL: https://issues.apache.org/jira/browse/SPARK-27961
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: Apache Spark
>Priority: Minor
>
> The newly added `Refresh` method in [PR 
> #24401|https://github.com/apache/spark/pull/24401] prevented me from moving 
> DataSourceV2Relation into catalyst. It calls `case table: FileTable => 
> table.fileIndex.refresh()` while `FileTable` belongs to sql/core.
> More importantly, [~rdblue] pointed out that DataSourceV2Relation is immutable by 
> design, so it should not have a refresh method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27961) DataSourceV2Relation should not have refresh method

2019-06-06 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857961#comment-16857961
 ] 

Gengliang Wang commented on SPARK-27961:


So, will Spark support "refresh table" in the DSV2 implementation?

> DataSourceV2Relation should not have refresh method
> ---
>
> Key: SPARK-27961
> URL: https://issues.apache.org/jira/browse/SPARK-27961
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> The newly added `Refresh` method in [PR 
> #24401|https://github.com/apache/spark/pull/24401] prevented me from moving 
> DataSourceV2Relation into catalyst. It calls `case table: FileTable => 
> table.fileIndex.refresh()` while `FileTable` belongs to sql/core.
> More importantly, [~rdblue] pointed out that DataSourceV2Relation is immutable by 
> design, so it should not have a refresh method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27918) Add boolean.sql

2019-06-06 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-27918.
-
   Resolution: Fixed
 Assignee: Yuming Wang
Fix Version/s: 3.0.0

> Add boolean.sql
> ---
>
> Key: SPARK-27918
> URL: https://issues.apache.org/jira/browse/SPARK-27918
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> In this ticket, we plan to add the regression test cases of 
> [https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/boolean.sql].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27924) ANSI SQL: Boolean Test

2019-06-06 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27924:

Description: 
{quote}<boolean test> ::=
   <boolean primary> [ IS [ NOT ] <truth value> ]
 <truth value> ::=
    TRUE
  | FALSE
  | UNKNOWN{quote}
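A small sketch of how the boolean test above could be exercised from PySpark once the syntax is supported (this ticket proposes adding it; the query itself is only an example):

{code}
# Illustrative only: the boolean test applied to an UNKNOWN (NULL) comparison.
# Assumes the syntax is implemented, which is what this ticket proposes.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("boolean-test-example").getOrCreate()

# NULL = 1 evaluates to UNKNOWN; UNKNOWN IS NOT TRUE evaluates to true.
spark.sql("SELECT (NULL = 1) IS NOT TRUE AS r").show()
{code}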
 
Currently, the following DBMSs support the syntax:
 * PostgreSQL: [https://www.postgresql.org/docs/9.1/functions-comparison.html]
 * Hive: https://issues.apache.org/jira/browse/HIVE-13583
 * Redshift: 
[https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html]
 * Vertica: 
[https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/Boolean-predicate.htm]

  was:
Support for various forms of search conditions is mandatory in the SQL 
standard. For example: {{<search condition> is not true;}}.

 

Redshift, Vertica and Hive support this feature.

 

[https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html]
 
[https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/Boolean-predicate.htm]
 https://issues.apache.org/jira/browse/HIVE-13583

[https://www.postgresql.org/docs/12/features-sql-standard.html]

 

 


> ANSI SQL: Boolean Test
> --
>
> Key: SPARK-27924
> URL: https://issues.apache.org/jira/browse/SPARK-27924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {quote}<boolean test> ::=
>    <boolean primary> [ IS [ NOT ] <truth value> ]
>  <truth value> ::=
>     TRUE
>   | FALSE
>   | UNKNOWN{quote}
>  
> Currently, the following DBMSs support the syntax:
>  * PostgreSQL: [https://www.postgresql.org/docs/9.1/functions-comparison.html]
>  * Hive: https://issues.apache.org/jira/browse/HIVE-13583
>  * Redshift: 
> [https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html]
>  * Vertica: 
> [https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/Boolean-predicate.htm]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27924) ANSI SQL: Boolean Test

2019-06-06 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27924:

Summary: ANSI SQL: Boolean Test  (was: E061-14: Search Conditions)

> ANSI SQL: Boolean Test
> --
>
> Key: SPARK-27924
> URL: https://issues.apache.org/jira/browse/SPARK-27924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Support for various forms of search conditions is mandatory in the SQL 
> standard. For example: {{<search condition> is not true;}}.
>  
> Redshift, Vertica and Hive support this feature.
>  
> [https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html]
>  
> [https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/Boolean-predicate.htm]
>  https://issues.apache.org/jira/browse/HIVE-13583
> [https://www.postgresql.org/docs/12/features-sql-standard.html]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15015) Log statements lack file name/number

2019-06-06 Thread John-Michael Reed (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857902#comment-16857902
 ] 

John-Michael Reed commented on SPARK-15015:
---

[~srowen] - And no, it is not already resolved, because I never made or tested 
the change, and I can safely assume that nobody else has either.

> Log statements lack file name/number
> 
>
> Key: SPARK-15015
> URL: https://issues.apache.org/jira/browse/SPARK-15015
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.6.1
> Environment: All
>Reporter: John-Michael Reed
>Priority: Trivial
>  Labels: bulk-closed, debug, log
>
> I would like it if the Apache Spark project had file names and line numbers 
> in its log statements like this:
> http://i.imgur.com/4hvGQ0t.png
> The example uses my library, http://johnreedlol.github.io/scala-trace-debug/, 
> but https://github.com/lihaoyi/sourcecode is also useful for this purpose. 
> The real benefit in doing this is that the user of an IDE can jump to the 
> location of a log statement without having to set breakpoints.
> http://s29.postimg.org/ud0knou1j/debug_Screenshot_Crop.png
> Note that the arrow will go to the next log statement if each log statement 
> is hyperlinked.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27967) Fault tolerance broken: Race conditions: a supervised Driver is not relaunched and completely removed sometimes under Standalone cluster when Worker gracefully shuts dow

2019-06-06 Thread Igor Kamyshnikov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Kamyshnikov updated SPARK-27967:
-
Description: 
Synthetic test:
 1) run ZK
 2) run Master
 3) run Worker with remote debugging agent (required for enabling a breakpoint 
to demonstrate race conditions issue)
 4) submit a long running Driver with --supervise flag
 5) connect to Worker via remote debugger
 6) enable a breakpoint in the method:
 org.apache.spark.deploy.worker.DriverRunner#kill
{code:java}
  /** Terminate this driver (or prevent it from ever starting if not yet started) */
  private[worker] def kill(): Unit = {
    logInfo("Killing driver process!")
    killed = true
    synchronized {
      process.foreach { p =>
        val exitCode = Utils.terminateProcess(p, DRIVER_TERMINATE_TIMEOUT_MS)
        if (exitCode.isEmpty) { //< BREAKPOINT <<
          logWarning("Failed to terminate driver process: " + p +
            ". This process will likely be orphaned.")
        }
      }
    }
  }
{code}
7) send SIGTERM to Worker (or CTRL+C in Windows)
 8) check Spark Master Web UI: the Driver will appear in the *Completed 
Drivers* section with the state equal to *KILLED*

If there was no breakpoint then it is more likely that a new row with 
*RELAUNCHING* state would appear in the *Completed Drivers* section and a row 
with *SUBMITTED* state would remain in the *Running Drivers* section.

Explanation:
 1) Spark master relaunches a driver in response to "channelInactive" callback: 
org.apache.spark.rpc.netty.NettyRpcHandler#channelInactive
 which is triggered when the Worker process finishes.
 2) DriverRunner registers a shutdown hook here: 
org.apache.spark.deploy.worker.DriverRunner#start which calls the 
aforementioned "kill" method. Killing a driver can lead to reaching the 
following lines in the DriverRunner.start method:
{noformat}
// notify worker of final driver state, possible exception
worker.send(DriverStateChanged(driverId, finalState.get, finalException))
{noformat}
If this notification reaches Master then Driver is removed from the cluster as 
KILLED.

Real-world scenario (ver. 2.1.2):
 ZK, two Masters, the Active one loses its leadership, another becomes a new 
leader.
 Workers attempt to re-register with the new Master. But they report that they 
failed to do this. They execute *System.exit(1)* from 
org.apache.spark.deploy.worker.Worker#registerWithMaster.
 This System.exit results in executing shutdown hooks. And somehow the 
DriverStateChanged message reaches the new master.

 

*Worker* logs:
{noformat}
19/06/03 14:05:30 INFO Worker: Retrying connection to master (attempt # 5)
19/06/03 14:05:30 INFO Worker: Connecting to master 10.0.0.16:7077...
19/06/03 14:05:33 INFO Worker: Master has changed, new master is at 
spark://10.0.0.17:7077
19/06/03 14:05:33 ERROR TransportResponseHandler: Still have 4 requests 
outstanding when connection from /10.0.0.16:7077 is closed
19/06/03 14:05:33 ERROR Worker: Cannot register with master: 10.0.0.16:7077
java.io.IOException: Connection from /10.0.0.16:7077 closed
at 
org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:128)
at 
org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at 
io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at 
io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:278)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at 
io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at 
io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at 

[jira] [Created] (SPARK-27967) Fault tolerance broken: Race conditions: a supervised Driver is not relaunched and completely removed sometimes under Standalone cluster when Worker gracefully shuts dow

2019-06-06 Thread Igor Kamyshnikov (JIRA)
Igor Kamyshnikov created SPARK-27967:


 Summary: Fault tolerance broken: Race conditions: a supervised 
Driver is not relaunched and completely removed sometimes under Standalone 
cluster when Worker gracefully shuts down
 Key: SPARK-27967
 URL: https://issues.apache.org/jira/browse/SPARK-27967
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.2, 2.1.2
Reporter: Igor Kamyshnikov


Synthetic test:
 1) run ZK
 2) run Master
 3) run Worker with remote debugging agent (required for enabling a breakpoint 
to demonstrate race conditions issue)
 4) submit a long running Driver with --supervise flag
 5) connect to Worker via remote debugger
 6) enable a breakpoint in the method:
 org.apache.spark.deploy.worker.DriverRunner#kill
{code:java}
  /** Terminate this driver (or prevent it from ever starting if not yet started) */
  private[worker] def kill(): Unit = {
    logInfo("Killing driver process!")
    killed = true
    synchronized {
      process.foreach { p =>
        val exitCode = Utils.terminateProcess(p, DRIVER_TERMINATE_TIMEOUT_MS)
        if (exitCode.isEmpty) { //< BREAKPOINT <<
          logWarning("Failed to terminate driver process: " + p +
            ". This process will likely be orphaned.")
        }
      }
    }
  }
{code}
7) send SIGTERM to Worker (or CTRL+C in Windows)
 8) check Spark Master Web UI: the Driver will appear in the *Completed 
Drivers* section with the state equal to *KILLED*

If there was no breakpoint then it is more likely that a new row with 
*RELAUNCHING* state would appear in the *Completed Drivers* section and a row 
with *SUBMITTED* state would remain in the *Running Drivers* section.

Explanation:
 1) Spark master relaunches a driver in response to "channelInactive" callback: 
org.apache.spark.rpc.netty.NettyRpcHandler#channelInactive
 which is triggered when the Worker process finishes.
 2) DriverRunner registers a shutdown hook here: 
org.apache.spark.deploy.worker.DriverRunner#start which calls the 
aforementioned "kill" method. Killing a driver can lead to reaching the 
following lines in the DriverRunner.start method:
{noformat}
// notify worker of final driver state, possible exception
worker.send(DriverStateChanged(driverId, finalState.get, finalException))
{noformat}
If this notification reaches Master then Driver is removed from the cluster as 
KILLED.

Real-world scenario (ver. 2.1.2):
 ZK, two Masters, the Active one loses its leadership, another becomes a new 
leader.
 Workers attempt to re-register with the new master. But they report that they failed to 
do this. They execute *System.exit(1)* from 
org.apache.spark.deploy.worker.Worker#registerWithMaster.
 This System.exit results in executing shutdown hooks. And somehow the 
DriverStateChanged message reaches the new master.

 

*Worker* logs:
{noformat}
19/06/03 14:05:30 INFO Worker: Retrying connection to master (attempt # 5)
19/06/03 14:05:30 INFO Worker: Connecting to master 10.0.0.16:7077...
19/06/03 14:05:33 INFO Worker: Master has changed, new master is at 
spark://10.0.0.17:7077
19/06/03 14:05:33 ERROR TransportResponseHandler: Still have 4 requests 
outstanding when connection from /10.0.0.16:7077 is closed
19/06/03 14:05:33 ERROR Worker: Cannot register with master: 10.0.0.16:7077
java.io.IOException: Connection from /10.0.0.16:7077 closed
at 
org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:128)
at 
org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at 
io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at 
io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:278)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at 
io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at 

[jira] [Updated] (SPARK-27924) E061-14: Search Conditions

2019-06-06 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27924:

Target Version/s: 3.0.0

> E061-14: Search Conditions
> --
>
> Key: SPARK-27924
> URL: https://issues.apache.org/jira/browse/SPARK-27924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Support for various forms of search conditions is mandatory in the SQL 
> standard. For example: {{<search condition> is not true;}}.
>  
> Redshift, Vertica and Hive support this feature.
>  
> [https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html]
>  
> [https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/Boolean-predicate.htm]
>  https://issues.apache.org/jira/browse/HIVE-13583
> [https://www.postgresql.org/docs/12/features-sql-standard.html]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27883) Add aggregates.sql - Part2

2019-06-06 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-27883:
---

Assignee: Yuming Wang

> Add aggregates.sql - Part2
> --
>
> Key: SPARK-27883
> URL: https://issues.apache.org/jira/browse/SPARK-27883
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> In this ticket, we plan to add the regression test cases of 
> https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/aggregates.sql#L145-L350



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27883) Add aggregates.sql - Part2

2019-06-06 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-27883.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> Add aggregates.sql - Part2
> --
>
> Key: SPARK-27883
> URL: https://issues.apache.org/jira/browse/SPARK-27883
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> In this ticket, we plan to add the regression test cases of 
> https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/aggregates.sql#L145-L350



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26896) Add maven profiles for running tests with JDK 11

2019-06-06 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857843#comment-16857843
 ] 

Sean Owen commented on SPARK-26896:
---

BTW should we close this? I don't think we need a new profile.

> Add maven profiles for running tests with JDK 11
> 
>
> Key: SPARK-26896
> URL: https://issues.apache.org/jira/browse/SPARK-26896
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> Running unit tests w/ JDK 11 trips over some issues w/ the new module system. 
>  These can be worked around with the new {{--add-opens}} etc. commands.  I 
> think we need to add a build profile for JDK 11 to add some extra args to the 
> test runners.
> In particular:
> 1) removal of jaxb from java itself (used in pmml export in mllib)
> 2) Some reflective access which results in failures, eg. 
> {noformat}
> Unable to make field jdk.internal.ref.PhantomCleanable
> jdk.internal.ref.PhantomCleanable.prev accessible: module java.base does
> not "opens jdk.internal.ref" to unnamed module
> {noformat}
> 3) Some reflective access which results in warnings (you can add 
> {{--illegal-access=warn}} to see all of these).
> All I'm proposing we do here is put in the required handling to make these 
> problems go away, not necessarily do the "right" thing by no longer 
> referencing these unexposed internals.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25994) SPIP: Property Graphs, Cypher Queries, and Algorithms

2019-06-06 Thread Martin Junghanns (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857833#comment-16857833
 ] 

Martin Junghanns commented on SPARK-25994:
--

Hi [~RBerenguel]. Thanks a lot for your interest. I think a good way to get 
involved is to join the discussion on the PRs. As soon as we have 
https://issues.apache.org/jira/browse/SPARK-27300 merged, we'll open a PR that 
includes the API for property graph construction. I think this is the perfect 
opportunity to discuss your ideas and get an understanding of how the Python 
API should behave. wdyt?

> SPIP: Property Graphs, Cypher Queries, and Algorithms
> -
>
> Key: SPARK-25994
> URL: https://issues.apache.org/jira/browse/SPARK-25994
> Project: Spark
>  Issue Type: Epic
>  Components: Graph
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Martin Junghanns
>Priority: Major
>  Labels: SPIP
>
> Copied from the SPIP doc:
> {quote}
> GraphX was one of the foundational pillars of the Spark project, and is the 
> current graph component. This reflects the importance of the graphs data 
> model, which naturally pairs with an important class of analytic function, 
> the network or graph algorithm. 
> However, GraphX is not actively maintained. It is based on RDDs, and cannot 
> exploit Spark 2’s Catalyst query engine. GraphX is only available to Scala 
> users.
> GraphFrames is a Spark package, which implements DataFrame-based graph 
> algorithms, and also incorporates simple graph pattern matching with fixed 
> length patterns (called “motifs”). GraphFrames is based on DataFrames, but 
> has a semantically weak graph data model (based on untyped edges and 
> vertices). The motif pattern matching facility is very limited by comparison 
> with the well-established Cypher language. 
> The Property Graph data model has become quite widespread in recent years, 
> and is the primary focus of commercial graph data management and of graph 
> data research, both for on-premises and cloud data management. Many users of 
> transactional graph databases also wish to work with immutable graphs in 
> Spark.
> The idea is to define a Cypher-compatible Property Graph type based on 
> DataFrames; to replace GraphFrames querying with Cypher; to reimplement 
> GraphX/GraphFrames algos on the PropertyGraph type. 
> To achieve this goal, a core subset of Cypher for Apache Spark (CAPS), 
> reusing existing proven designs and code, will be employed in Spark 3.0. This 
> graph query processor, like CAPS, will overlay and drive the SparkSQL 
> Catalyst query engine, using the CAPS graph query planner.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27704) Change default class loader to ParallelGC

2019-06-06 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27704.
---
Resolution: Not A Problem

Do you mean garbage collector? The user can already choose the GC by setting 
JVM options. There is no default in Spark. I don't think there's anything to do 
here.

> Change default class loader to ParallelGC
> -
>
> Key: SPARK-27704
> URL: https://issues.apache.org/jira/browse/SPARK-27704
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Mihaly Toth
>Priority: Major
>
> In JDK 11 the default class loader changed from ParallelGC to G1GC. Even 
> though this gc performs better on pause times and interactivity, most of the 
> tasks that need to be processed are more sensitive to throughput and to the 
> amount of memory. G1 sacrifices these to some extent to avoid the big 
> pauses. As a result the user may perceive a regression compared to JDK 8. 
> Even worse, the regression may not be limited to performance only but some 
> jobs may start failing in case they do not fit into the memory they used to 
> be happy with when running with previous JDK.
> Some other kind of apps, like streaming ones, may rather use G1 because of 
> their more interactive, more realtime needs.
> With this jira it is proposed to have a configurable default GC for all spark 
> applications. This may be overridable by the user through command line 
> parameters. The default value of the default GC (in case it is not provided 
> in spark-defaults.conf) could be ParallelGC.
> I do not see this change as required, but I think it would benefit the user 
> experience.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27961) DataSourceV2Relation should not have refresh method

2019-06-06 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857823#comment-16857823
 ] 

Gengliang Wang commented on SPARK-27961:


[~jzhuge][~rdblue] Makes sense. I will remove it.

> DataSourceV2Relation should not have refresh method
> ---
>
> Key: SPARK-27961
> URL: https://issues.apache.org/jira/browse/SPARK-27961
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> The newly added `Refresh` method in [PR 
> #24401|https://github.com/apache/spark/pull/24401] prevented me from moving 
> DataSourceV2Relation into catalyst. It calls `case table: FileTable => 
> table.fileIndex.refresh()` while `FileTable` belongs to sql/core.
> More importantly, [~rdblue] pointed out that DataSourceV2Relation is immutable 
> by design, so it should not have a refresh method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27966) input_file_name empty when listing files in parallel

2019-06-06 Thread Christian Homberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Homberg updated SPARK-27966:
--
Attachment: input_file_name_bug

> input_file_name empty when listing files in parallel
> 
>
> Key: SPARK-27966
> URL: https://issues.apache.org/jira/browse/SPARK-27966
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
> Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
>  
> Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Workers: 3
> Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
>Reporter: Christian Homberg
>Priority: Minor
> Attachments: input_file_name_bug
>
>
> I ran into an issue similar to, and probably related to, SPARK-26128: 
> _org.apache.spark.sql.functions.input_file_name_ sometimes returns an empty string.
>  
> {code:java}
> df.select(input_file_name()).show(5,false)
> {code}
>  
> {code:java}
> +-----------------+
> |input_file_name()|
> +-----------------+
> |                 |
> |                 |
> |                 |
> |                 |
> |                 |
> +-----------------+
> {code}
> My environment is databricks and debugging the Log4j output showed me that 
> the issue occurred when the files are being listed in parallel, e.g. when 
> {code:java}
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 127; threshold: 32
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under:{code}
>  
> Everything's fine as long as
> {code:java}
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 6; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> {code}
>  
> Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  
> resolves the issue for me.
>  
> edit: the problem is not exclusively linked to listing files in parallel. 
> I've set up a larger cluster on which input_file_name did return the correct 
> filename even after parallel file listing. After inspecting the Log4j output 
> again, I assume it's linked to some kind of MetaStore being full. I've 
> attached a section of the Log4j output that I think should indicate why it's 
> failing. If you need more, please let me know.
>  
>  
>  
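
As a work-around sketch (the threshold value the reporter used is elided above; 
the 9999 below is only an illustrative large value, and the input path is 
hypothetical), raising the threshold keeps InMemoryFileIndex from listing in 
parallel:

{code:java}
import org.apache.spark.sql.functions.input_file_name

// Keep file listing sequential by raising the parallel-listing threshold.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", 9999L)

val df = spark.read.parquet("/mnt/data/events")  // hypothetical input path
df.select(input_file_name()).show(5, false)      // per the report, names are populated when listing is sequential
{code}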



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27966) input_file_name empty when listing files in parallel

2019-06-06 Thread Christian Homberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Homberg updated SPARK-27966:
--
Description: 
I ran into an issue similar to, and probably related to, SPARK-26128: 
_org.apache.spark.sql.functions.input_file_name_ sometimes returns an empty string.

 
{code:java}
df.select(input_file_name()).show(5,false)
{code}
 
{code:java}
+-----------------+
|input_file_name()|
+-----------------+
|                 |
|                 |
|                 |
|                 |
|                 |
+-----------------+
{code}
My environment is databricks and debugging the Log4j output showed me that the 
issue occurred when the files are being listed in parallel, e.g. when 
{code:java}
19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 127; threshold: 32
19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories in 
parallel under:{code}
 

Everything's fine as long as
{code:java}
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 6; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
{code}
 

Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  resolves 
the issue for me.

 

*edit: the problem is not exclusively linked to listing files in parallel. I've 
set up a larger cluster on which input_file_name did return the correct 
filename even after parallel file listing. After inspecting the Log4j output 
again, I assume it's linked to some kind of MetaStore being full. I've 
attached a section of the Log4j output that I think should indicate why it's 
failing. If you need more, please let me know.*


 

 

  was:
I ran into an issue similar to, and probably related to, SPARK-26128: 
_org.apache.spark.sql.functions.input_file_name_ sometimes returns an empty string.

 
{code:java}
df.select(input_file_name()).show(5,false)
{code}
 
{code:java}
+-----------------+
|input_file_name()|
+-----------------+
|                 |
|                 |
|                 |
|                 |
|                 |
+-----------------+
{code}
My environment is databricks and debugging the Log4j output showed me that the 
issue occurred when the files are being listed in parallel, e.g. when 
{code:java}
19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 127; threshold: 32
19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories in 
parallel under:{code}
 

Everything's fine as long as
{code:java}
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 6; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
{code}
 

Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  resolves 
the issue for me.

 

edit: the problem is not exclusively linked to listing files in parallel. I've 
set up a larger cluster on which input_file_name did return the correct 
filename even after parallel file listing. After inspecting the Log4j output 
again, I assume it's linked to some kind of MetaStore being full. I've 
attached a section of the Log4j output that I think should indicate why it's 
failing. If you need more, please let me know.

 

 

 


> input_file_name empty when listing files in parallel
> 
>
> Key: SPARK-27966
> URL: https://issues.apache.org/jira/browse/SPARK-27966
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
> Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
>  
> Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Workers: 3
> Driver Type: 

[jira] [Updated] (SPARK-27966) input_file_name empty when listing files in parallel

2019-06-06 Thread Christian Homberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Homberg updated SPARK-27966:
--
Description: 
I ran into an issue similar to, and probably related to, SPARK-26128: 
_org.apache.spark.sql.functions.input_file_name_ sometimes returns an empty string.

 
{code:java}
df.select(input_file_name()).show(5,false)
{code}
 
{code:java}
+-----------------+
|input_file_name()|
+-----------------+
|                 |
|                 |
|                 |
|                 |
|                 |
+-----------------+
{code}
My environment is databricks and debugging the Log4j output showed me that the 
issue occurred when the files are being listed in parallel, e.g. when 
{code:java}
19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 127; threshold: 32
19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories in 
parallel under:{code}
 

Everything's fine as long as
{code:java}
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 6; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
{code}
 

Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  resolves 
the issue for me.

 

edit: the problem is not exclusively linked to listing files in parallel. I've 
set up a larger cluster on which input_file_name did return the correct 
filename even after parallel file listing. After inspecting the Log4j output 
again, I assume it's linked to some kind of MetaStore being full. I've 
attached a section of the Log4j output that I think should indicate why it's 
failing. If you need more, please let me know.

 

 

 

  was:
I ran into an issue similar to, and probably related to, SPARK-26128: 
_org.apache.spark.sql.functions.input_file_name_ sometimes returns an empty string.

 
{code:java}
df.select(input_file_name()).show(5,false)
{code}
 
{code:java}
+-----------------+
|input_file_name()|
+-----------------+
|                 |
|                 |
|                 |
|                 |
|                 |
+-----------------+
{code}
My environment is databricks and debugging the Log4j output showed me that the 
issue occurred when the files are being listed in parallel, e.g. when 
{code:java}
19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 127; threshold: 32
19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories in 
parallel under:{code}
 

Everything's fine as long as
{code:java}
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 6; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
{code}
 

Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  resolves 
the issue for me.

 

edit: the problem is not exclusively linked to listing files in parallel. I've 
set up a larger cluster on which input_file_name did return the correct 
filename even after parallel file listing. After inspecting the Log4j output 
again, I assume that it's linked to some kind of MetaStore being full:

 

 

 


> input_file_name empty when listing files in parallel
> 
>
> Key: SPARK-27966
> URL: https://issues.apache.org/jira/browse/SPARK-27966
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
> Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
>  
> Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Workers: 3
> Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
>Reporter: Christian Homberg
>Priority: Minor
>
> I ran into 

[jira] [Updated] (SPARK-27966) input_file_name empty when listing files in parallel

2019-06-06 Thread Christian Homberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Homberg updated SPARK-27966:
--
Description: 
I ran into an issue similar to, and probably related to, SPARK-26128: 
_org.apache.spark.sql.functions.input_file_name_ sometimes returns an empty string.

 
{code:java}
df.select(input_file_name()).show(5,false)
{code}
 
{code:java}
+-----------------+
|input_file_name()|
+-----------------+
|                 |
|                 |
|                 |
|                 |
|                 |
+-----------------+
{code}
My environment is databricks and debugging the Log4j output showed me that the 
issue occurred when the files are being listed in parallel, e.g. when 
{code:java}
19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 127; threshold: 32
19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories in 
parallel under:{code}
 

Everything's fine as long as
{code:java}
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 6; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
{code}
 

Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  resolves 
the issue for me.

 

edit: the problem is not exclusively linked to listing files in parallel. I've 
set up a larger cluster on which input_file_name did return the correct 
filename even after parallel file listing. After inspecting the Log4j output 
again, I assume that it's linked to some kind of MetaStore being full:

 

 

 

  was:
I ran into an issue similar to, and probably related to, SPARK-26128: 
_org.apache.spark.sql.functions.input_file_name_ sometimes returns an empty string.

 
{code:java}
df.select(input_file_name()).show(5,false)
{code}
 
{code:java}
+-----------------+
|input_file_name()|
+-----------------+
|                 |
|                 |
|                 |
|                 |
|                 |
+-----------------+
{code}
My environment is databricks and debugging the Log4j output showed me that the 
issue occurred when the files are being listed in parallel, e.g. when 
{code:java}
19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 127; threshold: 32
19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories in 
parallel under:{code}
 

Everything's fine as long as
{code:java}
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 6; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
{code}
 

Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  resolves 
the issue for me.

 

edit: the problem is not exclusively linked to listing files in parallel. I've 
set up a larger cluster on which input_file_name did return the correct 
filename even after parallel file listing. After inspecting the Log4j output 
again, I assume that it's linked to


> input_file_name empty when listing files in parallel
> 
>
> Key: SPARK-27966
> URL: https://issues.apache.org/jira/browse/SPARK-27966
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
> Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
>  
> Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Workers: 3
> Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
>Reporter: Christian Homberg
>Priority: Minor
>
> I ran into an issue similar to, and probably related to, SPARK-26128: 
> _org.apache.spark.sql.functions.input_file_name_ sometimes returns an empty string.
>  
> {code:java}
> 

[jira] [Updated] (SPARK-27966) input_file_name empty when listing files in parallel

2019-06-06 Thread Christian Homberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Homberg updated SPARK-27966:
--
Description: 
I ran into an issue similar to, and probably related to, SPARK-26128: 
_org.apache.spark.sql.functions.input_file_name_ sometimes returns an empty string.

 
{code:java}
df.select(input_file_name()).show(5,false)
{code}
 
{code:java}
+-----------------+
|input_file_name()|
+-----------------+
|                 |
|                 |
|                 |
|                 |
|                 |
+-----------------+
{code}
My environment is databricks and debugging the Log4j output showed me that the 
issue occurred when the files are being listed in parallel, e.g. when 
{code:java}
19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 127; threshold: 32
19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories in 
parallel under:{code}
 

Everything's fine as long as
{code:java}
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 6; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
{code}
 

Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  resolves 
the issue for me.

 

edit: the problem is not exclusively linked to listing files in parallel. I've 
set up a larger cluster on which input_file_name did return the correct 
filename even after parallel file listing. After inspecting the Log4j output 
again, I assume that it's linked to

  was:
I ran into an issue similar to, and probably related to, SPARK-26128: 
_org.apache.spark.sql.functions.input_file_name_ sometimes returns an empty string.

 
{code:java}
df.select(input_file_name()).show(5,false)
{code}
 
{code:java}
+-----------------+
|input_file_name()|
+-----------------+
|                 |
|                 |
|                 |
|                 |
|                 |
+-----------------+
{code}
My environment is databricks and debugging the Log4j output showed me that the 
issue occurred when the files are being listed in parallel, e.g. when 
{code:java}
19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 127; threshold: 32
19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories in 
parallel under:{code}
 

Everything's fine as long as
{code:java}
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 6; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
{code}
 

Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  resolves 
the issue for me.


> input_file_name empty when listing files in parallel
> 
>
> Key: SPARK-27966
> URL: https://issues.apache.org/jira/browse/SPARK-27966
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
> Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
>  
> Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Workers: 3
> Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
>Reporter: Christian Homberg
>Priority: Minor
>
> I ran into an issue similar to, and probably related to, SPARK-26128: 
> _org.apache.spark.sql.functions.input_file_name_ sometimes returns an empty string.
>  
> {code:java}
> df.select(input_file_name()).show(5,false)
> {code}
>  
> {code:java}
> +-----------------+
> |input_file_name()|
> +-----------------+
> |                 |
> |                 |
> |                 |
> |                 |
> |                 |
> +-----------------+
> {code}
> My environment is databricks and debugging the Log4j output showed me that 
> the issue occurred when the files are being listed in 

[jira] [Commented] (SPARK-15015) Log statements lack file name/number

2019-06-06 Thread John-Michael Reed (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857772#comment-16857772
 ] 

John-Michael Reed commented on SPARK-15015:
---

[~srowen] [~WangTao] [~hyukjin.kwon],

I am just going to go ahead: pull from the master branch, create a JAR file 
with the modified logger, add that modified logger as a dependency of the 
build, and change the imports so that Spark uses the modified logger. The file 
name and line numbers are obtained through Scala macros at compile time, with 
virtually zero runtime overhead. I will test the new debug logger against the 
Spark codebase and then re-open this issue along with a proposed code change. 
If any of you has a problem with that, just let me know.

Sincerely,

- John
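
A minimal sketch of the compile-time capture described above, using the 
sourcecode library linked in the issue (the MacroLog wrapper is hypothetical 
and is not Spark's actual Logging trait):

{code:scala}
import sourcecode.{FileName, Line}

object MacroLog {
  // The implicit FileName and Line are filled in by macros at each call site,
  // so the caller's file and line number are baked in at compile time.
  def info(msg: => String)(implicit file: FileName, line: Line): Unit =
    println(s"INFO [${file.value}:${line.value}] $msg")
}

// MacroLog.info("starting stage")   // prints e.g. INFO [MyJob.scala:42] starting stage
{code}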

> Log statements lack file name/number
> 
>
> Key: SPARK-15015
> URL: https://issues.apache.org/jira/browse/SPARK-15015
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.6.1
> Environment: All
>Reporter: John-Michael Reed
>Priority: Trivial
>  Labels: bulk-closed, debug, log
>
> I would like it if the Apache Spark project had file names and line numbers 
> in its log statements like this:
> http://i.imgur.com/4hvGQ0t.png
> The example uses my library, http://johnreedlol.github.io/scala-trace-debug/, 
> but https://github.com/lihaoyi/sourcecode is also useful for this purpose. 
> The real benefit in doing this is that the user of an IDE can jump to the 
> location of a log statement without having to set breakpoints.
> http://s29.postimg.org/ud0knou1j/debug_Screenshot_Crop.png
> Note that the arrow will go to the next log statement if each log statement 
> is hyperlinked.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15015) Log statements lack file name/number

2019-06-06 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15015.
---
Resolution: Auto Closed

We're not going to reopen these unless there's a material change, like someone 
working on it. This one was closed because it affects a long-since-EOL 
version. It'd at least have to be checked against master to see whether it's 
already resolved, etc.

> Log statements lack file name/number
> 
>
> Key: SPARK-15015
> URL: https://issues.apache.org/jira/browse/SPARK-15015
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.6.1
> Environment: All
>Reporter: John-Michael Reed
>Priority: Trivial
>  Labels: bulk-closed, debug, log
>
> I would like it if the Apache Spark project had file names and line numbers 
> in its log statements like this:
> http://i.imgur.com/4hvGQ0t.png
> The example uses my library, http://johnreedlol.github.io/scala-trace-debug/, 
> but https://github.com/lihaoyi/sourcecode is also useful for this purpose. 
> The real benefit in doing this is that the user of an IDE can jump to the 
> location of a log statement without having to set breakpoints.
> http://s29.postimg.org/ud0knou1j/debug_Screenshot_Crop.png
> Note that the arrow will go to the next log statement if each log statement 
> is hyperlinked.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20856) support statement using nested joins

2019-06-06 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857719#comment-16857719
 ] 

Sean Owen commented on SPARK-20856:
---

We're not going to reopen these unless there's a material change, like someone 
working on it. This one was closed because it affects a long-since-EOL 
version. It'd at least have to be checked against master to see whether it's 
already resolved, etc.

> support statement using nested joins
> 
>
> Key: SPARK-20856
> URL: https://issues.apache.org/jira/browse/SPARK-20856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: N Campbell
>Priority: Major
>  Labels: bulk-closed
>
> While DB2, Oracle, etc. support a join expressed as follows, Spark SQL does 
> not.
> Not supported
> select * from 
>   cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
>  on tbint.rnum = tint.rnum
>  on tint.rnum = tsint.rnum
> versus written as shown
> select * from 
>   cert.tsint tsint inner join cert.tint tint on tsint.rnum = tint.rnum inner 
> join cert.tbint tbint on tint.rnum = tbint.rnum
>
> ERROR_STATE, SQL state: org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input 'on' expecting {, ',', '.', '[', 'WHERE', 'GROUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 
> 'IS', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', EQ, '<=>', 
> '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', 
> '^', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 4, pos 5)
> == SQL ==
> select * from 
>   cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
>  on tbint.rnum = tint.rnum
>  on tint.rnum = tsint.rnum
> -^^^
> , Query: select * from 
>   cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
>  on tbint.rnum = tint.rnum
>  on tint.rnum = tsint.rnum.
> SQLState:  HY000
> ErrorCode: 500051



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20856) support statement using nested joins

2019-06-06 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20856.
---
Resolution: Auto Closed

We're not going to reopen these unless there's a material change, like someone 
working on it. This one was closed because it affects a long-since-EOL 
version. It'd at least have to be checked against master to see whether it's 
already resolved, etc.

> support statement using nested joins
> 
>
> Key: SPARK-20856
> URL: https://issues.apache.org/jira/browse/SPARK-20856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: N Campbell
>Priority: Major
>  Labels: bulk-closed
>
> While DB2, Oracle, etc. support a join expressed as follows, Spark SQL does 
> not.
> Not supported
> select * from 
>   cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
>  on tbint.rnum = tint.rnum
>  on tint.rnum = tsint.rnum
> versus written as shown
> select * from 
>   cert.tsint tsint inner join cert.tint tint on tsint.rnum = tint.rnum inner 
> join cert.tbint tbint on tint.rnum = tbint.rnum
>
> ERROR_STATE, SQL state: org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input 'on' expecting {, ',', '.', '[', 'WHERE', 'GROUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 
> 'IS', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', EQ, '<=>', 
> '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', 
> '^', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 4, pos 5)
> == SQL ==
> select * from 
>   cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
>  on tbint.rnum = tint.rnum
>  on tint.rnum = tsint.rnum
> -^^^
> , Query: select * from 
>   cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
>  on tbint.rnum = tint.rnum
>  on tint.rnum = tsint.rnum.
> SQLState:  HY000
> ErrorCode: 500051



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21136) Misleading error message for typo in SQL

2019-06-06 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21136:
--
Priority: Minor  (was: Critical)

> Misleading error message for typo in SQL
> 
>
> Key: SPARK-21136
> URL: https://issues.apache.org/jira/browse/SPARK-21136
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>Assignee: Yesheng Ma
>Priority: Minor
>
> {code}
> scala> spark.sql("select * from a left joinn b on a.id = b.id").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'from' expecting {, 'WHERE', 'GROUP', 'ORDER', 
> 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 
> 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 9)
> == SQL ==
> select * from a left joinn b on a.id = b.id
> -^^^
> {code}
> The issue is that {{^^^}} points at {{from}}, not at {{joinn}}. The text of 
> the error makes no sense either. If {{*}}, {{a}}, and {{b}} are complex in 
> themselves, a misleading error like this can hinder debugging substantially.
> I tried to see if maybe I could fix this. Am I correct to deduce that the 
> error message originates in ANTLR4, which parses the query based on the 
> syntax defined in {{SqlBase.g4}}? If so, I guess I would have to figure out 
> how that syntax definition works, and why it misattributes the error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27966) input_file_name empty when listing files in parallel

2019-06-06 Thread Christian Homberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Homberg updated SPARK-27966:
--
Description: 
I ran into an issue similar to, and probably related to, SPARK-26128: 
_org.apache.spark.sql.functions.input_file_name_ sometimes returns an empty string.

 
{code:java}
df.select(input_file_name()).show(5,false)
{code}
 
{code:java}
+-----------------+
|input_file_name()|
+-----------------+
|                 |
|                 |
|                 |
|                 |
|                 |
+-----------------+
{code}
My environment is databricks and debugging the Log4j output showed me that the 
issue occurred when the files are being listed in parallel, e.g. when 
{code:java}
19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 127; threshold: 32
19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories in 
parallel under:{code}
 

Everything's fine as long as
{code:java}
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 6; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 0; threshold: 32
{code}
 

Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  resolves 
the issue for me.

  was:
I ran into an issue similar to, and probably related to, SPARK-26128: 
`org.apache.spark.sql.functions.input_file_name` sometimes returns an empty string.

My environment is databricks and debugging Log4j output showed me that the 
issue occurred when the files are being listed in parallel, e.g. when 

19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 127; threshold: 32

 

This is not an issue when listing less than 32 files. Alternatively setting 
spark.sql.sources.parallelPartitionDiscovery.threshold to  resolves the 
issue.


> input_file_name empty when listing files in parallel
> 
>
> Key: SPARK-27966
> URL: https://issues.apache.org/jira/browse/SPARK-27966
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
> Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
>  
> Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Workers: 3
> Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
>Reporter: Christian Homberg
>Priority: Minor
>
> I ran into an issue similar to, and probably related to, SPARK-26128: 
> _org.apache.spark.sql.functions.input_file_name_ sometimes returns an empty string.
>  
> {code:java}
> df.select(input_file_name()).show(5,false)
> {code}
>  
> {code:java}
> +-----------------+
> |input_file_name()|
> +-----------------+
> |                 |
> |                 |
> |                 |
> |                 |
> |                 |
> +-----------------+
> {code}
> My environment is databricks and debugging the Log4j output showed me that 
> the issue occurred when the files are being listed in parallel, e.g. when 
> {code:java}
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 127; threshold: 32
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under:{code}
>  
> Everything's fine as long as
> {code:java}
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 6; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> {code}
>  
> Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  
> resolves the issue for me.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (SPARK-27966) input_file_name empty when listing files in parallel

2019-06-06 Thread Christian Homberg (JIRA)
Christian Homberg created SPARK-27966:
-

 Summary: input_file_name empty when listing files in parallel
 Key: SPARK-27966
 URL: https://issues.apache.org/jira/browse/SPARK-27966
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.4.0
 Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)

 
Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
Workers: 3
Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
Reporter: Christian Homberg


I ran into an issue similar to, and probably related to, SPARK-26128: 
`org.apache.spark.sql.functions.input_file_name` sometimes returns an empty string.

My environment is databricks and debugging Log4j output showed me that the 
issue occurred when the files are being listed in parallel, e.g. when 

19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
directories. Size of Paths: 127; threshold: 32

 

This is not an issue when listing less than 32 files. Alternatively setting 
spark.sql.sources.parallelPartitionDiscovery.threshold to  resolves the 
issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27943) Implement default constraint with Column for Hive table

2019-06-06 Thread jiaan.geng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-27943:
---
Description: 
 
 *Background*

Default constraints on columns are part of the ANSI SQL standard.

Hive 3.0+ already supports default constraints 
(ref: https://issues.apache.org/jira/browse/HIVE-18726).

But Spark SQL does not implement this feature yet.

*Design*

Hive is widely used in production environments and is the de facto standard in 
the big data field.

But many different Hive versions are used in production, and the features 
differ between versions.

Spark SQL needs to implement default constraints, and there are three points to 
pay attention to in the design:

_First_, Spark SQL should reduce coupling with Hive.

_Second_, default constraints should be compatible with different versions of Hive.

_Third_, which expressions should a default constraint support? I think Spark 
SQL should support `literal`, `current_date()`, and `current_timestamp()`. Maybe 
other expressions should also be supported, like `Cast(1 as float)`, `1 + 2`, 
and so on.

We want to save the default-constraint metadata into the Hive table properties, 
and then restore it from the properties after the client fetches the newest 
metadata. The implementation is the same as for other metadata (e.g. 
partitions, buckets, statistics).

Because a default constraint is part of a column, I think we could reuse the 
metadata of StructField. The default constraint will be cached in the 
StructField metadata.

 

*Tasks*

This is a big piece of work, so I want to split it into sub-tasks, as 
follows:

 

  was:
 
*Background*
Default constraints on columns are part of the ANSI SQL standard.

Hive 3.0+ already supports default constraints 
(ref: https://issues.apache.org/jira/browse/HIVE-18726).

But Spark SQL does not implement this feature yet.

*Design*

Hive is widely used in production environments and is the de facto standard in 
the big data field. But many different Hive versions are used in production, 
and the features differ between versions.

Spark SQL needs to implement default constraints, and there are two points to 
pay attention to in the design:

One is that Spark SQL should reduce coupling with Hive.

Another is that default constraints should be compatible with different versions of Hive.

We want to save the default-constraint metadata into the Hive table properties, 
and then restore it from the properties after the client fetches the newest 
metadata.

The implementation is the same as for other metadata (e.g. partitions, buckets, statistics).

Because a default constraint is part of a column, I think we could reuse the 
metadata of StructField. The default constraint will be cached in the 
StructField metadata.

*Tasks*

This is a big piece of work, so I want to split it into sub-tasks, as 
follows:

 


> Implement default constraint with Column for Hive table
> ---
>
> Key: SPARK-27943
> URL: https://issues.apache.org/jira/browse/SPARK-27943
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: jiaan.geng
>Priority: Major
>
>  
>  *Background*
> Default constraints on columns are part of the ANSI SQL standard.
> Hive 3.0+ already supports default constraints 
> (ref: https://issues.apache.org/jira/browse/HIVE-18726).
> But Spark SQL does not implement this feature yet.
> *Design*
> Hive is widely used in production environments and is the de facto standard 
> in the big data field.
> But many different Hive versions are used in production, and the features 
> differ between versions.
> Spark SQL needs to implement default constraints, and there are three points 
> to pay attention to in the design:
> _First_, Spark SQL should reduce coupling with Hive.
> _Second_, default constraints should be compatible with different versions of Hive.
> _Third_, which expressions should a default constraint support? I think Spark 
> SQL should support `literal`, `current_date()`, and `current_timestamp()`. 
> Maybe other expressions should also be supported, like `Cast(1 as float)`, 
> `1 + 2`, and so on.
> We want to save the default-constraint metadata into the Hive table 
> properties, and then restore it from the properties after the client fetches 
> the newest metadata. The implementation is the same as for other metadata 
> (e.g. partitions, buckets, statistics).
> Because a default constraint is part of a column, I think we could reuse the 
> metadata of StructField. The default constraint will be cached in the 
> StructField metadata.
>  
> *Tasks*
> This is a big piece of work, so I want to split it into sub-tasks, as 
> follows:
>  
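
For illustration, a minimal sketch of the StructField-metadata idea described 
above (the "default" metadata key is hypothetical, not an agreed-upon name):

{code:scala}
import org.apache.spark.sql.types._

// Attach a default-constraint expression to a column through StructField metadata.
val defaultMeta = new MetadataBuilder()
  .putString("default", "current_timestamp()")
  .build()

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("created_at", TimestampType, nullable = true, metadata = defaultMeta)
))

// Later, e.g. when resolving an INSERT that omits the column, the constraint can be read back:
// schema("created_at").metadata.getString("default")   // "current_timestamp()"
{code}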



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-06-06 Thread Edwin Biemond (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857534#comment-16857534
 ] 

Edwin Biemond commented on SPARK-27927:
---

[~hyukjin.kwon] do you have a clue why, in 2.4.x and 3.0, I have to close my 
SparkContext in PySpark, or else the driver pod will hang forever?

thanks Edwin

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
>
> When we run a simple PySpark script on Spark 2.4.3 or 3.0.0, the driver pod 
> hangs and never calls the shutdown hook.
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on Kubernetes, the driver and executor are just hanging. We 
> see the output of this Python script.
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext 
> master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple Python script with a print works fine on 2.4.3 and 3.0.0
>  * the same setup on 2.4.0
>  * 2.4.3 spark-submit with the above PySpark script
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27931) Accept 'on' and 'off' as input for boolean data type

2019-06-06 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27931:

Description: 
This ticket contains three things:
1. Accept 'on' and 'off' as input for boolean data type
{code:sql}
SELECT cast('no' as boolean) AS false;
SELECT cast('off' as boolean) AS false;
{code}
2. Accept unique prefixes thereof:
{code:sql}
SELECT cast('of' as boolean) AS false;
SELECT cast('fal' as boolean) AS false;
{code}
3. Trim the string when cast to boolean type
{code:sql}
SELECT cast('true   ' as boolean) AS true;
SELECT cast(' FALSE' as boolean) AS true;
{code}

More details:
[https://www.postgresql.org/docs/devel/datatype-boolean.html]
[https://github.com/postgres/postgres/blob/REL_12_BETA1/src/backend/utils/adt/bool.c#L25]
[https://github.com/postgres/postgres/commit/05a7db05826c5eb68173b6d7ef1553c19322ef48]

Other DBs:
http://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html
https://my.vertica.com/docs/5.0/HTML/Master/2983.htm
https://github.com/prestosql/presto/blob/b845cd66da3eb1fcece50efba83ea12bc40afbaa/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L108-L138

  was:
This ticket contains three things:
1. Accept 'on' and 'off' as input for boolean data type
Example:
{code:sql}
SELECT cast('no' as boolean) AS false;
SELECT cast('off' as boolean) AS false;
{code}
2. Accept unique prefixes thereof:
Example:
{code:sql}
SELECT cast('of' as boolean) AS false;
SELECT cast('fal' as boolean) AS false;
{code}
3. Trim the string when cast to boolean type
{code:sql}
SELECT cast('true   ' as boolean) AS true;
SELECT cast(' FALSE' as boolean) AS true;
{code}

More details:
[https://www.postgresql.org/docs/devel/datatype-boolean.html]
[https://github.com/postgres/postgres/blob/REL_12_BETA1/src/backend/utils/adt/bool.c#L25]
[https://github.com/postgres/postgres/commit/05a7db05826c5eb68173b6d7ef1553c19322ef48]

Other DBs:
http://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html
https://my.vertica.com/docs/5.0/HTML/Master/2983.htm
https://github.com/prestosql/presto/blob/b845cd66da3eb1fcece50efba83ea12bc40afbaa/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L108-L138


> Accept 'on' and 'off' as input for boolean data type
> 
>
> Key: SPARK-27931
> URL: https://issues.apache.org/jira/browse/SPARK-27931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> This ticket contains three things:
> 1. Accept 'on' and 'off' as input for boolean data type
> {code:sql}
> SELECT cast('no' as boolean) AS false;
> SELECT cast('off' as boolean) AS false;
> {code}
> 2. Accept unique prefixes thereof:
> {code:sql}
> SELECT cast('of' as boolean) AS false;
> SELECT cast('fal' as boolean) AS false;
> {code}
> 3. Trim the string when cast to boolean type
> {code:sql}
> SELECT cast('true   ' as boolean) AS true;
> SELECT cast(' FALSE' as boolean) AS true;
> {code}
> More details:
> [https://www.postgresql.org/docs/devel/datatype-boolean.html]
> [https://github.com/postgres/postgres/blob/REL_12_BETA1/src/backend/utils/adt/bool.c#L25]
> [https://github.com/postgres/postgres/commit/05a7db05826c5eb68173b6d7ef1553c19322ef48]
> Other DBs:
> http://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html
> https://my.vertica.com/docs/5.0/HTML/Master/2983.htm
> https://github.com/prestosql/presto/blob/b845cd66da3eb1fcece50efba83ea12bc40afbaa/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L108-L138



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27931) Accept 'on' and 'off' as input for boolean data type

2019-06-06 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27931:

Description: 
This ticket contains three things:
1. Accept 'on' and 'off' as input for boolean data type
Example:
{code:sql}
SELECT cast('no' as boolean) AS false;
SELECT cast('off' as boolean) AS false;
{code}
2. Accept unique prefixes thereof:
Example:
{code:sql}
SELECT cast('of' as boolean) AS false;
SELECT cast('fal' as boolean) AS false;
{code}
3. Trim the string when cast to boolean type
{code:sql}
SELECT cast('true   ' as boolean) AS true;
SELECT cast(' FALSE' as boolean) AS true;
{code}

More details:
[https://www.postgresql.org/docs/devel/datatype-boolean.html]
[https://github.com/postgres/postgres/blob/REL_12_BETA1/src/backend/utils/adt/bool.c#L25]
[https://github.com/postgres/postgres/commit/05a7db05826c5eb68173b6d7ef1553c19322ef48]

Other DBs:
http://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html
https://my.vertica.com/docs/5.0/HTML/Master/2983.htm
https://github.com/prestosql/presto/blob/b845cd66da3eb1fcece50efba83ea12bc40afbaa/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L108-L138

  was:
This ticket contains two things:
1. Accept 'on' and 'off' as input for boolean data type
Example:
{code:sql}
SELECT cast('no' as boolean) AS false;
SELECT cast('off' as boolean) AS false;
{code}
2. Accept unique prefixes thereof:
Example:
{code:sql}
SELECT cast('of' as boolean) AS false;
SELECT cast('fal' as boolean) AS false;
{code}
3. Trim the string when cast to boolean type
{code:sql}
SELECT cast('true   ' as boolean) AS true;
SELECT cast(' FALSE' as boolean) AS true;
{code}

More details:
[https://www.postgresql.org/docs/devel/datatype-boolean.html]
[https://github.com/postgres/postgres/blob/REL_12_BETA1/src/backend/utils/adt/bool.c#L25]
[https://github.com/postgres/postgres/commit/05a7db05826c5eb68173b6d7ef1553c19322ef48]

Other DBs:
http://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html
https://my.vertica.com/docs/5.0/HTML/Master/2983.htm
https://github.com/prestosql/presto/blob/b845cd66da3eb1fcece50efba83ea12bc40afbaa/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L108-L138


> Accept 'on' and 'off' as input for boolean data type
> 
>
> Key: SPARK-27931
> URL: https://issues.apache.org/jira/browse/SPARK-27931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> This ticket contains three things:
> 1. Accept 'on' and 'off' as input for boolean data type
> Example:
> {code:sql}
> SELECT cast('no' as boolean) AS false;
> SELECT cast('off' as boolean) AS false;
> {code}
> 2. Accept unique prefixes thereof:
> Example:
> {code:sql}
> SELECT cast('of' as boolean) AS false;
> SELECT cast('fal' as boolean) AS false;
> {code}
> 3. Trim the string when cast to boolean type
> {code:sql}
> SELECT cast('true   ' as boolean) AS true;
> SELECT cast(' FALSE' as boolean) AS true;
> {code}
> More details:
> [https://www.postgresql.org/docs/devel/datatype-boolean.html]
> [https://github.com/postgres/postgres/blob/REL_12_BETA1/src/backend/utils/adt/bool.c#L25]
> [https://github.com/postgres/postgres/commit/05a7db05826c5eb68173b6d7ef1553c19322ef48]
> Other DBs:
> http://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html
> https://my.vertica.com/docs/5.0/HTML/Master/2983.htm
> https://github.com/prestosql/presto/blob/b845cd66da3eb1fcece50efba83ea12bc40afbaa/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L108-L138



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org