[jira] [Commented] (SPARK-26741) Analyzer incorrectly resolves aggregate function outside of Aggregate operators
[ https://issues.apache.org/jira/browse/SPARK-26741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753310#comment-16753310 ] Kris Mok commented on SPARK-26741: -- Note that there's a similar issue with non-aggregate functions. Here's an example: {code:none} spark.sql("create table foo (id int, blob binary)") val df = spark.sql("select length(blob) from foo where id = 1 order by length(blob) limit 10") df.explain(true) {code} {code:none} == Parsed Logical Plan == 'GlobalLimit 10 +- 'LocalLimit 10 +- 'Sort ['length('blob) ASC NULLS FIRST], true +- 'Project [unresolvedalias('length('blob), None)] +- 'Filter ('id = 1) +- 'UnresolvedRelation `foo` == Analyzed Logical Plan == length(blob): int GlobalLimit 10 +- LocalLimit 10 +- Project [length(blob)#25] +- Sort [length(blob#24) ASC NULLS FIRST], true +- Project [length(blob#24) AS length(blob)#25, blob#24] +- Filter (id#23 = 1) +- SubqueryAlias `default`.`foo` +- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, blob#24] == Optimized Logical Plan == GlobalLimit 10 +- LocalLimit 10 +- Project [length(blob)#25] +- Sort [length(blob#24) ASC NULLS FIRST], true +- Project [length(blob#24) AS length(blob)#25, blob#24] +- Filter (isnotnull(id#23) && (id#23 = 1)) +- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, blob#24] == Physical Plan == TakeOrderedAndProject(limit=10, orderBy=[length(blob#24) ASC NULLS FIRST], output=[length(blob)#25]) +- *(1) Project [length(blob#24) AS length(blob)#25, blob#24] +- *(1) Filter (isnotnull(id#23) && (id#23 = 1)) +- Scan hive default.foo [blob#24, id#23], HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, blob#24] {code} Note how the {{Sort}} operator computes {{length()}} again, even though the projection right below it already produces that value. The root cause in the Analyzer is the same as for the main example in this ticket, although this example is not as harmful (at least it still runs...) > Analyzer incorrectly resolves aggregate function outside of Aggregate > operators > --- > > Key: SPARK-26741 > URL: https://issues.apache.org/jira/browse/SPARK-26741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kris Mok >Priority: Major > > The analyzer can sometimes hit issues with resolving functions. e.g. > {code:sql} > select max(id) > from range(10) > group by id > having count(1) >= 1 > order by max(id) > {code} > The analyzed plan of this query is: > {code:none} > == Analyzed Logical Plan == > max(id): bigint > Project [max(id)#91L] > +- Sort [max(id#88L) ASC NULLS FIRST], true >+- Project [max(id)#91L, id#88L] > +- Filter (count(1)#93L >= cast(1 as bigint)) > +- Aggregate [id#88L], [max(id#88L) AS max(id)#91L, count(1) AS > count(1)#93L, id#88L] > +- Range (0, 10, step=1, splits=None) > {code} > Note how an aggregate function is outside of {{Aggregate}} operators in the > fully analyzed plan: > {{Sort [max(id#88L) ASC NULLS FIRST], true}}, which makes the plan invalid. 
> Trying to run this query will lead to weird issues in codegen, but the root > cause is in the analyzer: > {code:none} > java.lang.UnsupportedOperationException: Cannot generate code for expression: > max(input[1, bigint, false]) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290) > at > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87) > at > org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138) > at scala.Option.getOrElse(Option.scala:138) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.$anonfun$createOrderKeys$1(GenerateOrdering.scala:82) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:237) > at scala.collection.TraversableLike.map$(TraversableLike.scala:230) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at >
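As an aside on the {{length()}} example above: one way to sidestep the duplicated evaluation is to alias the computed column in the SELECT list and sort by the alias, so the {{Sort}} binds to the projected attribute instead of resolving the function a second time. A minimal workaround sketch (assuming the {{foo}} table from the comment above; it does not fix the underlying analyzer bug):

{code:scala}
// Name the computed column and order by that name; the ORDER BY should then
// reference the projected attribute blob_len rather than a second length(blob).
val df = spark.sql(
  """select length(blob) as blob_len
    |from foo
    |where id = 1
    |order by blob_len
    |limit 10""".stripMargin)
df.explain(true)
{code}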
[jira] [Resolved] (SPARK-26735) Verify plan integrity for special expressions
[ https://issues.apache.org/jira/browse/SPARK-26735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-26735. - Resolution: Fixed Assignee: Kris Mok Fix Version/s: 3.0.0 > Verify plan integrity for special expressions > - > > Key: SPARK-26735 > URL: https://issues.apache.org/jira/browse/SPARK-26735 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kris Mok >Assignee: Kris Mok >Priority: Major > Fix For: 3.0.0 > > > Add verification of plan integrity with regards to special expressions being > hosted only in supported operators. Specifically: > * AggregateExpression: should only be hosted in Aggregate, or indirectly in > Window > * WindowExpression: should only be hosted in Window > * Generator: should only be hosted in Generate -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
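As a rough illustration of the kind of integrity check this ticket asks for (a sketch only; the real rule lives inside Spark's analyzer and differs in detail), one could walk a logical plan and collect aggregate expressions hosted outside the supported operators:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Window}

// Sketch of a plan-integrity predicate: gather AggregateExpressions that appear
// in any operator other than Aggregate or Window. The WindowExpression and
// Generator checks would follow the same shape.
def aggregatesInUnsupportedOperators(plan: LogicalPlan): Seq[AggregateExpression] =
  plan.collect {
    case _: Aggregate | _: Window => Seq.empty[AggregateExpression]
    case other =>
      other.expressions.flatMap(_.collect { case ae: AggregateExpression => ae })
  }.flatten
{code}

Applied to the analyzed plan in SPARK-26741, such a check would surface the {{max(id#88L)}} hosted by the {{Sort}} operator.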
[jira] [Created] (SPARK-26744) Create API supportDataType in File Source V2 framework
Gengliang Wang created SPARK-26744: -- Summary: Create API supportDataType in File Source V2 framework Key: SPARK-26744 URL: https://issues.apache.org/jira/browse/SPARK-26744 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
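The ticket gives no design details, so purely as a hypothetical sketch of what a per-format {{supportDataType}} hook could look like (the trait name and shape here are assumptions, not the actual Spark API):

{code:scala}
import org.apache.spark.sql.types._

// Hypothetical trait a File Source V2 format could implement -- NOT the real API.
trait SupportsDataType {
  /** Whether this format can read/write values of the given type. */
  def supportDataType(dataType: DataType): Boolean
}

// Example: a format that handles a few atomic types plus nested arrays/structs.
object ExampleFileFormat extends SupportsDataType {
  override def supportDataType(dataType: DataType): Boolean = dataType match {
    case StringType | IntegerType | LongType | DoubleType | BooleanType => true
    case ArrayType(elementType, _) => supportDataType(elementType)
    case StructType(fields) => fields.forall(f => supportDataType(f.dataType))
    case _ => false
  }
}
{code}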
[jira] [Assigned] (SPARK-26743) Add a test to check the actual resource limit set via 'spark.executor.pyspark.memory'
[ https://issues.apache.org/jira/browse/SPARK-26743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26743: Assignee: (was: Apache Spark) > Add a test to check the actual resource limit set via > 'spark.executor.pyspark.memory' > - > > Key: SPARK-26743 > URL: https://issues.apache.org/jira/browse/SPARK-26743 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Looks the test that checks the actual resource limit set (by > 'spark.executor.pyspark.memory') is missing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26743) Add a test to check the actual resource limit set via 'spark.executor.pyspark.memory'
[ https://issues.apache.org/jira/browse/SPARK-26743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26743: Assignee: Apache Spark > Add a test to check the actual resource limit set via > 'spark.executor.pyspark.memory' > - > > Key: SPARK-26743 > URL: https://issues.apache.org/jira/browse/SPARK-26743 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Looks the test that checks the actual resource limit set (by > 'spark.executor.pyspark.memory') is missing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26743) Add a test to check the actual resource limit set via 'spark.executor.pyspark.memory'
[ https://issues.apache.org/jira/browse/SPARK-26743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26743: - Description: Looks the test that checks the actual resource limit set (by 'spark.executor.pyspark.memory') is missing. (was: Looks the test that checks the actual resource limit set (by 'spark.executor.pyspark.memory'0 is missing.) > Add a test to check the actual resource limit set via > 'spark.executor.pyspark.memory' > - > > Key: SPARK-26743 > URL: https://issues.apache.org/jira/browse/SPARK-26743 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Looks the test that checks the actual resource limit set (by > 'spark.executor.pyspark.memory') is missing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26080) Unable to run worker.py on Windows
[ https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26080: - Labels: release-notes (was: ) > Unable to run worker.py on Windows > -- > > Key: SPARK-26080 > URL: https://issues.apache.org/jira/browse/SPARK-26080 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Windows 10 Education 64 bit >Reporter: Hayden Jeune >Assignee: Hyukjin Kwon >Priority: Blocker > Labels: release-notes > Fix For: 2.4.1, 3.0.0 > > > Use of the resource module in python means worker.py cannot run on a windows > system. This package is only available in unix based environments. > [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25] > {code:python} > textFile = sc.textFile("README.md") > textFile.first() > {code} > When the above commands are run I receive the error 'worker failed to connect > back', and I can see an exception in the console coming from worker.py saying > 'ModuleNotFoundError: No module named resource' > I do not really know enough about what I'm doing to fix this myself. > Apologies if there's something simple I'm missing here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26743) Add a test to check the actual resource limit set via 'spark.executor.pyspark.memory'
Hyukjin Kwon created SPARK-26743: Summary: Add a test to check the actual resource limit set via 'spark.executor.pyspark.memory' Key: SPARK-26743 URL: https://issues.apache.org/jira/browse/SPARK-26743 Project: Spark Issue Type: Test Components: PySpark Affects Versions: 3.0.0 Reporter: Hyukjin Kwon Looks the test that checks the actual resource limit set (by 'spark.executor.pyspark.memory'0 is missing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25981) Arrow optimization for conversion from R DataFrame to Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-25981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25981. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22954 [https://github.com/apache/spark/pull/22954] > Arrow optimization for conversion from R DataFrame to Spark DataFrame > - > > Key: SPARK-25981 > URL: https://issues.apache.org/jira/browse/SPARK-25981 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > PySpark introduced an optimization for toPandas and createDataFrame with > Pandas DataFrame. > This was leveraged by PyArrow API. > R Arrow API is under developement > (https://github.com/apache/arrow/tree/master/r) and about to be released via > CRAN (https://issues.apache.org/jira/browse/ARROW-3204). > Once it's released, we can reuse PySpark's Arrow optimization code path and > leverage it with minimised codes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26630) Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce
[ https://issues.apache.org/jira/browse/SPARK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753248#comment-16753248 ] Dongjoon Hyun commented on SPARK-26630: --- Thank you, [~Deegue] and [~smilegator]. > Support reading Hive-serde tables whose INPUTFORMAT is > org.apache.hadoop.mapreduce > -- > > Key: SPARK-26630 > URL: https://issues.apache.org/jira/browse/SPARK-26630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Deegue >Assignee: Deegue >Priority: Major > Fix For: 3.0.0 > > > This bug found in [link title|https://github.com/apache/spark/pull/23506] (PR > #23506). > It will throw ClassCastException when we use new input format (eg. > `org.apache.hadoop.mapreduce.InputFormat`) to create HadoopRDD.So we need to > use NewHadoopRDD to deal with this input format in TableReader.scala. > Exception : > {noformat} > Caused by: java.lang.ClassCastException: > org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to > org.apache.hadoop.mapred.InputFormat > at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at org.apache.spark.ShuffleDependency.(Dependency.scala:96) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137) > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) > ... 87 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
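For context, the idea behind the fix can be outlined as follows (a sketch of the approach, not the actual TableReader change): dispatch on which Hadoop InputFormat API the configured class implements, and build the matching RDD type instead of unconditionally casting to the old {{org.apache.hadoop.mapred}} API.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch: use NewHadoopRDD (via newAPIHadoopFile) for new-API input formats and
// plain HadoopRDD (via hadoopFile) otherwise, avoiding the ClassCastException.
def rddForInputFormat(
    sc: SparkContext,
    conf: Configuration,
    inputFormatClass: Class[_],
    path: String): RDD[(LongWritable, Text)] = {
  if (classOf[org.apache.hadoop.mapreduce.InputFormat[_, _]]
      .isAssignableFrom(inputFormatClass)) {
    sc.newAPIHadoopFile(
      path,
      inputFormatClass
        .asInstanceOf[Class[org.apache.hadoop.mapreduce.lib.input.TextInputFormat]],
      classOf[LongWritable], classOf[Text], conf)
  } else {
    sc.hadoopFile(
      path,
      inputFormatClass.asInstanceOf[Class[org.apache.hadoop.mapred.TextInputFormat]],
      classOf[LongWritable], classOf[Text])
  }
}
{code}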
[jira] [Assigned] (SPARK-25981) Arrow optimization for conversion from R DataFrame to Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-25981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-25981: Assignee: Hyukjin Kwon > Arrow optimization for conversion from R DataFrame to Spark DataFrame > - > > Key: SPARK-25981 > URL: https://issues.apache.org/jira/browse/SPARK-25981 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > PySpark introduced an optimization for toPandas and createDataFrame with > Pandas DataFrame. > This was leveraged by PyArrow API. > R Arrow API is under developement > (https://github.com/apache/arrow/tree/master/r) and about to be released via > CRAN (https://issues.apache.org/jira/browse/ARROW-3204). > Once it's released, we can reuse PySpark's Arrow optimization code path and > leverage it with minimised codes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26663) Cannot query a Hive table with subdirectories
[ https://issues.apache.org/jira/browse/SPARK-26663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753218#comment-16753218 ] Dongjoon Hyun commented on SPARK-26663: --- I'm closing this issue as `Cannot Reproduce`. It may depend on your environment. > Cannot query a Hive table with subdirectories > - > > Key: SPARK-26663 > URL: https://issues.apache.org/jira/browse/SPARK-26663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Aäron >Priority: Major > > Hello, > > I want to report the following issue (my first one :) ) > When I create a table in Hive based on a union all then Spark 2.4 is unable > to query this table. > To reproduce: > *Hive 1.2.1* > {code:java} > hive> creat table a(id int); > insert into a values(1); > hive> creat table b(id int); > insert into b values(2); > hive> create table c(id int) as select id from a union all select id from b; > {code} > > *Spark 2.3.1* > > {code:java} > scala> spark.table("c").show > +---+ > | id| > +---+ > | 1| > | 2| > +---+ > scala> spark.table("c").count > res5: Long = 2 > {code} > > *Spark 2.4.0* > {code:java} > scala> spark.table("c").show > 19/01/18 17:00:49 WARN HiveMetastoreCatalog: Unable to infer schema for table > perftest_be.c from file format ORC (inference mode: INFER_AND_SAVE). Using > metastore schema. > +---+ > | id| > +---+ > +---+ > scala> spark.table("c").count > res3: Long = 0 > {code} > I did not find an existing issue for this. Might be important to investigate. > > +Extra info:+ Spark 2.3.1 and 2.4.0 use the same spark-defaults.conf. > > Kind regards. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26663) Cannot query a Hive table with subdirectories
[ https://issues.apache.org/jira/browse/SPARK-26663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-26663. --- Resolution: Cannot Reproduce > Cannot query a Hive table with subdirectories > - > > Key: SPARK-26663 > URL: https://issues.apache.org/jira/browse/SPARK-26663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Aäron >Priority: Major > > Hello, > > I want to report the following issue (my first one :) ) > When I create a table in Hive based on a union all then Spark 2.4 is unable > to query this table. > To reproduce: > *Hive 1.2.1* > {code:java} > hive> creat table a(id int); > insert into a values(1); > hive> creat table b(id int); > insert into b values(2); > hive> create table c(id int) as select id from a union all select id from b; > {code} > > *Spark 2.3.1* > > {code:java} > scala> spark.table("c").show > +---+ > | id| > +---+ > | 1| > | 2| > +---+ > scala> spark.table("c").count > res5: Long = 2 > {code} > > *Spark 2.4.0* > {code:java} > scala> spark.table("c").show > 19/01/18 17:00:49 WARN HiveMetastoreCatalog: Unable to infer schema for table > perftest_be.c from file format ORC (inference mode: INFER_AND_SAVE). Using > metastore schema. > +---+ > | id| > +---+ > +---+ > scala> spark.table("c").count > res3: Long = 0 > {code} > I did not find an existing issue for this. Might be important to investigate. > > +Extra info:+ Spark 2.3.1 and 2.4.0 use the same spark-defaults.conf. > > Kind regards. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26663) Cannot query a Hive table with subdirectories
[ https://issues.apache.org/jira/browse/SPARK-26663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753216#comment-16753216 ] Dongjoon Hyun commented on SPARK-26663: --- Hi, [~pomptuintje]. Thank you for reporting. Actually, the given example has incorrect syntax like `creat table`. It would be better if you report the exact script you used. I did the following but cannot reproduce the issue. {code:java} Logging initialized using configuration in jar:file:/Users/dongjoon/APACHE/hive-release/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties hive> create table a(id int); OK Time taken: 1.299 seconds hive> create table b(id int); OK Time taken: 0.046 seconds hive> insert into a values(1); Query ID = dongjoon_20190126145804_2c0252e1-d07c-4213-a387-90efe26d450b Total jobs = 3 Launching Job 1 out of 3 Number of reduce tasks is set to 0 since there's no reduce operator Job running in-process (local Hadoop) 2019-01-26 14:58:06,272 Stage-1 map = 100%, reduce = 0% Ended Job = job_local2005651311_0001 Stage-4 is selected by condition resolver. Stage-3 is filtered out by condition resolver. Stage-5 is filtered out by condition resolver. Moving data to: file:/user/hive/warehouse/a/.hive-staging_hive_2019-01-26_14-58-04_030_4426868381325183205-1/-ext-1 Loading data to table default.a Table default.a stats: [numFiles=1, numRows=1, totalSize=2, rawDataSize=1] MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK Time taken: 2.436 seconds hive> insert into b values(1); Query ID = dongjoon_20190126145810_034d9c36-0f23-42a6-ac0a-681839335bd6 Total jobs = 3 Launching Job 1 out of 3 Number of reduce tasks is set to 0 since there's no reduce operator Job running in-process (local Hadoop) 2019-01-26 14:58:11,941 Stage-1 map = 100%, reduce = 0% Ended Job = job_local966105199_0002 Stage-4 is selected by condition resolver. Stage-3 is filtered out by condition resolver. Stage-5 is filtered out by condition resolver. Moving data to: file:/user/hive/warehouse/b/.hive-staging_hive_2019-01-26_14-58-10_554_693159949912597124-1/-ext-1 Loading data to table default.b Table default.b stats: [numFiles=1, numRows=1, totalSize=2, rawDataSize=1] MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK Time taken: 1.551 seconds hive> create table c as select id from a union all select id from b; Query ID = dongjoon_20190126145831_c2b31651-c88b-47ab-9081-2375cf064b15 Total jobs = 3 Launching Job 1 out of 3 Number of reduce tasks is set to 0 since there's no reduce operator Job running in-process (local Hadoop) 2019-01-26 14:58:33,130 Stage-1 map = 100%, reduce = 0% Ended Job = job_local1725928125_0003 Stage-4 is filtered out by condition resolver. Stage-3 is selected by condition resolver. Stage-5 is filtered out by condition resolver. 
Launching Job 3 out of 3 Number of reduce tasks is set to 0 since there's no reduce operator Job running in-process (local Hadoop) 2019-01-26 14:58:34,449 Stage-3 map = 100%, reduce = 0% Ended Job = job_local1246940820_0004 Moving data to: file:/user/hive/warehouse/c Table default.c stats: [numFiles=1, numRows=2, totalSize=4, rawDataSize=2] MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS Stage-Stage-3: HDFS Read: 0 HDFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK Time taken: 2.806 seconds {code} {code:java} 19/01/26 14:58:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context available as 'sc' (master = local[*], app id = local-1548543527800). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.0 /_/ Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201) Type in expressions to have them evaluated. Type :help for more information. scala> sql("select * from c").show 19/01/26 14:58:57 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException +---+ | id| +---+ | 1| | 1| +---+ {code} > Cannot query a Hive table with subdirectories > - > > Key: SPARK-26663 > URL: https://issues.apache.org/jira/browse/SPARK-26663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Aäron >Priority: Major > > Hello, > > I want to report the following issue (my first one :) ) > When I create a table in Hive based on a union all then Spark 2.4 is unable > to query this table. > To reproduce:
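One more diagnostic idea for this report, since the title points at subdirectories (Hive's CTAS over UNION ALL can write its output into subdirectories under the table location on some configurations). Whether this helps is environment-dependent; it is an assumption for diagnosis, not a confirmed fix:

{code:scala}
// Diagnostic sketch (assumption, not a confirmed fix): make the Hadoop input
// layer descend into subdirectories under the table location, then re-query.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
spark.table("c").show()
spark.table("c").count()
{code}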
[jira] [Commented] (SPARK-26675) Error happened during creating avro files
[ https://issues.apache.org/jira/browse/SPARK-26675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753213#comment-16753213 ] Dongjoon Hyun commented on SPARK-26675: --- [~tony0918]. Do you have a problem in Spark Scala Shell environment, too? > Error happened during creating avro files > - > > Key: SPARK-26675 > URL: https://issues.apache.org/jira/browse/SPARK-26675 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Tony Mao >Priority: Major > > Run cmd > {code:java} > spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 > /nke/reformat.py > {code} > code in reformat.py > {code:java} > df = spark.read.option("multiline", "true").json("file:///nke/example1.json") > df.createOrReplaceTempView("traffic") > a = spark.sql("""SELECT store.*, store.name as store_name, > store.dataSupplierId as store_dataSupplierId, trafficSensor.*, > trafficSensor.name as trafficSensor_name, trafficSensor.dataSupplierId as > trafficSensor_dataSupplierId, readings.* > FROM (SELECT explode(stores) as store, explode(store.trafficSensors) as > trafficSensor, > explode(trafficSensor.trafficSensorReadings) as readings FROM traffic)""") > b = a.drop("trafficSensors", "trafficSensorReadings", "name", > "dataSupplierId") > b.write.format("avro").save("file:///nke/curated/namesAndFavColors.avro") > {code} > Error message: > {code:java} > Traceback (most recent call last): > File "/nke/reformat.py", line 18, in > b.select("store_name", > "store_dataSupplierId").write.format("avro").save("file:///nke/curated/namesAndFavColors.avro") > File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/readwriter.py", > line 736, in save > File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, > in deco > File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line > 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o45.save. 
> : java.lang.NoSuchMethodError: > org.apache.avro.Schema.createUnion([Lorg/apache/avro/Schema;)Lorg/apache/avro/Schema; > at > org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:185) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:176) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:174) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99) > at > org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:174) > at > org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118) > at > org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.avro.AvroFileFormat.prepareWrite(AvroFileFormat.scala:118) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at >
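For what it's worth, a {{NoSuchMethodError}} on {{org.apache.avro.Schema.createUnion([Lorg/apache/avro/Schema;)}} typically indicates that an older Avro jar (1.7.x, which lacks the varargs {{createUnion}} overload added in 1.8) is shadowing the Avro 1.8.x that spark-avro 2.4.0 compiles against. A quick classpath check from the Scala shell:

{code:scala}
// Print which jar the Avro Schema class was actually loaded from; if this points
// at an avro-1.7.x jar (e.g. one bundled by Hadoop/Hive), that is the conflict.
val avroSource = classOf[org.apache.avro.Schema]
  .getProtectionDomain.getCodeSource.getLocation
println(s"org.apache.avro.Schema loaded from: $avroSource")
{code}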
[jira] [Commented] (SPARK-26675) Error happened during creating avro files
[ https://issues.apache.org/jira/browse/SPARK-26675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753211#comment-16753211 ] Dongjoon Hyun commented on SPARK-26675: --- cc [~Gengliang.Wang] > Error happened during creating avro files > - > > Key: SPARK-26675 > URL: https://issues.apache.org/jira/browse/SPARK-26675 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Tony Mao >Priority: Major > > Run cmd > {code:java} > spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 > /nke/reformat.py > {code} > code in reformat.py > {code:java} > df = spark.read.option("multiline", "true").json("file:///nke/example1.json") > df.createOrReplaceTempView("traffic") > a = spark.sql("""SELECT store.*, store.name as store_name, > store.dataSupplierId as store_dataSupplierId, trafficSensor.*, > trafficSensor.name as trafficSensor_name, trafficSensor.dataSupplierId as > trafficSensor_dataSupplierId, readings.* > FROM (SELECT explode(stores) as store, explode(store.trafficSensors) as > trafficSensor, > explode(trafficSensor.trafficSensorReadings) as readings FROM traffic)""") > b = a.drop("trafficSensors", "trafficSensorReadings", "name", > "dataSupplierId") > b.write.format("avro").save("file:///nke/curated/namesAndFavColors.avro") > {code} > Error message: > {code:java} > Traceback (most recent call last): > File "/nke/reformat.py", line 18, in > b.select("store_name", > "store_dataSupplierId").write.format("avro").save("file:///nke/curated/namesAndFavColors.avro") > File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/readwriter.py", > line 736, in save > File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, > in deco > File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line > 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o45.save. 
> : java.lang.NoSuchMethodError: > org.apache.avro.Schema.createUnion([Lorg/apache/avro/Schema;)Lorg/apache/avro/Schema; > at > org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:185) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:176) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:174) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99) > at > org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:174) > at > org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118) > at > org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.avro.AvroFileFormat.prepareWrite(AvroFileFormat.scala:118) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at >
[jira] [Commented] (SPARK-26742) Bump Kubernetes Client Version to 4.1.1
[ https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753207#comment-16753207 ] Dongjoon Hyun commented on SPARK-26742: --- BTW, we are already using 4.1.0. {code} 4.1.0 {code} > Bump Kubernetes Client Version to 4.1.1 > --- > > Key: SPARK-26742 > URL: https://issues.apache.org/jira/browse/SPARK-26742 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: Steve Davids >Priority: Major > Labels: easyfix > > Spark 2.x is using Kubernetes Client 3.x which is pretty old, the master > branch has 4.0, the client should be upgraded to 4.1.1 to have the broadest > Kubernetes compatibility support for newer clusters: > https://github.com/fabric8io/kubernetes-client#compatibility-matrix -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26742) Bump Kubernetes Client Version to 4.1.1
[ https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753207#comment-16753207 ] Dongjoon Hyun edited comment on SPARK-26742 at 1/26/19 10:38 PM: - BTW, we are already using 4.1.0 for Spark 3.0.0. {code} 4.1.0 {code} was (Author: dongjoon): BTW, we are already using 4.1.0. {code} 4.1.0 {code} > Bump Kubernetes Client Version to 4.1.1 > --- > > Key: SPARK-26742 > URL: https://issues.apache.org/jira/browse/SPARK-26742 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: Steve Davids >Priority: Major > Labels: easyfix > > Spark 2.x is using Kubernetes Client 3.x which is pretty old, the master > branch has 4.0, the client should be upgraded to 4.1.1 to have the broadest > Kubernetes compatibility support for newer clusters: > https://github.com/fabric8io/kubernetes-client#compatibility-matrix -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26742) Bump Kubernetes Client Version to 4.1.1
[ https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753206#comment-16753206 ] Dongjoon Hyun commented on SPARK-26742: --- SPARK-26603 is tracking work on the testing environment to embrace newer K8S versions. > Bump Kubernetes Client Version to 4.1.1 > --- > > Key: SPARK-26742 > URL: https://issues.apache.org/jira/browse/SPARK-26742 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: Steve Davids >Priority: Major > Labels: easyfix > > Spark 2.x is using Kubernetes Client 3.x which is pretty old, the master > branch has 4.0, the client should be upgraded to 4.1.1 to have the broadest > Kubernetes compatibility support for newer clusters: > https://github.com/fabric8io/kubernetes-client#compatibility-matrix -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26742) Bump Kubernetes Client Version to 4.1.1
Steve Davids created SPARK-26742: Summary: Bump Kubernetes Client Version to 4.1.1 Key: SPARK-26742 URL: https://issues.apache.org/jira/browse/SPARK-26742 Project: Spark Issue Type: Dependency upgrade Components: Kubernetes Affects Versions: 2.4.0, 3.0.0 Reporter: Steve Davids Spark 2.x is using Kubernetes Client 3.x which is pretty old, the master branch has 4.0, the client should be upgraded to 4.1.1 to have the broadest Kubernetes compatibility support for newer clusters: https://github.com/fabric8io/kubernetes-client#compatibility-matrix -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26741) Analyzer incorrectly resolves aggregate function outside of Aggregate operators
Kris Mok created SPARK-26741: Summary: Analyzer incorrectly resolves aggregate function outside of Aggregate operators Key: SPARK-26741 URL: https://issues.apache.org/jira/browse/SPARK-26741 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Kris Mok The analyzer can sometimes hit issues with resolving functions. e.g. {code:sql} select max(id) from range(10) group by id having count(1) >= 1 order by max(id) {code} The analyzed plan of this query is: {code:none} == Analyzed Logical Plan == max(id): bigint Project [max(id)#91L] +- Sort [max(id#88L) ASC NULLS FIRST], true +- Project [max(id)#91L, id#88L] +- Filter (count(1)#93L >= cast(1 as bigint)) +- Aggregate [id#88L], [max(id#88L) AS max(id)#91L, count(1) AS count(1)#93L, id#88L] +- Range (0, 10, step=1, splits=None) {code} Note how an aggregate function is outside of {{Aggregate}} operators in the fully analyzed plan: {{Sort [max(id#88L) ASC NULLS FIRST], true}}, which makes the plan invalid. Trying to run this query will lead to weird issues in codegen, but the root cause is in the analyzer: {code:none} java.lang.UnsupportedOperationException: Cannot generate code for expression: max(input[1, bigint, false]) at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291) at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290) at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138) at scala.Option.getOrElse(Option.scala:138) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.$anonfun$createOrderKeys$1(GenerateOrdering.scala:82) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.map(TraversableLike.scala:237) at scala.collection.TraversableLike.map$(TraversableLike.scala:230) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.createOrderKeys(GenerateOrdering.scala:82) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.genComparisons(GenerateOrdering.scala:91) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:152) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:44) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1194) at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.(GenerateOrdering.scala:195) at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.(GenerateOrdering.scala:192) at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:153) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3302) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2470) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3291) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3287) at org.apache.spark.sql.Dataset.head(Dataset.scala:2470) at org.apache.spark.sql.Dataset.take(Dataset.scala:2684) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:262) at org.apache.spark.sql.Dataset.showString(Dataset.scala:299) at org.apache.spark.sql.Dataset.show(Dataset.scala:753) at org.apache.spark.sql.Dataset.show(Dataset.scala:712) at org.apache.spark.sql.Dataset.show(Dataset.scala:721) {code} The test case {{SPARK-23957 Remove redundant sort from subquery plan(scalar subquery)}} in {{SubquerySuite}} has been disabled because of hitting this issue, caught by SPARK-26735. We should re-enable that test once this bug is fixed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26740) Statistics for date and timestamp columns depend on system time zone
[ https://issues.apache.org/jira/browse/SPARK-26740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26740: Assignee: Apache Spark > Statistics for date and timestamp columns depend on system time zone > > > Key: SPARK-26740 > URL: https://issues.apache.org/jira/browse/SPARK-26740 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > While saving statistics for timestamp/date columns, default time zone is used > in conversion of internal type (microseconds or days since epoch) to textual > representation. The textual representation doesn't contain time zone. So, > when it is converted back to internal types (Long for TimestampType or > DateType), the Timestamp.valueOf and Date.valueOf are used in conversions. > The methods use current system time zone. > If system time zone is different while saving and retrieving statistics for > timestamp/date columns, restored microseconds/days since epoch will be > different. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26740) Statistics for date and timestamp columns depend on system time zone
[ https://issues.apache.org/jira/browse/SPARK-26740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26740: Assignee: (was: Apache Spark) > Statistics for date and timestamp columns depend on system time zone > > > Key: SPARK-26740 > URL: https://issues.apache.org/jira/browse/SPARK-26740 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > While saving statistics for timestamp/date columns, default time zone is used > in conversion of internal type (microseconds or days since epoch) to textual > representation. The textual representation doesn't contain time zone. So, > when it is converted back to internal types (Long for TimestampType or > DateType), the Timestamp.valueOf and Date.valueOf are used in conversions. > The methods use current system time zone. > If system time zone is different while saving and retrieving statistics for > timestamp/date columns, restored microseconds/days since epoch will be > different. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26740) Statistics for date and timestamp columns depend on system time zone
Maxim Gekk created SPARK-26740: -- Summary: Statistics for date and timestamp columns depend on system time zone Key: SPARK-26740 URL: https://issues.apache.org/jira/browse/SPARK-26740 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk While saving statistics for timestamp/date columns, default time zone is used in conversion of internal type (microseconds or days since epoch) to textual representation. The textual representation doesn't contain time zone. So, when it is converted back to internal types (Long for TimestampType or DateType), the Timestamp.valueOf and Date.valueOf are used in conversions. The methods use current system time zone. If system time zone is different while saving and retrieving statistics for timestamp/date columns, restored microseconds/days since epoch will be different. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
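A minimal JVM-only demonstration of the dependence described above (assuming nothing beyond {{java.sql.Timestamp}}): the same textual statistic maps to different epoch values depending on the default time zone in effect when it is parsed back.

{code:scala}
import java.sql.Timestamp
import java.util.TimeZone

// The statistics text carries no zone, so Timestamp.valueOf interprets it in
// the JVM's current default time zone -- write and read under different zones
// and the restored epoch value drifts.
val text = "2019-01-26 00:00:00"

TimeZone.setDefault(TimeZone.getTimeZone("UTC"))
val savedMillis = Timestamp.valueOf(text).getTime      // parsed as UTC

TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
val restoredMillis = Timestamp.valueOf(text).getTime   // parsed as PST

println(restoredMillis - savedMillis)  // 28800000 ms = 8 hours of drift
{code}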
[jira] [Commented] (SPARK-26630) Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce
[ https://issues.apache.org/jira/browse/SPARK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753146#comment-16753146 ] Xiao Li commented on SPARK-26630: - This is an improvement. Thanks! > Support reading Hive-serde tables whose INPUTFORMAT is > org.apache.hadoop.mapreduce > -- > > Key: SPARK-26630 > URL: https://issues.apache.org/jira/browse/SPARK-26630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Deegue >Assignee: Deegue >Priority: Major > Fix For: 3.0.0 > > > This bug found in [link title|https://github.com/apache/spark/pull/23506] (PR > #23506). > It will throw ClassCastException when we use new input format (eg. > `org.apache.hadoop.mapreduce.InputFormat`) to create HadoopRDD.So we need to > use NewHadoopRDD to deal with this input format in TableReader.scala. > Exception : > {noformat} > Caused by: java.lang.ClassCastException: > org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to > org.apache.hadoop.mapred.InputFormat > at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at org.apache.spark.ShuffleDependency.(Dependency.scala:96) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137) > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) > ... 87 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26630) Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce
[ https://issues.apache.org/jira/browse/SPARK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-26630. - Resolution: Fixed Fix Version/s: 3.0.0 > Support reading Hive-serde tables whose INPUTFORMAT is > org.apache.hadoop.mapreduce > -- > > Key: SPARK-26630 > URL: https://issues.apache.org/jira/browse/SPARK-26630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Deegue >Assignee: Deegue >Priority: Major > Fix For: 3.0.0 > > > This bug found in [link title|https://github.com/apache/spark/pull/23506] (PR > #23506). > It will throw ClassCastException when we use new input format (eg. > `org.apache.hadoop.mapreduce.InputFormat`) to create HadoopRDD.So we need to > use NewHadoopRDD to deal with this input format in TableReader.scala. > Exception : > {noformat} > Caused by: java.lang.ClassCastException: > org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to > org.apache.hadoop.mapred.InputFormat > at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at org.apache.spark.ShuffleDependency.(Dependency.scala:96) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137) > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) > ... 87 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26630) Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce
[ https://issues.apache.org/jira/browse/SPARK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-26630: --- Assignee: Deegue > Support reading Hive-serde tables whose INPUTFORMAT is > org.apache.hadoop.mapreduce > -- > > Key: SPARK-26630 > URL: https://issues.apache.org/jira/browse/SPARK-26630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Deegue >Assignee: Deegue >Priority: Major > > This bug found in [link title|https://github.com/apache/spark/pull/23506] (PR > #23506). > It will throw ClassCastException when we use new input format (eg. > `org.apache.hadoop.mapreduce.InputFormat`) to create HadoopRDD.So we need to > use NewHadoopRDD to deal with this input format in TableReader.scala. > Exception : > {noformat} > Caused by: java.lang.ClassCastException: > org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to > org.apache.hadoop.mapred.InputFormat > at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at org.apache.spark.ShuffleDependency.(Dependency.scala:96) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137) > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) > ... 87 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26630) Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce
[ https://issues.apache.org/jira/browse/SPARK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-26630: Summary: Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce (was: ClassCastException in TableReader while creating HadoopRDD) > Support reading Hive-serde tables whose INPUTFORMAT is > org.apache.hadoop.mapreduce > -- > > Key: SPARK-26630 > URL: https://issues.apache.org/jira/browse/SPARK-26630 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Deegue >Priority: Major > > This bug was found in [PR #23506|https://github.com/apache/spark/pull/23506]. > It will throw a ClassCastException when we use a new-style input format (e.g. > `org.apache.hadoop.mapreduce.InputFormat`) to create a HadoopRDD, so we need to > use NewHadoopRDD to deal with this input format in TableReader.scala. > Exception: > {noformat} > Caused by: java.lang.ClassCastException: > org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to > org.apache.hadoop.mapred.InputFormat > at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:96) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137) > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) > ... 87 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26630) Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce
[ https://issues.apache.org/jira/browse/SPARK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-26630: Issue Type: Improvement (was: Bug) > Support reading Hive-serde tables whose INPUTFORMAT is > org.apache.hadoop.mapreduce > -- > > Key: SPARK-26630 > URL: https://issues.apache.org/jira/browse/SPARK-26630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Deegue >Priority: Major > > This bug was found in [PR #23506|https://github.com/apache/spark/pull/23506]. > It will throw a ClassCastException when we use a new-style input format (e.g. > `org.apache.hadoop.mapreduce.InputFormat`) to create a HadoopRDD, so we need to > use NewHadoopRDD to deal with this input format in TableReader.scala. > Exception: > {noformat} > Caused by: java.lang.ClassCastException: > org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to > org.apache.hadoop.mapred.InputFormat > at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:96) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137) > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) > ... 87 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26739) Standardized Join Types for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Skyler Lehan updated SPARK-26739: - Description: h3. *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon. Currently, in the join functions on [DataFrames|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset], the join types are defined via a string parameter called joinType. In order for a developer to know which joins are possible, they must look up the API call for join. While this works fine, it can cause the developer to make a typo, resulting in improper joins and/or unexpected errors. The objective of this improvement would be to allow developers to use a common definition for join types (by enum or constants) called JoinTypes. This would contain the possible joins and remove the possibility of a typo. It would also allow Spark to alter the names of the joins in the future without impacting end-users. h3. *Q2.* What problem is this proposal NOT designed to solve? The problem this solves is extremely narrow; it would not solve anything other than providing a common definition for join types. h3. *Q3.* How is it done today, and what are the limits of current practice? Currently, developers must join two DataFrames like so: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer") {code} Here the developer manually types the join type as a string. As stated above, this: * Requires developers to manually type in the join type * Creates the possibility of typos * Restricts renaming of join types, as each is a literal string * Does not restrict or compile-time check the join type being used, leading to runtime errors h3. *Q4.* What is new in your approach and why do you think it will be successful? The new approach would use constants, something like this: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER) {code} This would provide: * In-code references/definitions of the possible join types ** This subsequently allows the addition of scaladoc describing what each join type does and how it operates * Removal of the possibility of a typo in the join type * Compile-time checking of the join type (only if an enum is used) To clarify, if JoinType is a constant, it would just fill in the joinType string parameter for users. If an enum is used, it would restrict the domain of possible join types to whatever is defined in the future JoinType enum. The enum is preferred; however, it would take longer to implement. h3. *Q5.* Who cares? If you are successful, what difference will it make? Developers using Apache Spark will care. This will make the join function easier to use and lead to fewer runtime errors. It will save time by bringing join type validation to compile time. It will also provide an in-code reference to the join types, saving developers the time of looking up and navigating the multiple join functions to find the possible join types. In addition, the resulting constants/enum would carry documentation on how each join type works. h3. *Q6.* What are the risks? Users of Apache Spark who currently use strings to define their join types could be impacted if an enum is chosen as the common definition. This risk can be mitigated by using string constants. The string constants would be the exact same strings as the string literals used today. For example: {code:java} JoinType.INNER = "inner" {code} If an enum is still the preferred way of defining the join types, new join functions could be added that take in these enums, and the join calls that take string parameters for joinType could be deprecated. This would give developers a chance to change over to the new join types. h3. *Q7.* How long will it take? A few days for a seasoned Spark developer. h3. *Q8.* What are the mid-term and final "exams" to check for success? The mid-term exam would be the addition of a common definition of the join types and of additional join functions that take in the join type enum/constant. The final exam would be working tests checking the functionality of these new join functions, with the string-based joinType functions deprecated. h3. *Appendix A.* Proposed API Changes. Optional section defining API changes, if any. Backward and forward compatibility must be taken into account. *If enums are used:* The following join function signature would be added to the Dataset API: {code:java} def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame {code} The following functions would be deprecated: {code:java} def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame {code} A new enum would be created called JoinType. Developers would be
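To make the proposal above concrete, here is one possible shape for the string-constant variant. The object and constant names are illustrative, not an agreed-upon API; each constant reduces to exactly the string literal accepted today, which is what keeps the change backward compatible:

{code:java}
// Illustrative sketch of the proposed common definition (string-constant variant).
object JoinType {
  val INNER = "inner"
  val CROSS = "cross"
  val FULL_OUTER = "full_outer"
  val LEFT_OUTER = "left_outer"
  val RIGHT_OUTER = "right_outer"
  val LEFT_SEMI = "left_semi"
  val LEFT_ANTI = "left_anti"
}

// Usage, assuming DataFrames leftDF and rightDF as in the examples above;
// the constant simply fills in the existing joinType string parameter:
// val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER)
{code}

With constants, existing string-based calls keep working unchanged; the enum variant would additionally give compile-time checking at the cost of new overloads.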
[jira] [Updated] (SPARK-26739) Standardized Join Types for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Skyler Lehan updated SPARK-26739: - Description: h3. *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon. Currently, in the join functions on [DataFrames|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset], the join types are defined via a string parameter called joinType. In order for a developer to know which joins are possible, they must look up the API call for join. While this works fine, it can cause the developer to make a typo, resulting in improper joins and/or unexpected errors. The objective of this improvement would be to allow developers to use a common definition for join types (by enum or constants) called JoinTypes. This would contain the possible joins and remove the possibility of a typo. It would also allow Spark to alter the names of the joins in the future without impacting end-users. h3. *Q2.* What problem is this proposal NOT designed to solve? The problem this solves is extremely narrow; it would not solve anything other than providing a common definition for join types. h3. *Q3.* How is it done today, and what are the limits of current practice? Currently, developers must join two DataFrames like so: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer") {code} Here the developer manually types the join type as a string. As stated above, this: * Requires developers to manually type in the join type * Creates the possibility of typos * Restricts renaming of join types, as each is a literal string * Does not restrict or compile-time check the join type being used, leading to runtime errors h3. *Q4.* What is new in your approach and why do you think it will be successful? The new approach would use constants, something like this: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER) {code} This would provide: * In-code references/definitions of the possible join types ** This subsequently allows the addition of scaladoc describing what each join type does and how it operates * Removal of the possibility of a typo in the join type * Compile-time checking of the join type (only if an enum is used) To clarify, if JoinType is a constant, it would just fill in the joinType string parameter for users. If an enum is used, it would restrict the domain of possible join types to whatever is defined in the future JoinType enum. The enum is preferred; however, it would take longer to implement. h3. *Q5.* Who cares? If you are successful, what difference will it make? Developers using Apache Spark will care. This will make the join function easier to use and lead to fewer runtime errors. It will save time by bringing join type validation to compile time. It will also provide an in-code reference to the join types, saving developers the time of looking up and navigating the multiple join functions to find the possible join types. In addition, the resulting constants/enum would carry documentation on how each join type works. h3. *Q6.* What are the risks? Users of Apache Spark who currently use strings to define their join types could be impacted if an enum is chosen as the common definition. This risk can be mitigated by using string constants. The string constants would be the exact same strings as the string literals used today. For example: {code:java} JoinType.INNER = "inner" {code} If an enum is still the preferred way of defining the join types, new join functions could be added that take in these enums, and the join calls that take string parameters for joinType could be deprecated. This would give developers a chance to change over to the new join types. h3. *Q7.* How long will it take? A few days for a seasoned Spark developer. h3. *Q8.* What are the mid-term and final "exams" to check for success? The mid-term exam would be the addition of a common definition of the join types and of additional join functions that take in the join type enum/constant. The final exam would be working tests checking the functionality of these new join functions, with the string-based joinType functions deprecated. h3. *Appendix A.* Proposed API Changes. Optional section defining API changes, if any. Backward and forward compatibility must be taken into account. *If enums are used:* The following join function signature would be added to the Dataset API: {code:java} def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame {code} The following functions would be deprecated: {code:java} def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame {code} A new enum would be created called JoinType. Developers would be
[jira] [Updated] (SPARK-26739) Standardized Join Types for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Skyler Lehan updated SPARK-26739: - Description: h3. *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon. Currently, in the [join functions|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@join(right:org.apache.spark.sql.Dataset[_],joinExprs:org.apache.spark.sql.Column,joinType:String):org.apache.spark.sql.DataFrame] on DataFrames, the join types are defined via a string parameter called joinType. In order for a developer to know which joins are possible, they must look up the API call for join. While this works fine, it can cause the developer to make a typo, resulting in improper joins and/or unexpected errors. The objective of this improvement would be to allow developers to use a common definition for join types (by enum or constants) called JoinTypes. This would contain the possible joins and remove the possibility of a typo. It would also allow Spark to alter the names of the joins in the future without impacting end-users. h3. *Q2.* What problem is this proposal NOT designed to solve? The problem this solves is extremely narrow; it would not solve anything other than providing a common definition for join types. h3. *Q3.* How is it done today, and what are the limits of current practice? Currently, developers must join two DataFrames like so: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer") {code} Here the developer manually types the join type as a string. As stated above, this: * Requires developers to manually type in the join type * Creates the possibility of typos * Restricts renaming of join types, as each is a literal string * Does not restrict or compile-time check the join type being used, leading to runtime errors h3. *Q4.* What is new in your approach and why do you think it will be successful? The new approach would use constants, something like this: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER) {code} This would provide: * In-code references/definitions of the possible join types ** This subsequently allows the addition of scaladoc describing what each join type does and how it operates * Removal of the possibility of a typo in the join type * Compile-time checking of the join type (only if an enum is used) To clarify, if JoinType is a constant, it would just fill in the joinType string parameter for users. If an enum is used, it would restrict the domain of possible join types to whatever is defined in the future JoinType enum. The enum is preferred; however, it would take longer to implement. h3. *Q5.* Who cares? If you are successful, what difference will it make? Developers using Apache Spark will care. This will make the join function easier to use and lead to fewer runtime errors. It will save time by bringing join type validation to compile time. It will also provide an in-code reference to the join types, saving developers the time of looking up and navigating the multiple join functions to find the possible join types. In addition, the resulting constants/enum would carry documentation on how each join type works. h3. *Q6.* What are the risks? Users of Apache Spark who currently use strings to define their join types could be impacted if an enum is chosen as the common definition. This risk can be mitigated by using string constants. The string constants would be the exact same strings as the string literals used today. For example: {code:java} JoinType.INNER = "inner" {code} If an enum is still the preferred way of defining the join types, new join functions could be added that take in these enums, and the join calls that take string parameters for joinType could be deprecated. This would give developers a chance to change over to the new join types. h3. *Q7.* How long will it take? A few days for a seasoned Spark developer. h3. *Q8.* What are the mid-term and final "exams" to check for success? The mid-term exam would be the addition of a common definition of the join types and of additional join functions that take in the join type enum/constant. The final exam would be working tests checking the functionality of these new join functions, with the string-based joinType functions deprecated. h3. *Appendix A.* Proposed API Changes. Optional section defining API changes, if any. Backward and forward compatibility must be taken into account. *If enums are used:* The following join function signature would be added to the Dataset API: {code:java} def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame {code} The following functions would be deprecated: {code:java} def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame def join(right: Dataset[_],
[jira] [Updated] (SPARK-26739) Standardized Join Types for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Skyler Lehan updated SPARK-26739: - Description: h3. *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon. Currently, in the [join functions|"http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@join(right:org.apache.spark.sql.Dataset[_],joinExprs:org.apache.spark.sql.Column,joinType:String):org.apache.spark.sql.DataFrame"] on DataFrames, the join types are defined via a string parameter called joinType. In order for a developer to know which joins are possible, they must look up the API call for join. While this works fine, it can cause the developer to make a typo, resulting in improper joins and/or unexpected errors. The objective of this improvement would be to allow developers to use a common definition for join types (by enum or constants) called JoinTypes. This would contain the possible joins and remove the possibility of a typo. It would also allow Spark to alter the names of the joins in the future without impacting end-users. h3. *Q2.* What problem is this proposal NOT designed to solve? The problem this solves is extremely narrow; it would not solve anything other than providing a common definition for join types. h3. *Q3.* How is it done today, and what are the limits of current practice? Currently, developers must join two DataFrames like so: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer") {code} Here the developer manually types the join type as a string. As stated above, this: * Requires developers to manually type in the join type * Creates the possibility of typos * Restricts renaming of join types, as each is a literal string * Does not restrict or compile-time check the join type being used, leading to runtime errors h3. *Q4.* What is new in your approach and why do you think it will be successful? The new approach would use constants, something like this: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER) {code} This would provide: * In-code references/definitions of the possible join types ** This subsequently allows the addition of scaladoc describing what each join type does and how it operates * Removal of the possibility of a typo in the join type * Compile-time checking of the join type (only if an enum is used) To clarify, if JoinType is a constant, it would just fill in the joinType string parameter for users. If an enum is used, it would restrict the domain of possible join types to whatever is defined in the future JoinType enum. The enum is preferred; however, it would take longer to implement. h3. *Q5.* Who cares? If you are successful, what difference will it make? Developers using Apache Spark will care. This will make the join function easier to use and lead to fewer runtime errors. It will save time by bringing join type validation to compile time. It will also provide an in-code reference to the join types, saving developers the time of looking up and navigating the multiple join functions to find the possible join types. In addition, the resulting constants/enum would carry documentation on how each join type works. h3. *Q6.* What are the risks? Users of Apache Spark who currently use strings to define their join types could be impacted if an enum is chosen as the common definition. This risk can be mitigated by using string constants. The string constants would be the exact same strings as the string literals used today. For example: {code:java} JoinType.INNER = "inner" {code} If an enum is still the preferred way of defining the join types, new join functions could be added that take in these enums, and the join calls that take string parameters for joinType could be deprecated. This would give developers a chance to change over to the new join types. h3. *Q7.* How long will it take? A few days for a seasoned Spark developer. h3. *Q8.* What are the mid-term and final "exams" to check for success? The mid-term exam would be the addition of a common definition of the join types and of additional join functions that take in the join type enum/constant. The final exam would be working tests checking the functionality of these new join functions, with the string-based joinType functions deprecated. h3. *Appendix A.* Proposed API Changes. Optional section defining API changes, if any. Backward and forward compatibility must be taken into account. *If enums are used:* The following join function signature would be added to the Dataset API: {code:java} def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame {code} The following functions would be deprecated: {code:java} def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame def join(right: Dataset[_],
[jira] [Created] (SPARK-26739) Standardized Join Types for DataFrames
Skyler Lehan created SPARK-26739: Summary: Standardized Join Types for DataFrames Key: SPARK-26739 URL: https://issues.apache.org/jira/browse/SPARK-26739 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Skyler Lehan Fix For: 2.4.1 h3. *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon. Currently, in the [join functions|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@join(right:org.apache.spark.sql.Dataset[_],joinExprs:org.apache.spark.sql.Column,joinType:String):org.apache.spark.sql.DataFrame] on DataFrames, the join types are defined via a string parameter called joinType. In order for a developer to know which joins are possible, they must look up the API call for join. While this works fine, it can cause the developer to make a typo, resulting in improper joins and/or unexpected errors. The objective of this improvement would be to allow developers to use a common definition for join types (by enum or constants) called JoinTypes. This would contain the possible joins and remove the possibility of a typo. It would also allow Spark to alter the names of the joins in the future without impacting end-users. h3. *Q2.* What problem is this proposal NOT designed to solve? The problem this solves is extremely narrow; it would not solve anything other than providing a common definition for join types. h3. *Q3.* How is it done today, and what are the limits of current practice? Currently, developers must join two DataFrames like so: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer") {code} Here the developer manually types the join type as a string. As stated above, this: * Requires developers to manually type in the join type * Creates the possibility of typos * Restricts renaming of join types, as each is a literal string * Does not restrict or compile-time check the join type being used, leading to runtime errors h3. *Q4.* What is new in your approach and why do you think it will be successful? The new approach would use constants, something like this: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER) {code} This would provide: * In-code references/definitions of the possible join types ** This subsequently allows the addition of scaladoc describing what each join type does and how it operates * Removal of the possibility of a typo in the join type * Compile-time checking of the join type (only if an enum is used) To clarify, if JoinType is a constant, it would just fill in the joinType string parameter for users. If an enum is used, it would restrict the domain of possible join types to whatever is defined in the future JoinType enum. The enum is preferred; however, it would take longer to implement. h3. *Q5.* Who cares? If you are successful, what difference will it make? Developers using Apache Spark will care. This will make the join function easier to use and lead to fewer runtime errors. It will save time by bringing join type validation to compile time. It will also provide an in-code reference to the join types, saving developers the time of looking up and navigating the multiple join functions to find the possible join types. In addition, the resulting constants/enum would carry documentation on how each join type works. h3. *Q6.* What are the risks? Users of Apache Spark who currently use strings to define their join types could be impacted if an enum is chosen as the common definition. This risk can be mitigated by using string constants. The string constants would be the exact same strings as the string literals used today. For example: {code:java} JoinType.INNER = "inner" {code} If an enum is still the preferred way of defining the join types, new join functions could be added that take in these enums, and the join calls that take string parameters for joinType could be deprecated. This would give developers a chance to change over to the new join types. h3. *Q7.* How long will it take? A few days for a seasoned Spark developer. h3. *Q8.* What are the mid-term and final "exams" to check for success? The mid-term exam would be the addition of a common definition of the join types and of additional join functions that take in the join type enum/constant. The final exam would be working tests checking the functionality of these new join functions, with the string-based joinType functions deprecated. h3. *Appendix A.* Proposed API Changes. Optional section defining API changes, if any. Backward and forward compatibility must be taken into account. *If enums are used:* The following join function signatures would be added to the Dataset API: {code:java} def join(right: Dataset[_], joinExprs: Column,
[jira] [Assigned] (SPARK-26656) Benchmark for date/time functions and expressions
[ https://issues.apache.org/jira/browse/SPARK-26656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26656: Assignee: Apache Spark > Benchmark for date/time functions and expressions > - > > Key: SPARK-26656 > URL: https://issues.apache.org/jira/browse/SPARK-26656 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > Write benchmarks for datetimeExpressions -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
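As a rough illustration of what such a benchmark measures, here is a minimal ad-hoc timing sketch over two datetime expressions. It assumes a running SparkSession named {{spark}}; the real change would use Spark's internal benchmark harness rather than this hand-rolled timing:

{code:java}
// Ad-hoc timing sketch for datetime expressions (illustrative only).
val n = 10000000L

def time(name: String)(f: => Unit): Unit = {
  val start = System.nanoTime()
  f
  println(s"$name took ${(System.nanoTime() - start) / 1e6} ms")
}

// Wrapping the expression in count(...) forces it to be evaluated per row.
time("to_date") {
  spark.range(n).selectExpr("count(to_date(cast(id as timestamp)))").collect()
}
time("date_add") {
  spark.range(n).selectExpr("count(date_add(current_date(), cast(id % 30 as int)))").collect()
}
{code}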
[jira] [Assigned] (SPARK-26656) Benchmark for date/time functions and expressions
[ https://issues.apache.org/jira/browse/SPARK-26656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26656: Assignee: (was: Apache Spark) > Benchmark for date/time functions and expressions > - > > Key: SPARK-26656 > URL: https://issues.apache.org/jira/browse/SPARK-26656 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > Write benchmarks for datetimeExpressions -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes
[ https://issues.apache.org/jira/browse/SPARK-26688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753005#comment-16753005 ] Sergey commented on SPARK-26688: Hi Imran, thanks for your reply. "Meanwhile devs start to apply this willy-nilly, as these configs tend to just keep getting built up over time" An SLA could mitigate this problem: every node blacklisted for a specific job slows it down in the long run, so devs would have to report the node issue to ops and communicate with them to resolve it. "Ideally, blacklisting and speculation should be able to prevent that problem" We are going to try out speculation, but we are not there yet. > Provide configuration of initially blacklisted YARN nodes > - > > Key: SPARK-26688 > URL: https://issues.apache.org/jira/browse/SPARK-26688 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Introducing a new config for initially blacklisted YARN nodes. > This came up on the Apache Spark user mailing list: > [http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-is-it-possible-to-manually-blacklist-nodes-before-running-spark-job-td34395.html] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
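For illustration, such a config would presumably be set like any other Spark configuration at submit time. The key name below is purely hypothetical; the actual key is whatever the ticket's PR defines:

{code:java}
// Hypothetical configuration key, for illustration only; the real key name
// is defined by the PR attached to this ticket.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("example")
  .set("spark.yarn.initially.blacklisted.nodes", "bad-host-1,bad-host-2")
{code}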
[jira] [Updated] (SPARK-26738) Pyspark random forest classifier feature importance with column names
[ https://issues.apache.org/jira/browse/SPARK-26738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Praveen updated SPARK-26738: Description: I am trying to plot the feature importances of a random forest classifier with column names. I am using Spark 2.3.2 and PySpark. The input X is sentences and I am using TF-IDF (HashingTF + IDF) + StringIndexer to generate the feature vectors. I have included all the stages in a Pipeline {code:java} regexTokenizer = RegexTokenizer(gaps=False, inputCol= raw_data_col, outputCol= "words", pattern="[a-zA-Z_]+", toLowercase=True, minTokenLength=minimum_token_size) hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=number_of_feature) idf = IDF(inputCol="rawFeatures", outputCol= feature_vec_col) indexer = StringIndexer(inputCol= label_col_name, outputCol= label_vec_name) converter = IndexToString(inputCol='prediction', outputCol="original_label", labels=indexer.fit(df).labels) feature_pipeline = Pipeline(stages=[regexTokenizer, hashingTF, idf, indexer]) estimator = RandomForestClassifier(labelCol=label_col, featuresCol=features_col, numTrees=100) pipeline = Pipeline(stages=[feature_pipeline, estimator, converter]) model = pipeline.fit(df) {code} I generate the feature importances as {code:java} rdc = model.stages[-2] print (rdc.featureImportances) {code} So far so good, but when I try to map the feature importances to the feature columns as below {code:java} attrs = sorted((attr["idx"], attr["name"]) for attr in (chain(*df_pred.schema["featurescol"].metadata["ml_attr"]["attrs"].values()))) [(name, rdc.featureImportances[idx]) for idx, name in attrs if rdc.featureImportances[idx]]{code} I get a KeyError on ml_attr {code:java} KeyError: 'ml_attr'{code} I printed the dictionary, {code:java} print (df_pred.schema["featurescol"].metadata){code} and it's empty: {}. Any thoughts on what I am doing wrong? How can I map the feature importances to the column names? Thanks was: I am trying to plot the feature importances of a random forest classifier with column names. I am using Spark 2.3.2 and PySpark. The input X is sentences and I am using TF-IDF (HashingTF + IDF) + StringIndexer to generate the feature vectors. I have included all the stages in a Pipeline {{regexTokenizer = RegexTokenizer(gaps=False, inputCol= raw_data_col, outputCol= "words", pattern="[a-zA-Z_]+", toLowercase=True, minTokenLength=minimum_token_size) hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=number_of_feature) idf = IDF(inputCol="rawFeatures", outputCol= feature_vec_col) indexer = StringIndexer(inputCol= label_col_name, outputCol= label_vec_name) converter = IndexToString(inputCol='prediction', outputCol="original_label", labels=indexer.fit(df).labels) feature_pipeline = Pipeline(stages=[regexTokenizer, hashingTF, idf, indexer]) estimator = RandomForestClassifier(labelCol=label_col, featuresCol=features_col, numTrees=100) pipeline = Pipeline(stages=[feature_pipeline, estimator, converter]) model = pipeline.fit(df)}} I generate the feature importances as {code:java} rdc = model.stages[-2] print (rdc.featureImportances) {code} So far so good, but when I try to map the feature importances to the feature columns as below {code:java} attrs = sorted((attr["idx"], attr["name"]) for attr in (chain(*df_pred.schema["featurescol"].metadata["ml_attr"]["attrs"].values()))) [(name, rdc.featureImportances[idx]) for idx, name in attrs if rdc.featureImportances[idx]]{code} I get a KeyError on ml_attr {code:java} KeyError: 'ml_attr'{code} I printed the dictionary, {code:java} print (df_pred.schema["featurescol"].metadata){code} and it's empty: {}. Any thoughts on what I am doing wrong? How can I map the feature importances to the column names? Thanks > Pyspark random forest classifier feature importance with column names > - > > Key: SPARK-26738 > URL: https://issues.apache.org/jira/browse/SPARK-26738 > Project: Spark > Issue Type: Question > Components: ML >Affects Versions: 2.3.2 > Environment: {code:java} > {code} >Reporter: Praveen >Priority: Major > Labels: RandomForest, pyspark > > I am trying to plot the feature importances of a random forest classifier > with column names. I am using Spark 2.3.2 and PySpark. > The input X is sentences and I am using TF-IDF (HashingTF + IDF) + > StringIndexer to generate the feature vectors. > I have included all the stages in a Pipeline > > {code:java} > regexTokenizer = RegexTokenizer(gaps=False, inputCol= raw_data_col, > outputCol= "words", pattern="[a-zA-Z_]+", toLowercase=True, > minTokenLength=minimum_token_size) > hashingTF = HashingTF(inputCol="words",
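A side note on the question above: {{ml_attr}} metadata is only attached when the features vector is assembled from named columns (for example via VectorAssembler); hashed TF-IDF features carry no per-slot names, which would explain the empty metadata here. For reference, a small Scala sketch of reading slot names when they do exist, assuming a DataFrame {{df}} with an assembled {{features}} column:

{code:java}
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.DataFrame

// Returns (slot index, column name) pairs from the vector column's metadata,
// or an empty sequence when no ml_attr metadata is present (as with HashingTF).
def slotNames(df: DataFrame, featuresCol: String = "features"): Seq[(Int, String)] = {
  val group = AttributeGroup.fromStructField(df.schema(featuresCol))
  group.attributes.toSeq.flatten.flatMap { attr =>
    for (i <- attr.index; n <- attr.name) yield (i, n)
  }
}
{code}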
[jira] [Updated] (SPARK-26731) remove EOLed spark jobs from jenkins
[ https://issues.apache.org/jira/browse/SPARK-26731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bogan updated SPARK-26731: Attachment: LICENSE > remove EOLed spark jobs from jenkins > > > Key: SPARK-26731 > URL: https://issues.apache.org/jira/browse/SPARK-26731 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 1.6.3, 2.0.2, 2.1.3 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > Attachments: LICENSE, activemq-cli-tools-4a984ec.tar.gz > > > i will disable, but not remove (yet), the branch-specific builds for 1.6, 2.0 > and 2.1 on jenkins. > these include all test builds, as well as docs, lint, compile, and packaging. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26738) Pyspark random forest classifier feature importance with column names
Praveen created SPARK-26738: --- Summary: Pyspark random forest classifier feature importance with column names Key: SPARK-26738 URL: https://issues.apache.org/jira/browse/SPARK-26738 Project: Spark Issue Type: Question Components: ML Affects Versions: 2.3.2 Environment: {code:java} {code} Reporter: Praveen I am trying to plot the feature importances of a random forest classifier with column names. I am using Spark 2.3.2 and PySpark. The input X is sentences and I am using TF-IDF (HashingTF + IDF) + StringIndexer to generate the feature vectors. I have included all the stages in a Pipeline {{regexTokenizer = RegexTokenizer(gaps=False, inputCol= raw_data_col, outputCol= "words", pattern="[a-zA-Z_]+", toLowercase=True, minTokenLength=minimum_token_size) hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=number_of_feature) idf = IDF(inputCol="rawFeatures", outputCol= feature_vec_col) indexer = StringIndexer(inputCol= label_col_name, outputCol= label_vec_name) converter = IndexToString(inputCol='prediction', outputCol="original_label", labels=indexer.fit(df).labels) feature_pipeline = Pipeline(stages=[regexTokenizer, hashingTF, idf, indexer]) estimator = RandomForestClassifier(labelCol=label_col, featuresCol=features_col, numTrees=100) pipeline = Pipeline(stages=[feature_pipeline, estimator, converter]) model = pipeline.fit(df)}} I generate the feature importances as {code:java} rdc = model.stages[-2] print (rdc.featureImportances) {code} So far so good, but when I try to map the feature importances to the feature columns as below {code:java} attrs = sorted((attr["idx"], attr["name"]) for attr in (chain(*df_pred.schema["featurescol"].metadata["ml_attr"]["attrs"].values()))) [(name, rdc.featureImportances[idx]) for idx, name in attrs if rdc.featureImportances[idx]]{code} I get a KeyError on ml_attr {code:java} KeyError: 'ml_attr'{code} I printed the dictionary, {code:java} print (df_pred.schema["featurescol"].metadata){code} and it's empty: {}. Any thoughts on what I am doing wrong? How can I map the feature importances to the column names? Thanks -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26722) add SPARK_TEST_KEY=1 to pull request builder and spark-master-test-sbt-hadoop-2.7
[ https://issues.apache.org/jira/browse/SPARK-26722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bogan updated SPARK-26722: Attachment: SPARK-26731.doc > add SPARK_TEST_KEY=1 to pull request builder and > spark-master-test-sbt-hadoop-2.7 > - > > Key: SPARK-26722 > URL: https://issues.apache.org/jira/browse/SPARK-26722 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > Attachments: SPARK-26731.doc > > > from https://github.com/apache/spark/pull/23117: > we need to add the {{SPARK_TEST_KEY=1}} env var to both the GHPRB and > {{spark-master-test-sbt-hadoop-2.7}} builds. > this is done for the PRB, and was manually added to the > {{spark-master-test-sbt-hadoop-2.7}} build. > i will leave this open until i finish porting the JJB configs in to the main > spark repo (for the {{spark-master-test-sbt-hadoop-2.7}} build). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-26731) remove EOLed spark jobs from jenkins
[ https://issues.apache.org/jira/browse/SPARK-26731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bogan updated SPARK-26731: Comment: was deleted (was: {code:java} // code placeholder {code} $ cd infrastructure/content/dev/ $ $EDITOR infra-site.mdtext $ svn commit ... wait a short while for the page to be rebuilt ... ... **ONLY IF** you are an ASF Member, then publish: ... $ curl -sL http://s.apache.org/cms-cli | perl}) > remove EOLed spark jobs from jenkins > > > Key: SPARK-26731 > URL: https://issues.apache.org/jira/browse/SPARK-26731 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 1.6.3, 2.0.2, 2.1.3 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > Attachments: LICENSE, activemq-cli-tools-4a984ec.tar.gz > > > i will disable, but not remove (yet), the branch-specific builds for 1.6, 2.0 > and 2.1 on jenkins. > these include all test builds, as well as docs, lint, compile, and packaging. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26731) remove EOLed spark jobs from jenkins
[ https://issues.apache.org/jira/browse/SPARK-26731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16752988#comment-16752988 ] Chris Bogan commented on SPARK-26731: - {code:java} // code placeholder {code} $ cd infrastructure/content/dev/ $ $EDITOR infra-site.mdtext $ svn commit ... wait a short while for the page to be rebuilt ... ... **ONLY IF** you are an ASF Member, then publish: ... $ curl -sL http://s.apache.org/cms-cli | perl} > remove EOLed spark jobs from jenkins > > > Key: SPARK-26731 > URL: https://issues.apache.org/jira/browse/SPARK-26731 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 1.6.3, 2.0.2, 2.1.3 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > Attachments: LICENSE, activemq-cli-tools-4a984ec.tar.gz > > > i will disable, but not remove (yet), the branch-specific builds for 1.6, 2.0 > and 2.1 on jenkins. > these include all test builds, as well as docs, lint, compile, and packaging. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26327) Metrics in FileSourceScanExec not update correctly while relation.partitionSchema is set
[ https://issues.apache.org/jira/browse/SPARK-26327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bogan updated SPARK-26327: Attachment: request_handler_interface.json > Metrics in FileSourceScanExec not update correctly while > relation.partitionSchema is set > > > Key: SPARK-26327 > URL: https://issues.apache.org/jira/browse/SPARK-26327 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 2.2.3, 2.3.3, 2.4.1, 3.0.0 > > Attachments: Homepage - Material Design, > apache-opennlp-1.9.1-bin.tar.gz, request_handler_interface.json > > > In the current approach in `FileSourceScanExec`, the "numFiles" and > "metadataTime" (file listing time) metrics are updated when the lazy val > `selectedPartitions` is initialized, in the scenario where relation.partitionSchema > is set. But `selectedPartitions` is first initialized through `metadata`, which is > called by `queryExecution.toString` in `SQLExecution.withNewExecutionId`. So when > `SQLMetrics.postDriverMetricUpdates` is called, there is no corresponding live > execution in SQLAppStatusListener, and the metrics update does not work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
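To spell out the mechanism behind this bug: a lazy val with a side effect fires wherever it is first touched, for example from a {{toString}} call, not from the execution path that expects it. A self-contained, generic illustration of that hazard (plain Scala, not the actual Spark classes):

{code:java}
// Generic illustration: the side effect inside the lazy val fires on first
// touch, here from toString, before any "execution" bookkeeping has happened.
class Scan {
  var metricPosted: Boolean = false  // stand-in for SQLMetrics.postDriverMetricUpdates
  lazy val selectedPartitions: Seq[String] = {
    metricPosted = true
    Seq("part=0", "part=1")
  }
  override def toString: String = s"Scan(${selectedPartitions.size} partitions)"
}

object LazyValDemo extends App {
  val scan = new Scan
  println(scan)               // printing the "plan" initializes the lazy val...
  println(scan.metricPosted)  // ...so the "metric" was already posted: true
}
{code}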
[jira] [Updated] (SPARK-26731) remove EOLed spark jobs from jenkins
[ https://issues.apache.org/jira/browse/SPARK-26731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bogan updated SPARK-26731: Attachment: activemq-cli-tools-4a984ec.tar.gz > remove EOLed spark jobs from jenkins > > > Key: SPARK-26731 > URL: https://issues.apache.org/jira/browse/SPARK-26731 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 1.6.3, 2.0.2, 2.1.3 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > Attachments: LICENSE, activemq-cli-tools-4a984ec.tar.gz > > > i will disable, but not remove (yet), the branch-specific builds for 1.6, 2.0 > and 2.1 on jenkins. > these include all test builds, as well as docs, lint, compile, and packaging. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26327) Metrics in FileSourceScanExec not update correctly while relation.partitionSchema is set
[ https://issues.apache.org/jira/browse/SPARK-26327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bogan updated SPARK-26327: Attachment: apache-opennlp-1.9.1-bin.tar.gz > Metrics in FileSourceScanExec not update correctly while > relation.partitionSchema is set > > > Key: SPARK-26327 > URL: https://issues.apache.org/jira/browse/SPARK-26327 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 2.2.3, 2.3.3, 2.4.1, 3.0.0 > > Attachments: Homepage - Material Design, > apache-opennlp-1.9.1-bin.tar.gz > > > In the current approach in `FileSourceScanExec`, the "numFiles" and > "metadataTime" (file listing time) metrics are updated when the lazy val > `selectedPartitions` is initialized, in the scenario where relation.partitionSchema > is set. But `selectedPartitions` is first initialized through `metadata`, which is > called by `queryExecution.toString` in `SQLExecution.withNewExecutionId`. So when > `SQLMetrics.postDriverMetricUpdates` is called, there is no corresponding live > execution in SQLAppStatusListener, and the metrics update does not work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26327) Metrics in FileSourceScanExec not update correctly while relation.partitionSchema is set
[ https://issues.apache.org/jira/browse/SPARK-26327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bogan updated SPARK-26327: Attachment: Homepage - Material Design > Metrics in FileSourceScanExec not update correctly while > relation.partitionSchema is set > > > Key: SPARK-26327 > URL: https://issues.apache.org/jira/browse/SPARK-26327 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 2.2.3, 2.3.3, 2.4.1, 3.0.0 > > Attachments: Homepage - Material Design > > > In the current approach in `FileSourceScanExec`, the "numFiles" and > "metadataTime" (file listing time) metrics are updated when the lazy val > `selectedPartitions` is initialized, in the scenario where relation.partitionSchema > is set. But `selectedPartitions` is first initialized through `metadata`, which is > called by `queryExecution.toString` in `SQLExecution.withNewExecutionId`. So when > `SQLMetrics.postDriverMetricUpdates` is called, there is no corresponding live > execution in SQLAppStatusListener, and the metrics update does not work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26689) Disk broken causing broadcast failure
[ https://issues.apache.org/jira/browse/SPARK-26689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16752961#comment-16752961 ] liupengcheng commented on SPARK-26689: -- [~tgraves] In production environments, yarn.nodemanager.local-dirs is always configured with multiple directories mounted on different disks, so I think that since we use this parameter in Spark, we should also make use of this layout and should not expect job failure when encountering only a single disk error. The PR I put up can also reduce FetchFailures, and even job failures caused by FetchFailed, when the blacklist is not enabled or the node is not blacklisted (tasks may be repeatedly scheduled to the unhealthy node). > Disk broken causing broadcast failure > - > > Key: SPARK-26689 > URL: https://issues.apache.org/jira/browse/SPARK-26689 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.4.0 > Environment: Spark on Yarn > Multiple Disks >Reporter: liupengcheng >Priority: Major > > We encountered an application failure in our production cluster which was > caused by bad disk problems. > {code:java} > Job aborted due to stage failure: Task serialization failed: > java.io.IOException: Failed to create local dir in > /home/work/hdd5/yarn/c3prc-hadoop/nodemanager/usercache/h_user_profile/appcache/application_1463372393999_144979/blockmgr-1f96b724-3e16-4c09-8601-1a2e3b758185/3b. > org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:73) > org.apache.spark.storage.DiskStore.contains(DiskStore.scala:173) > org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$getCurrentBlockStatus(BlockManager.scala:391) > org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801) > org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:629) > org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:987) > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99) > org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85) > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) > org.apache.spark.SparkContext.broadcast(SparkContext.scala:1332) > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:863) > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1090) > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086) > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086) > scala.Option.foreach(Option.scala:236) > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1086) > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1085) > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1085) > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1528) > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1493) > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1482) > org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > {code} > We have multiple disks on our cluster nodes; however, it still fails. I think > it's because Spark does not currently handle bad disks in `DiskBlockManager`. > We could handle bad disks in a multi-disk environment to avoid application failure. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
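A minimal sketch of the kind of defensive handling being proposed: probe each configured local dir once and keep only the writable ones, instead of failing the whole application on the first bad disk. This is illustrative only, not the actual {{DiskBlockManager}} change:

{code:java}
import java.io.File

// Keep only local dirs where a file can actually be created; a stand-in for
// how DiskBlockManager could skip broken disks instead of failing the job.
def healthyLocalDirs(configuredDirs: Seq[String]): Seq[File] = {
  configuredDirs.map(new File(_)).filter { dir =>
    try {
      dir.mkdirs()  // ensure the directory exists; ignore the result
      val probe = File.createTempFile("probe", ".tmp", dir)
      probe.delete()  // probe succeeded: the disk is writable
    } catch {
      case _: Exception => false  // broken or read-only disk: drop it
    }
  }
}

// Usage sketch: pick block directories only from the healthy set.
// val dirs = healthyLocalDirs(sys.env.getOrElse("LOCAL_DIRS", "/tmp").split(",").toSeq)
{code}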