[jira] [Assigned] (SPARK-26599) Broadcast hint cannot work with PruneFileSourcePartitions

2019-01-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26599:


Assignee: (was: Apache Spark)

> Broadcast hint cannot work with PruneFileSourcePartitions
> ---
>
> Key: SPARK-26599
> URL: https://issues.apache.org/jira/browse/SPARK-26599
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: eaton
>Priority: Major
>
> The broadcast hint cannot work with `PruneFileSourcePartitions`. For example, when 
> the filter condition p is a partition field, table b in the SQL below cannot be 
> broadcast.
> `sql("select /*+ broadcastjoin(b) */ * from (select a from empty_test where p=1) a " +
>   "join (select a,b from par_1 where p=1) b on a.a=b.a").explain`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26599) Broadcast hint cannot work with PruneFileSourcePartitions

2019-01-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26599:


Assignee: Apache Spark

> Broadcast hint cannot work with PruneFileSourcePartitions
> ---
>
> Key: SPARK-26599
> URL: https://issues.apache.org/jira/browse/SPARK-26599
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: eaton
>Assignee: Apache Spark
>Priority: Major
>
> The broadcast hint cannot work with `PruneFileSourcePartitions`. For example, when 
> the filter condition p is a partition field, table b in the SQL below cannot be 
> broadcast.
> `sql("select /*+ broadcastjoin(b) */ * from (select a from empty_test where p=1) a " +
>   "join (select a,b from par_1 where p=1) b on a.a=b.a").explain`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26598) Fix HiveThriftServer2 set hiveconf and hivevar in every sql

2019-01-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26598:


Assignee: (was: Apache Spark)

> Fix HiveThriftServer2 set hiveconf and hivevar in every sql
> ---
>
> Key: SPARK-26598
> URL: https://issues.apache.org/jira/browse/SPARK-26598
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: wangtao93
>Priority: Major
>
> [https://github.com/apache/spark/pull/17886] added support for --hiveconf and 
> --hivevar to HiveServer2. However, it sets hiveconf and hivevar for every SQL 
> statement in the class SparkSQLOperationManager, which I think is not suitable. So 
> I made a small modification to set --hiveconf and --hivevar in the class 
> SparkSQLSessionManager, so that they are applied only once when a HiveServer2 
> session is opened, instead of being re-initialized for every SQL statement.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26598) Fix HiveThriftServer2 set hiveconf and hivevar in every sql

2019-01-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26598:


Assignee: Apache Spark

> Fix HiveThriftServer2 set hiveconf and hivevar in every sql
> ---
>
> Key: SPARK-26598
> URL: https://issues.apache.org/jira/browse/SPARK-26598
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: wangtao93
>Assignee: Apache Spark
>Priority: Major
>
> [https://github.com/apache/spark/pull/17886] added support for --hiveconf and 
> --hivevar to HiveServer2. However, it sets hiveconf and hivevar for every SQL 
> statement in the class SparkSQLOperationManager, which I think is not suitable. So 
> I made a small modification to set --hiveconf and --hivevar in the class 
> SparkSQLSessionManager, so that they are applied only once when a HiveServer2 
> session is opened, instead of being re-initialized for every SQL statement.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26599) Broadcast hint cannot work with PruneFileSourcePartitions

2019-01-10 Thread eaton (JIRA)
eaton created SPARK-26599:
-

 Summary: Broadcast hint cannot work with 
PruneFileSourcePartitions
 Key: SPARK-26599
 URL: https://issues.apache.org/jira/browse/SPARK-26599
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: eaton


The broadcast hint cannot work with `PruneFileSourcePartitions`. For example, when the 
filter condition p is a partition field, table b in the SQL below cannot be broadcast.

`sql("select /*+ broadcastjoin(b) */ * from (select a from empty_test where p=1) a " +
  "join (select a,b from par_1 where p=1) b on a.a=b.a").explain`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26580) remove Scala 2.11 hack for Scala UDF

2019-01-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26580.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23498
[https://github.com/apache/spark/pull/23498]

> remove Scala 2.11 hack for Scala UDF
> 
>
> Key: SPARK-26580
> URL: https://issues.apache.org/jira/browse/SPARK-26580
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26404) set spark.pyspark.python or PYSPARK_PYTHON doesn't work in k8s client-cluster mode.

2019-01-10 Thread Nihar Sheth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16740011#comment-16740011
 ] 

Nihar Sheth commented on SPARK-26404:
-

Yup, I ran into the same issue then. What I've learned is that PYSPARK_PYTHON 
isn't picked up by the executors on their own; it looks like the driver propagates 
the value to the executors, so it needs to be set in the driver's environment. As 
for spark.pyspark.python, I'm not entirely sure why it doesn't work. @squito 
suggested that it might be a code path only covered by spark-submit. I might look 
into it in a couple of weeks when I have some spare cycles, but if anyone else 
wants to give it a go, feel free to ping me with any questions and I can try to 
help.

In the meantime, setting PYSPARK_PYTHON= in the driver was 
a valid workaround for me: either set os.environ['PYSPARK_PYTHON']= 
before calling SparkSession.builder, or export the environment variable in 
spark-env.sh.

> set spark.pyspark.python or PYSPARK_PYTHON doesn't work in k8s client-cluster 
> mode.
> ---
>
> Key: SPARK-26404
> URL: https://issues.apache.org/jira/browse/SPARK-26404
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Dongqing  Liu
>Priority: Major
>
> Neither
>    conf.set("spark.executorEnv.PYSPARK_PYTHON", "/opt/pythonenvs/bin/python")
> nor 
>   conf.set("spark.pyspark.python", "/opt/pythonenvs/bin/python") 
> works. 
> Looks like the executor always picks python from PATH.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26598) Fix HiveThriftServer2 set hiveconf and hivevar in every sql

2019-01-10 Thread wangtao93 (JIRA)
wangtao93 created SPARK-26598:
-

 Summary: Fix HiveThriftServer2 set hiveconf and hivevar in every 
sql
 Key: SPARK-26598
 URL: https://issues.apache.org/jira/browse/SPARK-26598
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0, 2.3.0
Reporter: wangtao93


[https://github.com/apache/spark/pull/17886] added support for --hiveconf and --hivevar 
to HiveServer2. However, it sets hiveconf and hivevar for every SQL statement in the 
class SparkSQLOperationManager, which I think is not suitable. So I made a small 
modification to set --hiveconf and --hivevar in the class SparkSQLSessionManager, so 
that they are applied only once when a HiveServer2 session is opened, instead of being 
re-initialized for every SQL statement.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26598) Fix HiveThriftServer2 set hiveconf and hivevar in every sql

2019-01-10 Thread wangtao93 (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangtao93 updated SPARK-26598:
--
Description: [https://github.com/apache/spark/pull/17886] added support for --hiveconf 
and --hivevar to HiveServer2. However, it sets hiveconf and hivevar for every SQL 
statement in the class SparkSQLOperationManager, which I think is not suitable. So I 
made a small modification to set --hiveconf and --hivevar in the class 
SparkSQLSessionManager, so that they are applied only once when a HiveServer2 session 
is opened, instead of being re-initialized for every SQL statement.  (was: 
[https://github.com/apache/spark/pull/17886,] this pr provide that hiveserver2 
support --haveconf  and --hivevar。But it set hiveconf and hivevar in every sql 
in class SparkSQLOperationManager,i think this is not suitable。So i make a 
little modify to set --hiveconf and --hivevar in class SparkSQLSessionManager, 
it will only run once in open 

HiveServer2 session, instead of ervery sql to init --hiveconf and --hivevar)

> Fix HiveThriftServer2 set hiveconf and hivevar in every sql
> ---
>
> Key: SPARK-26598
> URL: https://issues.apache.org/jira/browse/SPARK-26598
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: wangtao93
>Priority: Major
>
> [https://github.com/apache/spark/pull/17886] added support for --hiveconf and 
> --hivevar to HiveServer2. However, it sets hiveconf and hivevar for every SQL 
> statement in the class SparkSQLOperationManager, which I think is not suitable. So 
> I made a small modification to set --hiveconf and --hivevar in the class 
> SparkSQLSessionManager, so that they are applied only once when a HiveServer2 
> session is opened, instead of being re-initialized for every SQL statement.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26404) set spark.pyspark.python or PYSPARK_PYTHON doesn't work in k8s client-cluster mode.

2019-01-10 Thread Dongqing Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739980#comment-16739980
 ] 

Dongqing  Liu commented on SPARK-26404:
---

I use SparkSession in a jupyter notebook. 
{code:java}
SparkSession.builder.config(conf=conf).getOrCreate()
{code}

> set spark.pyspark.python or PYSPARK_PYTHON doesn't work in k8s client-cluster 
> mode.
> ---
>
> Key: SPARK-26404
> URL: https://issues.apache.org/jira/browse/SPARK-26404
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Dongqing  Liu
>Priority: Major
>
> Neither
>    conf.set("spark.executorEnv.PYSPARK_PYTHON", "/opt/pythonenvs/bin/python")
> nor 
>   conf.set("spark.pyspark.python", "/opt/pythonenvs/bin/python") 
> works. 
> Looks like the executor always picks python from PATH.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26597) Support using images with different entrypoints on Kubernetes

2019-01-10 Thread Patrick Clay (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Clay updated SPARK-26597:
-
External issue ID: https://github.com/jupyter/docker-stacks/issues/797

> Support using images with different entrypoints on Kubernetes
> -
>
> Key: SPARK-26597
> URL: https://issues.apache.org/jira/browse/SPARK-26597
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Patrick Clay
>Priority: Minor
>
> I wish to use arbitrary pre-existing Docker images containing Spark with 
> Kubernetes.
> Specifically, I wish to use 
> [jupyter/all-spark-notebook|https://hub.docker.com/r/jupyter/all-spark-notebook]
>  in in-cluster client mode, ideally without image modification (I think using 
> images maintained by others is a key advantage of Docker).
> The image contains the full Spark 2.4 binary tarball, including Spark's 
> entrypoint.sh, but I need to create a child image that sets entrypoint.sh as the 
> entrypoint, because Spark does not let the user specify a Kubernetes command. I 
> also needed separate images for the kernel/driver and the executor, because the 
> kernel needs Jupyter's entrypoint. Building and setting the executor image works, 
> but it is obnoxious to do just to set the entrypoint.
> The crux of this feature request is to add a property for the executor (and 
> driver) command so that it can point to entrypoint.sh.
> I personally don't see why entrypoint.sh is needed at all instead of making the 
> command be _spark-class 
> org.apache.spark.executor.CoarseGrainedExecutorBackend ..._, which seems a 
> lot more portable (albeit reliant on the PATH).
> Speaking of reliance on the PATH, the image also broke because it doesn't set 
> JAVA_HOME and installs tini from Conda, putting it on a different path. These 
> are smaller issues; I'll file an issue on them and try to work them out 
> between here and there.
> In general, shouldn't Spark on Kubernetes be less coupled to the layout of the image? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26597) Support using images with different entrypoints on Kubernetes

2019-01-10 Thread Patrick Clay (JIRA)
Patrick Clay created SPARK-26597:


 Summary: Support using images with different entrypoints on 
Kubernetes
 Key: SPARK-26597
 URL: https://issues.apache.org/jira/browse/SPARK-26597
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Patrick Clay


I wish to use arbitrary pre-existing Docker images containing Spark with 
Kubernetes.

Specifically, I wish to use 
[jupyter/all-spark-notebook|https://hub.docker.com/r/jupyter/all-spark-notebook]
 in in-cluster client mode, ideally without image modification (I think using 
images maintained by others is a key advantage of Docker).

The image contains the full Spark 2.4 binary tarball, including Spark's entrypoint.sh, 
but I need to create a child image that sets entrypoint.sh as the entrypoint, because 
Spark does not let the user specify a Kubernetes command. I also needed separate images 
for the kernel/driver and the executor, because the kernel needs Jupyter's entrypoint. 
Building and setting the executor image works, but it is obnoxious to do just to set 
the entrypoint.

The crux of this feature request is to add a property for the executor (and driver) 
command so that it can point to entrypoint.sh.

I personally don't see why entrypoint.sh is needed at all instead of making the command 
be _spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend ..._, which 
seems a lot more portable (albeit reliant on the PATH).

Speaking of reliance on the PATH, the image also broke because it doesn't set 
JAVA_HOME and installs tini from Conda, putting it on a different path. These are 
smaller issues; I'll file an issue on them and try to work them out between here and 
there.

In general, shouldn't Spark on Kubernetes be less coupled to the layout of the image? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-24009) spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/aaaaab'

2019-01-10 Thread ant_nebula (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ant_nebula updated SPARK-24009:
---
Comment: was deleted

(was: any progress here?  i met the same error too.)

> spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab' 
> -
>
> Key: SPARK-24009
> URL: https://issues.apache.org/jira/browse/SPARK-24009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: chris_j
>Priority: Major
>
> In local mode, Spark executes "INSERT OVERWRITE LOCAL DIRECTORY" successfully.
> On YARN, Spark fails to execute "INSERT OVERWRITE LOCAL DIRECTORY", and it is not 
> a permission problem either.
>  
> 1. spark-sql -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab' row 
> format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
> default.dim_date" writes to the local directory successfully.
> 2. spark-sql --master yarn -e "INSERT OVERWRITE DIRECTORY 'ab' row format 
> delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
> default.dim_date" writes to HDFS successfully.
> 3. spark-sql --master yarn -e "INSERT OVERWRITE LOCAL DIRECTORY 
> '/home/spark/ab' row format delimited FIELDS TERMINATED BY '\t' STORED AS 
> TEXTFILE select * from default.dim_date" fails to write to the local directory on 
> YARN.
>  
>  
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: Mkdirs failed to create 
> [file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0|file://home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0]
>  (exists=false, 
> cwd=[file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02|file://data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02])
>  at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
>  at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123)
>  at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
>  ... 8 more
>  Caused by: java.io.IOException: Mkdirs failed to create 
> [file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0|file://home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0]
>  (exists=false, 
> cwd=[file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02|file://data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02])
>  at 
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447)
>  at 
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
>  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
>  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
>  at 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.getHiveRecordWriter(HiveIgnoreKeyTextOutputFormat.java:80)
>  at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261)
>  at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For 

[jira] [Commented] (SPARK-24009) spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/aaaaab'

2019-01-10 Thread ant_nebula (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739952#comment-16739952
 ] 

ant_nebula commented on SPARK-24009:


In YARN client mode, "INSERT OVERWRITE LOCAL DIRECTORY" does not write back to the 
driver node,

but to the YARN NodeManager node where the last task ran, and you do not have 
permission to create the directory '/home/spark/ab' there.

Of course, this is not reasonable. It should write back to the driver.

> spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab' 
> -
>
> Key: SPARK-24009
> URL: https://issues.apache.org/jira/browse/SPARK-24009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: chris_j
>Priority: Major
>
> In local mode, Spark executes "INSERT OVERWRITE LOCAL DIRECTORY" successfully.
> On YARN, Spark fails to execute "INSERT OVERWRITE LOCAL DIRECTORY", and it is not 
> a permission problem either.
>  
> 1. spark-sql -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab' row 
> format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
> default.dim_date" writes to the local directory successfully.
> 2. spark-sql --master yarn -e "INSERT OVERWRITE DIRECTORY 'ab' row format 
> delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
> default.dim_date" writes to HDFS successfully.
> 3. spark-sql --master yarn -e "INSERT OVERWRITE LOCAL DIRECTORY 
> '/home/spark/ab' row format delimited FIELDS TERMINATED BY '\t' STORED AS 
> TEXTFILE select * from default.dim_date" fails to write to the local directory on 
> YARN.
>  
>  
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: Mkdirs failed to create 
> [file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0|file://home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0]
>  (exists=false, 
> cwd=[file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02|file://data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02])
>  at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
>  at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123)
>  at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
>  ... 8 more
>  Caused by: java.io.IOException: Mkdirs failed to create 
> [file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0|file://home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0]
>  (exists=false, 
> cwd=[file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02|file://data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02])
>  at 
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447)
>  at 
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
>  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
>  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
>  at 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.getHiveRecordWriter(HiveIgnoreKeyTextOutputFormat.java:80)
>  at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261)
>  at 
> 

[jira] [Created] (SPARK-26596) sparksql "insert overwrite local directory" does not write back to driver node

2019-01-10 Thread ant_nebula (JIRA)
ant_nebula created SPARK-26596:
--

 Summary: sparksql "insert overwrite local directory" does not 
write back to driver node
 Key: SPARK-26596
 URL: https://issues.apache.org/jira/browse/SPARK-26596
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: ant_nebula


/data/apps/spark/bin/spark-sql --master yarn --deploy-mode client 
--driver-memory 4g --executor-memory 4g --num-executors 3 -e "insert overwrite 
local directory '/tmp/spark_result/' select * from tableA limit 3"

The result of this SQL is not written back to the driver machine that submitted it, 
but to the YARN NodeManager machine instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26595) Allow delegation token renewal without a keytab

2019-01-10 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-26595:
--

 Summary: Allow delegation token renewal without a keytab
 Key: SPARK-26595
 URL: https://issues.apache.org/jira/browse/SPARK-26595
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Marcelo Vanzin


Currently the delegation token renewal feature requires the user to provide 
Spark with a keytab.

It would be nice for this to also be supported when the user doesn't have a 
keytab, as long as the user keeps a valid kerberos login. Spark has access to 
the user's credential cache in that case, and can keep tokens updated much like 
in the keytab case.

It's not as automatic as with keytabs, but can help in some environments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24493) Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3

2019-01-10 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24493.

Resolution: Later

Given that Spark does not publish Hadoop 2.8-based binaries, and I assume this 
problem does not exist in 2.7, I think there's nothing to do here.

When there's explicit support for Hadoop 3 it will surely be based on at least 
3.1.1 which has the fix.

So closing this one.

> Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
> --
>
> Key: SPARK-24493
> URL: https://issues.apache.org/jira/browse/SPARK-24493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Asif M
>Priority: Major
>
> Kerberos ticket renewal is failing on a long-running Spark job. I have added the 
> two Kerberos properties below to the HDFS configuration and ran a Spark 
> streaming job 
> ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py])
> {noformat}
> dfs.namenode.delegation.token.max-lifetime=180 (30min)
> dfs.namenode.delegation.token.renew-interval=90 (15min)
> {noformat}
>  
> The Spark job failed at 15 min with the error below:
> {noformat}
> 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at 
> /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381)
>  failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 
> 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 
> (TID 7290, , executor 1): 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, 
> renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, 
> sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 
> 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)
> at org.apache.hadoop.ipc.Client.call(Client.java:1445)
> at org.apache.hadoop.ipc.Client.call(Client.java:1355)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834)
> at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950)
> at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:186)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at 

[jira] [Created] (SPARK-26594) DataSourceOptions.asMap should return CaseInsensitiveMap

2019-01-10 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-26594:


 Summary: DataSourceOptions.asMap should return CaseInsensitiveMap
 Key: SPARK-26594
 URL: https://issues.apache.org/jira/browse/SPARK-26594
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Shixiong Zhu


I'm pretty surprised that the following code fails.
{code}
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.v2.DataSourceOptions

val map = new DataSourceOptions(Map("fooBar" -> "x").asJava).asMap
assert(map.get("fooBar") == "x")
{code}

It would be better to make DataSourceOptions.asMap return a CaseInsensitiveMap.
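
For illustration, a small sketch of the case-insensitive lookup behavior being asked 
for, assuming the Spark 2.4 DataSourceOptions behavior described above. The 
java.util.TreeMap wrapper below is only a stand-in used to demonstrate the expected 
semantics, not the proposed implementation:

{code:scala}
import java.util.{TreeMap => JTreeMap}
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.v2.DataSourceOptions

val options = new DataSourceOptions(Map("fooBar" -> "x").asJava)

// asMap currently exposes the keys in their lower-cased form, so a lookup with
// the original casing returns null:
assert(options.asMap.get("fooBar") == null)
assert(options.asMap.get("foobar") == "x")

// A case-insensitive view over the same entries behaves the way the assertion
// in the description expects.
val ci = new JTreeMap[String, String](String.CASE_INSENSITIVE_ORDER)
ci.putAll(options.asMap)
assert(ci.get("fooBar") == "x")
assert(ci.get("FOOBAR") == "x")
{code}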



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24902) Add integration tests for PVs

2019-01-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24902:


Assignee: Apache Spark

> Add integration tests for PVs
> -
>
> Key: SPARK-24902
> URL: https://issues.apache.org/jira/browse/SPARK-24902
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Assignee: Apache Spark
>Priority: Minor
>
> Support for PVs and hostPath volumes has recently been added 
> (https://github.com/apache/spark/pull/21260/files) for Spark on K8s. 
> We should have some integration tests based on local storage. 
> It is easy to add PVs to minikube and attach them to pods.
> We could target a known dir like /tmp for the PV path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24902) Add integration tests for PVs

2019-01-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24902:


Assignee: (was: Apache Spark)

> Add integration tests for PVs
> -
>
> Key: SPARK-24902
> URL: https://issues.apache.org/jira/browse/SPARK-24902
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> Support for PVs and hostPath volumes has recently been added 
> (https://github.com/apache/spark/pull/21260/files) for Spark on K8s. 
> We should have some integration tests based on local storage. 
> It is easy to add PVs to minikube and attach them to pods.
> We could target a known dir like /tmp for the PV path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26586) Streaming queries should have isolated SparkSessions and confs

2019-01-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26586:


Assignee: Apache Spark

> Streaming queries should have isolated SparkSessions and confs
> --
>
> Key: SPARK-26586
> URL: https://issues.apache.org/jira/browse/SPARK-26586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Mukul Murthy
>Assignee: Apache Spark
>Priority: Major
>
> When a stream is started, the stream's config is supposed to be frozen and 
> all batches run with the config at start time. However, due to a race 
> condition in creating streams, updating a conf value in the active spark 
> session immediately after starting a stream can lead to the stream getting 
> that updated value.
>  
> The problem is that when StreamingQueryManager creates a MicrobatchExecution 
> (or ContinuousExecution), it passes in the shared spark session, and the 
> spark session isn't cloned until StreamExecution.start() is called. 
> DataStreamWriter.start() should not return until the SparkSession is cloned.
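
For illustration, a minimal sketch of the race described above, assuming a trivial 
rate-to-console query; the conf key is just an example of a value one might change 
right after start:

{code:scala}
// Sketch only: start a streaming query on the shared session, then immediately
// update a conf on that same session.
val query = spark.readStream
  .format("rate")
  .load()
  .writeStream
  .format("console")
  .start()

// The stream is supposed to keep the frozen value it saw at start time, but
// because the session is only cloned inside StreamExecution.start(), this
// update can race with query startup and leak into the running stream.
spark.conf.set("spark.sql.shuffle.partitions", "1")

query.stop()
{code}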



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26586) Streaming queries should have isolated SparkSessions and confs

2019-01-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26586:


Assignee: (was: Apache Spark)

> Streaming queries should have isolated SparkSessions and confs
> --
>
> Key: SPARK-26586
> URL: https://issues.apache.org/jira/browse/SPARK-26586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Mukul Murthy
>Priority: Major
>
> When a stream is started, the stream's config is supposed to be frozen and 
> all batches run with the config at start time. However, due to a race 
> condition in creating streams, updating a conf value in the active spark 
> session immediately after starting a stream can lead to the stream getting 
> that updated value.
>  
> The problem is that when StreamingQueryManager creates a MicrobatchExecution 
> (or ContinuousExecution), it passes in the shared spark session, and the 
> spark session isn't cloned until StreamExecution.start() is called. 
> DataStreamWriter.start() should not return until the SparkSession is cloned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26404) set spark.pyspark.python or PYSPARK_PYTHON doesn't work in k8s client-cluster mode.

2019-01-10 Thread Nihar Sheth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739760#comment-16739760
 ] 

Nihar Sheth commented on SPARK-26404:
-

Hi, I think I ran into a similar issue. How are you calling Spark? Via 
spark-submit, or using pyspark.sql.SparkSession, or some other method?

> set spark.pyspark.python or PYSPARK_PYTHON doesn't work in k8s client-cluster 
> mode.
> ---
>
> Key: SPARK-26404
> URL: https://issues.apache.org/jira/browse/SPARK-26404
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Dongqing  Liu
>Priority: Major
>
> Neither
>    conf.set("spark.executorEnv.PYSPARK_PYTHON", "/opt/pythonenvs/bin/python")
> nor 
>   conf.set("spark.pyspark.python", "/opt/pythonenvs/bin/python") 
> works. 
> Looks like the executor always picks python from PATH.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26593) Use Proleptic Gregorian calendar in casting UTF8String to date/timestamp types

2019-01-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26593:


Assignee: (was: Apache Spark)

> Use Proleptic Gregorian calendar in casting UTF8String to date/timestamp types
> --
>
> Key: SPARK-26593
> URL: https://issues.apache.org/jira/browse/SPARK-26593
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The current implementation of casting UTF8String to DateType/TimestampType uses the 
> hybrid calendar (Gregorian + Julian). This ticket aims to unify the conversion of 
> textual date/timestamp representations to DateType/TimestampType and to use the 
> Proleptic Gregorian calendar. More precisely, stringToTimestamp and stringToDate 
> need to be ported to the java.time API of Java 8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26593) Use Proleptic Gregorian calendar in casting UTF8String to date/timestamp types

2019-01-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26593:


Assignee: Apache Spark

> Use Proleptic Gregorian calendar in casting UTF8String to date/timestamp types
> --
>
> Key: SPARK-26593
> URL: https://issues.apache.org/jira/browse/SPARK-26593
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The current implementation of casting UTF8String to DateType/TimestampType uses the 
> hybrid calendar (Gregorian + Julian). This ticket aims to unify the conversion of 
> textual date/timestamp representations to DateType/TimestampType and to use the 
> Proleptic Gregorian calendar. More precisely, stringToTimestamp and stringToDate 
> need to be ported to the java.time API of Java 8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26593) Use Proleptic Gregorian calendar in casting UTF8String to date/timestamp types

2019-01-10 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26593:
--

 Summary: Use Proleptic Gregorian calendar in casting UTF8String to 
date/timestamp types
 Key: SPARK-26593
 URL: https://issues.apache.org/jira/browse/SPARK-26593
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


The current implementation of casting UTF8String to DateType/TimestampType uses the 
hybrid calendar (Gregorian + Julian). This ticket aims to unify the conversion of 
textual date/timestamp representations to DateType/TimestampType and to use the 
Proleptic Gregorian calendar. More precisely, stringToTimestamp and stringToDate 
need to be ported to the java.time API of Java 8.
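
For orientation, a small sketch of the kind of java.time-based parsing this port is 
aiming at; the formats and value names below are illustrative only and do not 
reproduce the exact semantics of stringToDate/stringToTimestamp:

{code:scala}
import java.time.{LocalDate, LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

// java.time uses the ISO chronology (Proleptic Gregorian calendar), so dates
// before the 1582 Gregorian cutover are not re-interpreted on a Julian calendar.
val d: LocalDate = LocalDate.parse("1000-01-01")
val epochDay: Long = d.toEpochDay  // days since 1970-01-01, the unit behind DateType

val ts: LocalDateTime = LocalDateTime.parse(
  "2019-01-10 12:34:56", DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"))
// Microseconds since the epoch, the unit behind TimestampType (UTC assumed here).
val micros: Long = ts.toInstant(ZoneOffset.UTC).toEpochMilli * 1000L
{code}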



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26592) Kafka delegation token doesn't support proxy user

2019-01-10 Thread Gabor Somogyi (JIRA)
Gabor Somogyi created SPARK-26592:
-

 Summary: Kafka delegation token doesn't support proxy user
 Key: SPARK-26592
 URL: https://issues.apache.org/jira/browse/SPARK-26592
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Gabor Somogyi


Kafka does not yet support obtaining delegation tokens with a proxy user. The feature 
has to be turned off until https://issues.apache.org/jira/browse/KAFKA-6945 is 
implemented.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26565) modify dev/create-release/release-build.sh to let jenkins build packages w/o publishing

2019-01-10 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739684#comment-16739684
 ] 

shane knapp commented on SPARK-26565:
-

ok, the PR (https://github.com/apache/spark/pull/23492) should be g2g, but one 
conversation i had w/[~felixcheung] was if we wanted to provide nightly, 
unsigned packages for people to consume (at their own risk).

i'm ok w/this if it's what we want, but:
1) how useful can unsigned packages be?
2) anyone should be able to build their own dists locally

thoughts?

> modify dev/create-release/release-build.sh to let jenkins build packages w/o 
> publishing
> ---
>
> Key: SPARK-26565
> URL: https://issues.apache.org/jira/browse/SPARK-26565
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: fine.png, no-idea.jpg
>
>
> about a year+ ago, we stopped publishing releases directly from jenkins...
> this means that the spark-\{branch}-packaging builds are failing due to gpg 
> signing failures, and i would like to update these builds to *just* perform 
> packaging.
> example:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]
> i propose to change dev/create-release/release-build.sh...
> when the script is called w/the 'package' option, add an {{if}} statement to 
> skip the following sections when run on jenkins:
> 1) gpg signing of the source tarball (lines 184-187)
> 2) gpg signing of the sparkR dist (lines 243-248)
> 3) gpg signing of the python dist (lines 256-261)
> 4) gpg signing of the regular binary dist (lines 264-271)
> 5) the svn push of the signed dists (lines 317-332)
>  
> -another, and probably much better option, is to nuke the 
> spark-\{branch}-packaging builds and create new ones that just build things 
> w/o touching this incredible fragile shell scripting nightmare.-



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26592) Kafka delegation token doesn't support proxy user

2019-01-10 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739681#comment-16739681
 ] 

Gabor Somogyi commented on SPARK-26592:
---

Here is the KIP: 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-373%3A+Allow+users+to+create+delegation+tokens+for+other+users

> Kafka delegation token doesn't support proxy user
> -
>
> Key: SPARK-26592
> URL: https://issues.apache.org/jira/browse/SPARK-26592
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Kafka does not yet support obtaining delegation tokens with a proxy user. It has 
> to be turned off until https://issues.apache.org/jira/browse/KAFKA-6945 is 
> implemented.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache

2019-01-10 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739679#comment-16739679
 ] 

Gabor Somogyi commented on SPARK-26385:
---

[~stud3nt] are you guys using dynamic allocation? We've seen similar problems 
when that was enabled.

> YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in 
> cache
> ---
>
> Key: SPARK-26385
> URL: https://issues.apache.org/jira/browse/SPARK-26385
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Hadoop 2.6.0, Spark 2.4.0
>Reporter: T M
>Priority: Major
>
>  
> Hello,
>  
> I have a Spark Structured Streaming job running on YARN (Hadoop 2.6.0, 
> Spark 2.4.0). After 25-26 hours, my job stops working with the following error:
> {code:java}
> 2018-12-16 22:35:17 ERROR 
> org.apache.spark.internal.Logging$class.logError(Logging.scala:91): Query 
> TestQuery[id = a61ce197-1d1b-4e82-a7af-60162953488b, runId = 
> a56878cf-dfc7-4f6a-ad48-02cf738ccc2f] terminated with error 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for REMOVED: HDFS_DELEGATION_TOKEN owner=REMOVED, renewer=yarn, 
> realUser=, issueDate=1544903057122, maxDate=1545507857122, 
> sequenceNumber=10314, masterKeyId=344) can't be found in cache at 
> org.apache.hadoop.ipc.Client.call(Client.java:1470) at 
> org.apache.hadoop.ipc.Client.call(Client.java:1401) at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>  at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source) at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752)
>  at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>  at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source) at 
> org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1977) at 
> org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:133) at 
> org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1120) at 
> org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1116) at 
> org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at 
> org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1116) at 
> org.apache.hadoop.fs.FileContext$Util.exists(FileContext.java:1581) at 
> org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.exists(CheckpointFileManager.scala:326)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:142)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:544)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:554)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> 

[jira] [Updated] (SPARK-26592) Kafka delegation token doesn't support proxy user

2019-01-10 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-26592:
--
Description: Kafka does not yet support obtaining delegation tokens with a proxy 
user. The feature has to be turned off until 
https://issues.apache.org/jira/browse/KAFKA-6945 is implemented.  (was: Kafka is 
not yet support to obtain delegation token with proxy user. It has to be turned 
off until https://issues.apache.org/jira/browse/KAFKA-6945 not implemented.)

> Kafka delegation token doesn't support proxy user
> -
>
> Key: SPARK-26592
> URL: https://issues.apache.org/jira/browse/SPARK-26592
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Kafka does not yet support obtaining delegation tokens with a proxy user. It has 
> to be turned off until https://issues.apache.org/jira/browse/KAFKA-6945 is 
> implemented.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26592) Kafka delegation token doesn't support proxy user

2019-01-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26592:


Assignee: (was: Apache Spark)

> Kafka delegation token doesn't support proxy user
> -
>
> Key: SPARK-26592
> URL: https://issues.apache.org/jira/browse/SPARK-26592
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Kafka does not yet support obtaining delegation tokens with a proxy user. It has 
> to be turned off until https://issues.apache.org/jira/browse/KAFKA-6945 is 
> implemented.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26592) Kafka delegation token doesn't support proxy user

2019-01-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26592:


Assignee: Apache Spark

> Kafka delegation token doesn't support proxy user
> -
>
> Key: SPARK-26592
> URL: https://issues.apache.org/jira/browse/SPARK-26592
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Major
>
> Kafka does not yet support obtaining delegation tokens with a proxy user. It has 
> to be turned off until https://issues.apache.org/jira/browse/KAFKA-6945 is 
> implemented.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26564) Fix wrong assertions and error messages for parameter checking

2019-01-10 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26564:
-

Assignee: Kengo Seki

> Fix wrong assertions and error messages for parameter checking
> --
>
> Key: SPARK-26564
> URL: https://issues.apache.org/jira/browse/SPARK-26564
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Core, SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Kengo Seki
>Assignee: Kengo Seki
>Priority: Minor
>  Labels: starter
>
> I mistakenly set spark.executor.heartbeatInterval to the same value as 
> spark.network.timeout and got the following error:
> {code}
> java.lang.IllegalArgumentException: requirement failed: The value of 
> spark.network.timeout=120s must be no less than the value of 
> spark.executor.heartbeatInterval=120s.
> {code}
> But that wording can be read as allowing the two values to be equal. 
> "Greater than" is more precise than "no less than".
> 
> In addition, the following assertions are inconsistent with their messages, 
> and in these cases the messages are the correct ones.
> {code:title=mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala}
>  91   require(maxIter >= 0, s"maxIter must be a positive integer: $maxIter")
> {code}
> {code:title=sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala}
> 416   require(capacity < 51200, "Cannot broadcast more than 512 
> millions rows")
> {code}
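
For illustration, a sketch of what the corrected checks could look like (the 
variable names are placeholders, and the actual patch may differ):
{code:scala}
// Strict inequality, with a message that matches it ("greater than", not "no less than").
require(networkTimeoutMs > executorHeartbeatIntervalMs,
  s"The value of spark.network.timeout=${networkTimeoutMs}ms must be greater than " +
    s"the value of spark.executor.heartbeatInterval=${executorHeartbeatIntervalMs}ms.")

// WeightedLeastSquares: the message says "positive", so zero must be rejected as well.
require(maxIter > 0, s"maxIter must be a positive integer: $maxIter")
{code}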



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26564) Fix wrong assertions and error messages for parameter checking

2019-01-10 Thread Kengo Seki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kengo Seki updated SPARK-26564:
---
Component/s: SQL

> Fix wrong assertions and error messages for parameter checking
> --
>
> Key: SPARK-26564
> URL: https://issues.apache.org/jira/browse/SPARK-26564
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Core, SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Kengo Seki
>Priority: Minor
>  Labels: starter
>
> I mistakenly set spark.executor.heartbeatInterval to the same value as 
> spark.network.timeout and got the following error:
> {code}
> java.lang.IllegalArgumentException: requirement failed: The value of 
> spark.network.timeout=120s must be no less than the value of 
> spark.executor.heartbeatInterval=120s.
> {code}
> But that wording can be read as allowing the two values to be equal. 
> "Greater than" is more precise than "no less than".
> 
> In addition, the following assertions are inconsistent with their messages, 
> and in these cases the messages are the correct ones.
> {code:title=mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala}
>  91   require(maxIter >= 0, s"maxIter must be a positive integer: $maxIter")
> {code}
> {code:title=sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala}
> 416   require(capacity < 51200, "Cannot broadcast more than 512 
> millions rows")
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26591) illegal hardware instruction

2019-01-10 Thread Elchin (JIRA)
Elchin created SPARK-26591:
--

 Summary: illegal hardware instruction
 Key: SPARK-26591
 URL: https://issues.apache.org/jira/browse/SPARK-26591
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.0
 Environment: Python 3.6.7

Pyspark 2.4.0

OS:
{noformat}
Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 
x86_64 x86_64 GNU/Linux
{noformat}
CPU:
{code:java}
Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB
clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz
{code}
Reporter: Elchin


When I try to use pandas_udf following the examples in the 
[documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]:
{code:java}
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import IntegerType, StringType

slen = pandas_udf(lambda s: s.str.len(), IntegerType())  # it crashes here
{code}
I get the error:
{code:java}
[1]    17969 illegal hardware instruction (core dumped)  python3
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26590) make fetch-block-to-disk backward compatible

2019-01-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26590:


Assignee: Apache Spark  (was: Wenchen Fan)

> make fetch-block-to-disk backward compatible
> 
>
> Key: SPARK-26590
> URL: https://issues.apache.org/jira/browse/SPARK-26590
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26590) make fetch-block-to-disk backward compatible

2019-01-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26590:


Assignee: Wenchen Fan  (was: Apache Spark)

> make fetch-block-to-disk backward compatible
> 
>
> Key: SPARK-26590
> URL: https://issues.apache.org/jira/browse/SPARK-26590
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26590) make fetch-block-to-disk backward compatible

2019-01-10 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-26590:
---

 Summary: make fetch-block-to-disk backward compatible
 Key: SPARK-26590
 URL: https://issues.apache.org/jira/browse/SPARK-26590
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26584) Remove `spark.sql.orc.copyBatchToSpark` internal configuration

2019-01-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26584.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/23503

> Remove `spark.sql.orc.copyBatchToSpark` internal configuration
> --
>
> Key: SPARK-26584
> URL: https://issues.apache.org/jira/browse/SPARK-26584
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>
> This issue aims to remove internal ORC configuration to simplify the code 
> path for Spark 3.0.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Execution Mode in Apache Spark

2019-01-10 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739565#comment-16739565
 ] 

Xiangrui Meng commented on SPARK-24374:
---

[~luzengxiang] Did you use the Python API or the Scala API for barrier mode? In 
PySpark, the Python worker process launches the external user process 
(MPI/TensorFlow). However, the Python worker process might receive SIGKILL 
directly from the Python daemon, so we cannot rely on the task closure to clean 
up (unless we change how Spark terminates Python workers). A workaround is to 
watch the parent Python worker process from inside the MPI job and have it 
terminate itself if the parent is gone. If you are on Linux, you might use 
PR_SET_PDEATHSIG from prctl. 

> SPIP: Support Barrier Execution Mode in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}
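
As a concrete reference for the scheduling model described in the quote, a minimal 
sketch of the barrier execution API that shipped with this SPIP in Spark 2.4 
(assumes an existing SparkContext {{sc}}):
{code:scala}
import org.apache.spark.BarrierTaskContext

// All tasks of a barrier stage are launched together; barrier() blocks until
// every task in the stage has reached it, similar to an MPI barrier.
sc.parallelize(1 to 100, 4)
  .barrier()
  .mapPartitions { iter =>
    val ctx = BarrierTaskContext.get()
    ctx.barrier()
    // Task infos (e.g. addresses) can be used to bootstrap an MPI/TensorFlow ring.
    val hosts = ctx.getTaskInfos().map(_.address)
    iter
  }
  .count()
{code}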



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26564) Fix misleading error message about spark.network.timeout and spark.executor.heartbeatInterval

2019-01-10 Thread Kengo Seki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kengo Seki updated SPARK-26564:
---
   Priority: Minor  (was: Trivial)
Description: 
I mistakenly set spark.executor.heartbeatInterval to the same value as 
spark.network.timeout and got the following error:

{code}
java.lang.IllegalArgumentException: requirement failed: The value of 
spark.network.timeout=120s must be no less than the value of 
spark.executor.heartbeatInterval=120s.
{code}

But that wording can be read as allowing the two values to be equal. "Greater 
than" is more precise than "no less than".


In addition, the following assertions are inconsistent with their messages, and 
in these cases the messages are the correct ones.

{code:title=mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala}
 91   require(maxIter >= 0, s"maxIter must be a positive integer: $maxIter")
{code}

{code:title=sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala}
416   require(capacity < 51200, "Cannot broadcast more than 512 
millions rows")
{code}

  was:
I mistakenly set an equivalent value with spark.network.timeout to 
spark.executor.heartbeatInterval and got the following error:

{code}
java.lang.IllegalArgumentException: requirement failed: The value of 
spark.network.timeout=120s must be no less than the value of 
spark.executor.heartbeatInterval=120s.
{code}

But it can be read as they could be equal. "Greater than" is more precise than 
"no less than".

Component/s: MLlib

> Fix misleading error message about spark.network.timeout and 
> spark.executor.heartbeatInterval
> -
>
> Key: SPARK-26564
> URL: https://issues.apache.org/jira/browse/SPARK-26564
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Kengo Seki
>Priority: Minor
>  Labels: starter
>
> I mistakenly set spark.executor.heartbeatInterval to the same value as 
> spark.network.timeout and got the following error:
> {code}
> java.lang.IllegalArgumentException: requirement failed: The value of 
> spark.network.timeout=120s must be no less than the value of 
> spark.executor.heartbeatInterval=120s.
> {code}
> But that wording can be read as allowing the two values to be equal. 
> "Greater than" is more precise than "no less than".
> 
> In addition, the following assertions are inconsistent with their messages, 
> and in these cases the messages are the correct ones.
> {code:title=mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala}
>  91   require(maxIter >= 0, s"maxIter must be a positive integer: $maxIter")
> {code}
> {code:title=sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala}
> 416   require(capacity < 51200, "Cannot broadcast more than 512 
> millions rows")
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26564) Fix wrong assertions and error messages for parameter checking

2019-01-10 Thread Kengo Seki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kengo Seki updated SPARK-26564:
---
Summary: Fix wrong assertions and error messages for parameter checking  
(was: Fix misleading error message about spark.network.timeout and 
spark.executor.heartbeatInterval)

> Fix wrong assertions and error messages for parameter checking
> --
>
> Key: SPARK-26564
> URL: https://issues.apache.org/jira/browse/SPARK-26564
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Kengo Seki
>Priority: Minor
>  Labels: starter
>
> I mistakenly set spark.executor.heartbeatInterval to the same value as 
> spark.network.timeout and got the following error:
> {code}
> java.lang.IllegalArgumentException: requirement failed: The value of 
> spark.network.timeout=120s must be no less than the value of 
> spark.executor.heartbeatInterval=120s.
> {code}
> But that wording can be read as allowing the two values to be equal. 
> "Greater than" is more precise than "no less than".
> 
> In addition, the following assertions are inconsistent with their messages, 
> and in these cases the messages are the correct ones.
> {code:title=mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala}
>  91   require(maxIter >= 0, s"maxIter must be a positive integer: $maxIter")
> {code}
> {code:title=sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala}
> 416   require(capacity < 51200, "Cannot broadcast more than 512 
> millions rows")
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26589) proper `median` method for spark dataframe

2019-01-10 Thread Jan Gorecki (JIRA)
Jan Gorecki created SPARK-26589:
---

 Summary: proper `median` method for spark dataframe
 Key: SPARK-26589
 URL: https://issues.apache.org/jira/browse/SPARK-26589
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Jan Gorecki


I found multiple tickets asking for a median function to be implemented in Spark. 
Most of those tickets are linked to "SPARK-6761 Approximate quantile" as 
duplicates of it. The thing is that an approximate quantile is only a workaround 
for the lack of a median function. I am therefore filing this feature request for 
a proper, exact median function, not an approximation. I am aware of the 
difficulties a distributed environment causes when computing a median; 
nevertheless, I don't think those difficulties are a good enough reason to leave 
a `median` function out of Spark's scope. I am not asking for an efficient 
median, only an exact one.
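
For readers landing here, a short sketch of how an exact median can already be 
computed today, at a cost (assumes a DataFrame {{df}} with a numeric column {{x}} 
and a SparkSession {{spark}}):
{code:scala}
// approxQuantile with relativeError = 0.0 is documented to compute exact quantiles.
val Array(median) = df.stat.approxQuantile("x", Array(0.5), 0.0)

// The SQL `percentile` aggregate (unlike `percentile_approx`) is also exact.
df.createOrReplaceTempView("t")
spark.sql("SELECT percentile(x, 0.5) AS median FROM t").show()
{code}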



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26539) Remove spark.memory.useLegacyMode memory settings + StaticMemoryManager

2019-01-10 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26539.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23457
[https://github.com/apache/spark/pull/23457]

> Remove spark.memory.useLegacyMode memory settings + StaticMemoryManager
> ---
>
> Key: SPARK-26539
> URL: https://issues.apache.org/jira/browse/SPARK-26539
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> The old memory manager was superseded by UnifiedMemoryManager in Spark 1.6, 
> and has been the default since. I think we could remove it to simplify the 
> code a little, but more importantly to reduce the variety of memory settings 
> users are confronted with when using Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26588) Idle executor should properly be killed when no job is submitted

2019-01-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26588:


Assignee: (was: Apache Spark)

> Idle executor should properly be killed when no job is submitted
> 
>
> Key: SPARK-26588
> URL: https://issues.apache.org/jira/browse/SPARK-26588
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.0
>Reporter: Qingxin Wu
>Priority: Major
>
> I enable the dynamic allocation feature with *spark-shell* and do not submit 
> any task. After *spark.dynamicAllocation.executorIdleTimeout* seconds (default 
> 60s), there is still one active executor, which is abnormal. All idle 
> executors should have timed out and been removed (default 
> *spark.dynamicAllocation.minExecutors*=0). The spark-shell command is shown below:
> {code:java}
> spark-shell --master=yarn --conf spark.ui.port=8040 --conf 
> spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.maxExecutors=8 --conf 
> spark.dynamicAllocation.initialExecutors=4 --conf 
> spark.shuffle.service.enabled=true
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26588) Idle executor should properly be killed when no job is submitted

2019-01-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26588:


Assignee: Apache Spark

> Idle executor should properly be killed when no job is submitted
> 
>
> Key: SPARK-26588
> URL: https://issues.apache.org/jira/browse/SPARK-26588
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.0
>Reporter: Qingxin Wu
>Assignee: Apache Spark
>Priority: Major
>
> I enable the dynamic allocation feature with *spark-shell* and do not submit 
> any task. After *spark.dynamicAllocation.executorIdleTimeout* seconds (default 
> 60s), there is still one active executor, which is abnormal. All idle 
> executors should have timed out and been removed (default 
> *spark.dynamicAllocation.minExecutors*=0). The spark-shell command is shown below:
> {code:java}
> spark-shell --master=yarn --conf spark.ui.port=8040 --conf 
> spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.maxExecutors=8 --conf 
> spark.dynamicAllocation.initialExecutors=4 --conf 
> spark.shuffle.service.enabled=true
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26588) Idle executor should properly be killed when no job is submitted

2019-01-10 Thread Qingxin Wu (JIRA)
Qingxin Wu created SPARK-26588:
--

 Summary: Idle executor should properly be killed when no job is 
submitted
 Key: SPARK-26588
 URL: https://issues.apache.org/jira/browse/SPARK-26588
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core
Affects Versions: 2.4.0
Reporter: Qingxin Wu


I enable the dynamic allocation feature with *spark-shell* and do not submit any 
task. After *spark.dynamicAllocation.executorIdleTimeout* seconds (default 60s), 
there is still one active executor, which is abnormal. All idle executors should 
have timed out and been removed (default 
*spark.dynamicAllocation.minExecutors*=0). The spark-shell command is shown below:
{code:java}
spark-shell --master=yarn --conf spark.ui.port=8040 --conf 
spark.dynamicAllocation.enabled=true --conf 
spark.dynamicAllocation.maxExecutors=8 --conf 
spark.dynamicAllocation.initialExecutors=4 --conf 
spark.shuffle.service.enabled=true
{code}
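
For reference, the same setup expressed programmatically (illustrative only; the 
values mirror the report, and the commented entries are the defaults it relies on):
{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.initialExecutors", "4")
  .set("spark.dynamicAllocation.maxExecutors", "8")
  .set("spark.shuffle.service.enabled", "true")
// Defaults relied on in the report:
//   spark.dynamicAllocation.minExecutors = 0
//   spark.dynamicAllocation.executorIdleTimeout = 60s
{code}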



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21351) Update nullability based on children's output in optimized logical plan

2019-01-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21351:


Assignee: Takeshi Yamamuro  (was: Apache Spark)

> Update nullability based on children's output in optimized logical plan
> ---
>
> Key: SPARK-21351
> URL: https://issues.apache.org/jira/browse/SPARK-21351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
>
> In master, optimized plans do not respect the nullability that `Filter` 
> might change when it contains `IsNotNull`.
> This generates unnecessary code for NULL checks. For example:
> {code}
> scala> val df = Seq((Some(1), Some(2))).toDF("a", "b")
> scala> val bIsNotNull = df.where($"b" =!= 2).select($"b")
> scala> val targetQuery = bIsNotNull.distinct
> scala> targetQuery.queryExecution.optimizedPlan.output(0).nullable
> res5: Boolean = true
> scala> targetQuery.debugCodegen
> Found 2 WholeStageCodegen subtrees.
> == Subtree 1 / 2 ==
> *HashAggregate(keys=[b#19], functions=[], output=[b#19])
> +- Exchange hashpartitioning(b#19, 200)
>+- *HashAggregate(keys=[b#19], functions=[], output=[b#19])
>   +- *Project [_2#16 AS b#19]
>  +- *Filter isnotnull(_2#16)
> +- LocalTableScan [_1#15, _2#16]
> Generated code:
> ...
> /* 124 */   protected void processNext() throws java.io.IOException {
> ...
> /* 132 */ // output the result
> /* 133 */
> /* 134 */ while (agg_mapIter.next()) {
> /* 135 */   wholestagecodegen_numOutputRows.add(1);
> /* 136 */   UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey();
> /* 137 */   UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue();
> /* 138 */
> /* 139 */   boolean agg_isNull4 = agg_aggKey.isNullAt(0);
> /* 140 */   int agg_value4 = agg_isNull4 ? -1 : (agg_aggKey.getInt(0));
> /* 141 */   agg_rowWriter1.zeroOutNullBytes();
> /* 142 */
> // We don't need this NULL check because NULL is filtered out 
> in `$"b" =!=2`
> /* 143 */   if (agg_isNull4) {
> /* 144 */ agg_rowWriter1.setNullAt(0);
> /* 145 */   } else {
> /* 146 */ agg_rowWriter1.write(0, agg_value4);
> /* 147 */   }
> /* 148 */   append(agg_result1);
> /* 149 */
> /* 150 */   if (shouldStop()) return;
> /* 151 */ }
> /* 152 */
> /* 153 */ agg_mapIter.close();
> /* 154 */ if (agg_sorter == null) {
> /* 155 */   agg_hashMap.free();
> /* 156 */ }
> /* 157 */   }
> /* 158 */
> /* 159 */ }
> {code}
> In line 143, we don't need this NULL check because NULL is filtered out 
> by `$"b" =!= 2`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21351) Update nullability based on children's output in optimized logical plan

2019-01-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21351:


Assignee: Apache Spark  (was: Takeshi Yamamuro)

> Update nullability based on children's output in optimized logical plan
> ---
>
> Key: SPARK-21351
> URL: https://issues.apache.org/jira/browse/SPARK-21351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> In master, optimized plans do not respect the nullability that `Filter` 
> might change when it contains `IsNotNull`.
> This generates unnecessary code for NULL checks. For example:
> {code}
> scala> val df = Seq((Some(1), Some(2))).toDF("a", "b")
> scala> val bIsNotNull = df.where($"b" =!= 2).select($"b")
> scala> val targetQuery = bIsNotNull.distinct
> scala> targetQuery.queryExecution.optimizedPlan.output(0).nullable
> res5: Boolean = true
> scala> targetQuery.debugCodegen
> Found 2 WholeStageCodegen subtrees.
> == Subtree 1 / 2 ==
> *HashAggregate(keys=[b#19], functions=[], output=[b#19])
> +- Exchange hashpartitioning(b#19, 200)
>+- *HashAggregate(keys=[b#19], functions=[], output=[b#19])
>   +- *Project [_2#16 AS b#19]
>  +- *Filter isnotnull(_2#16)
> +- LocalTableScan [_1#15, _2#16]
> Generated code:
> ...
> /* 124 */   protected void processNext() throws java.io.IOException {
> ...
> /* 132 */ // output the result
> /* 133 */
> /* 134 */ while (agg_mapIter.next()) {
> /* 135 */   wholestagecodegen_numOutputRows.add(1);
> /* 136 */   UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey();
> /* 137 */   UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue();
> /* 138 */
> /* 139 */   boolean agg_isNull4 = agg_aggKey.isNullAt(0);
> /* 140 */   int agg_value4 = agg_isNull4 ? -1 : (agg_aggKey.getInt(0));
> /* 141 */   agg_rowWriter1.zeroOutNullBytes();
> /* 142 */
> // We don't need this NULL check because NULL is filtered out 
> in `$"b" =!=2`
> /* 143 */   if (agg_isNull4) {
> /* 144 */ agg_rowWriter1.setNullAt(0);
> /* 145 */   } else {
> /* 146 */ agg_rowWriter1.write(0, agg_value4);
> /* 147 */   }
> /* 148 */   append(agg_result1);
> /* 149 */
> /* 150 */   if (shouldStop()) return;
> /* 151 */ }
> /* 152 */
> /* 153 */ agg_mapIter.close();
> /* 154 */ if (agg_sorter == null) {
> /* 155 */   agg_hashMap.free();
> /* 156 */ }
> /* 157 */   }
> /* 158 */
> /* 159 */ }
> {code}
> In line 143, we don't need this NULL check because NULL is filtered out 
> by `$"b" =!= 2`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26587) Deadlock between SparkUI thread and Driver thread

2019-01-10 Thread Vitaliy Savkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitaliy Savkin updated SPARK-26587:
---
Description: 
About once a month (out of ~1000 runs), one of our Spark applications freezes at 
startup. jstack reports a deadlock. Please see locks 
0x802c00c0 and 0x8271bb98 in the stack traces below.
{noformat}
"Driver":
at java.lang.Package.getSystemPackage(Package.java:540)
- waiting to lock <0x802c00c0> (a java.util.HashMap)
at java.lang.ClassLoader.getPackage(ClassLoader.java:1625)
at java.net.URLClassLoader.getAndVerifyPackage(URLClassLoader.java:394)
at java.net.URLClassLoader.definePackageInternal(URLClassLoader.java:420)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:452)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
- locked <0x82789598> (a 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:221)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:210)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
- locked <0x82789540> (a 
org.apache.spark.sql.internal.NonClosableMutableURLClassLoader)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:370)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at javax.xml.parsers.FactoryFinder$1.run(FactoryFinder.java:294)
at java.security.AccessController.doPrivileged(Native Method)
at javax.xml.parsers.FactoryFinder.findServiceProvider(FactoryFinder.java:289)
at javax.xml.parsers.FactoryFinder.find(FactoryFinder.java:267)
at 
javax.xml.parsers.DocumentBuilderFactory.newInstance(DocumentBuilderFactory.java:120)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2516)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2492)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2405)
- locked <0x8271bb98> (a org.apache.hadoop.conf.Configuration)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:981)
at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1031)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2189)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2702)
at 
org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74)
at java.net.URL.getURLStreamHandler(URL.java:1142)
at java.net.URL.<init>(URL.java:599)
at java.net.URL.<init>(URL.java:490)
at java.net.URL.<init>(URL.java:439)
at java.net.JarURLConnection.parseSpecs(JarURLConnection.java:175)
at java.net.JarURLConnection.<init>(JarURLConnection.java:158)
at sun.net.www.protocol.jar.JarURLConnection.<init>(JarURLConnection.java:81)
at sun.net.www.protocol.jar.Handler.openConnection(Handler.java:41)
at java.net.URL.openConnection(URL.java:979)
at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:238)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:216)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:210)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
- locked <0x82789540> (a 
org.apache.spark.sql.internal.NonClosableMutableURLClassLoader)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:262)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:362)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:266)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
- locked <0x8302a120> (a org.apache.spark.sql.hive.HiveExternalCatalog)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:194)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
- 

[jira] [Issue Comment Deleted] (SPARK-21351) Update nullability based on children's output in optimized logical plan

2019-01-10 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-21351:
-
Comment: was deleted

(was: This fix has been reverted by https://github.com/apache/spark/pull/23390, 
so I will revisit later.)

> Update nullability based on children's output in optimized logical plan
> ---
>
> Key: SPARK-21351
> URL: https://issues.apache.org/jira/browse/SPARK-21351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
>
> In master, optimized plans do not respect the nullability that `Filter` 
> might change when it contains `IsNotNull`.
> This generates unnecessary code for NULL checks. For example:
> {code}
> scala> val df = Seq((Some(1), Some(2))).toDF("a", "b")
> scala> val bIsNotNull = df.where($"b" =!= 2).select($"b")
> scala> val targetQuery = bIsNotNull.distinct
> scala> targetQuery.queryExecution.optimizedPlan.output(0).nullable
> res5: Boolean = true
> scala> targetQuery.debugCodegen
> Found 2 WholeStageCodegen subtrees.
> == Subtree 1 / 2 ==
> *HashAggregate(keys=[b#19], functions=[], output=[b#19])
> +- Exchange hashpartitioning(b#19, 200)
>+- *HashAggregate(keys=[b#19], functions=[], output=[b#19])
>   +- *Project [_2#16 AS b#19]
>  +- *Filter isnotnull(_2#16)
> +- LocalTableScan [_1#15, _2#16]
> Generated code:
> ...
> /* 124 */   protected void processNext() throws java.io.IOException {
> ...
> /* 132 */ // output the result
> /* 133 */
> /* 134 */ while (agg_mapIter.next()) {
> /* 135 */   wholestagecodegen_numOutputRows.add(1);
> /* 136 */   UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey();
> /* 137 */   UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue();
> /* 138 */
> /* 139 */   boolean agg_isNull4 = agg_aggKey.isNullAt(0);
> /* 140 */   int agg_value4 = agg_isNull4 ? -1 : (agg_aggKey.getInt(0));
> /* 141 */   agg_rowWriter1.zeroOutNullBytes();
> /* 142 */
> // We don't need this NULL check because NULL is filtered out 
> in `$"b" =!=2`
> /* 143 */   if (agg_isNull4) {
> /* 144 */ agg_rowWriter1.setNullAt(0);
> /* 145 */   } else {
> /* 146 */ agg_rowWriter1.write(0, agg_value4);
> /* 147 */   }
> /* 148 */   append(agg_result1);
> /* 149 */
> /* 150 */   if (shouldStop()) return;
> /* 151 */ }
> /* 152 */
> /* 153 */ agg_mapIter.close();
> /* 154 */ if (agg_sorter == null) {
> /* 155 */   agg_hashMap.free();
> /* 156 */ }
> /* 157 */   }
> /* 158 */
> /* 159 */ }
> {code}
> In line 143, we don't need this NULL check because NULL is filtered out 
> by `$"b" =!= 2`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21351) Update nullability based on children's output in optimized logical plan

2019-01-10 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739315#comment-16739315
 ] 

Takeshi Yamamuro commented on SPARK-21351:
--

This fix has been reverted by https://github.com/apache/spark/pull/23390, so I 
will revisit it later.

> Update nullability based on children's output in optimized logical plan
> ---
>
> Key: SPARK-21351
> URL: https://issues.apache.org/jira/browse/SPARK-21351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
>
> In master, optimized plans do not respect the nullability that `Filter` 
> might change when it contains `IsNotNull`.
> This generates unnecessary code for NULL checks. For example:
> {code}
> scala> val df = Seq((Some(1), Some(2))).toDF("a", "b")
> scala> val bIsNotNull = df.where($"b" =!= 2).select($"b")
> scala> val targetQuery = bIsNotNull.distinct
> scala> targetQuery.queryExecution.optimizedPlan.output(0).nullable
> res5: Boolean = true
> scala> targetQuery.debugCodegen
> Found 2 WholeStageCodegen subtrees.
> == Subtree 1 / 2 ==
> *HashAggregate(keys=[b#19], functions=[], output=[b#19])
> +- Exchange hashpartitioning(b#19, 200)
>+- *HashAggregate(keys=[b#19], functions=[], output=[b#19])
>   +- *Project [_2#16 AS b#19]
>  +- *Filter isnotnull(_2#16)
> +- LocalTableScan [_1#15, _2#16]
> Generated code:
> ...
> /* 124 */   protected void processNext() throws java.io.IOException {
> ...
> /* 132 */ // output the result
> /* 133 */
> /* 134 */ while (agg_mapIter.next()) {
> /* 135 */   wholestagecodegen_numOutputRows.add(1);
> /* 136 */   UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey();
> /* 137 */   UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue();
> /* 138 */
> /* 139 */   boolean agg_isNull4 = agg_aggKey.isNullAt(0);
> /* 140 */   int agg_value4 = agg_isNull4 ? -1 : (agg_aggKey.getInt(0));
> /* 141 */   agg_rowWriter1.zeroOutNullBytes();
> /* 142 */
> // We don't need this NULL check because NULL is filtered out 
> in `$"b" =!=2`
> /* 143 */   if (agg_isNull4) {
> /* 144 */ agg_rowWriter1.setNullAt(0);
> /* 145 */   } else {
> /* 146 */ agg_rowWriter1.write(0, agg_value4);
> /* 147 */   }
> /* 148 */   append(agg_result1);
> /* 149 */
> /* 150 */   if (shouldStop()) return;
> /* 151 */ }
> /* 152 */
> /* 153 */ agg_mapIter.close();
> /* 154 */ if (agg_sorter == null) {
> /* 155 */   agg_hashMap.free();
> /* 156 */ }
> /* 157 */   }
> /* 158 */
> /* 159 */ }
> {code}
> In line 143, we don't need this NULL check because NULL is filtered out 
> by `$"b" =!= 2`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21351) Update nullability based on children's output in optimized logical plan

2019-01-10 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-21351:
-
Affects Version/s: (was: 2.1.1)
   2.2.2

> Update nullability based on children's output in optimized logical plan
> ---
>
> Key: SPARK-21351
> URL: https://issues.apache.org/jira/browse/SPARK-21351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
>
> In master, optimized plans do not respect the nullability that `Filter` 
> might change when it contains `IsNotNull`.
> This generates unnecessary code for NULL checks. For example:
> {code}
> scala> val df = Seq((Some(1), Some(2))).toDF("a", "b")
> scala> val bIsNotNull = df.where($"b" =!= 2).select($"b")
> scala> val targetQuery = bIsNotNull.distinct
> scala> targetQuery.queryExecution.optimizedPlan.output(0).nullable
> res5: Boolean = true
> scala> targetQuery.debugCodegen
> Found 2 WholeStageCodegen subtrees.
> == Subtree 1 / 2 ==
> *HashAggregate(keys=[b#19], functions=[], output=[b#19])
> +- Exchange hashpartitioning(b#19, 200)
>+- *HashAggregate(keys=[b#19], functions=[], output=[b#19])
>   +- *Project [_2#16 AS b#19]
>  +- *Filter isnotnull(_2#16)
> +- LocalTableScan [_1#15, _2#16]
> Generated code:
> ...
> /* 124 */   protected void processNext() throws java.io.IOException {
> ...
> /* 132 */ // output the result
> /* 133 */
> /* 134 */ while (agg_mapIter.next()) {
> /* 135 */   wholestagecodegen_numOutputRows.add(1);
> /* 136 */   UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey();
> /* 137 */   UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue();
> /* 138 */
> /* 139 */   boolean agg_isNull4 = agg_aggKey.isNullAt(0);
> /* 140 */   int agg_value4 = agg_isNull4 ? -1 : (agg_aggKey.getInt(0));
> /* 141 */   agg_rowWriter1.zeroOutNullBytes();
> /* 142 */
> // We don't need this NULL check because NULL is filtered out 
> in `$"b" =!=2`
> /* 143 */   if (agg_isNull4) {
> /* 144 */ agg_rowWriter1.setNullAt(0);
> /* 145 */   } else {
> /* 146 */ agg_rowWriter1.write(0, agg_value4);
> /* 147 */   }
> /* 148 */   append(agg_result1);
> /* 149 */
> /* 150 */   if (shouldStop()) return;
> /* 151 */ }
> /* 152 */
> /* 153 */ agg_mapIter.close();
> /* 154 */ if (agg_sorter == null) {
> /* 155 */   agg_hashMap.free();
> /* 156 */ }
> /* 157 */   }
> /* 158 */
> /* 159 */ }
> {code}
> In line 143, we don't need this NULL check because NULL is filtered out 
> by `$"b" =!= 2`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26587) Deadlock between SparkUI thread and Driver thread

2019-01-10 Thread Vitaliy Savkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitaliy Savkin updated SPARK-26587:
---
Description: 
About once a month (out of ~1000 runs), one of our Spark applications freezes. 
jstack reports a deadlock. Please see locks 0x802c00c0 and 
0x8271bb98 in the stack traces below.
{noformat}
"Driver":
at java.lang.Package.getSystemPackage(Package.java:540)
- waiting to lock <0x802c00c0> (a java.util.HashMap)
at java.lang.ClassLoader.getPackage(ClassLoader.java:1625)
at java.net.URLClassLoader.getAndVerifyPackage(URLClassLoader.java:394)
at java.net.URLClassLoader.definePackageInternal(URLClassLoader.java:420)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:452)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
- locked <0x82789598> (a 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:221)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:210)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
- locked <0x82789540> (a 
org.apache.spark.sql.internal.NonClosableMutableURLClassLoader)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:370)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at javax.xml.parsers.FactoryFinder$1.run(FactoryFinder.java:294)
at java.security.AccessController.doPrivileged(Native Method)
at javax.xml.parsers.FactoryFinder.findServiceProvider(FactoryFinder.java:289)
at javax.xml.parsers.FactoryFinder.find(FactoryFinder.java:267)
at 
javax.xml.parsers.DocumentBuilderFactory.newInstance(DocumentBuilderFactory.java:120)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2516)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2492)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2405)
- locked <0x8271bb98> (a org.apache.hadoop.conf.Configuration)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:981)
at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1031)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2189)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2702)
at 
org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74)
at java.net.URL.getURLStreamHandler(URL.java:1142)
at java.net.URL.<init>(URL.java:599)
at java.net.URL.<init>(URL.java:490)
at java.net.URL.<init>(URL.java:439)
at java.net.JarURLConnection.parseSpecs(JarURLConnection.java:175)
at java.net.JarURLConnection.<init>(JarURLConnection.java:158)
at sun.net.www.protocol.jar.JarURLConnection.<init>(JarURLConnection.java:81)
at sun.net.www.protocol.jar.Handler.openConnection(Handler.java:41)
at java.net.URL.openConnection(URL.java:979)
at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:238)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:216)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:210)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
- locked <0x82789540> (a 
org.apache.spark.sql.internal.NonClosableMutableURLClassLoader)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:262)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:362)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:266)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
- locked <0x8302a120> (a org.apache.spark.sql.hive.HiveExternalCatalog)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:194)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
- locked 

[jira] [Updated] (SPARK-21351) Update nullability based on children's output in optimized logical plan

2019-01-10 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-21351:
-
Affects Version/s: 2.3.2

> Update nullability based on children's output in optimized logical plan
> ---
>
> Key: SPARK-21351
> URL: https://issues.apache.org/jira/browse/SPARK-21351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
>
> In master, optimized plans do not respect the nullability that `Filter` 
> might change when it contains `IsNotNull`.
> This generates unnecessary code for NULL checks. For example:
> {code}
> scala> val df = Seq((Some(1), Some(2))).toDF("a", "b")
> scala> val bIsNotNull = df.where($"b" =!= 2).select($"b")
> scala> val targetQuery = bIsNotNull.distinct
> scala> targetQuery.queryExecution.optimizedPlan.output(0).nullable
> res5: Boolean = true
> scala> targetQuery.debugCodegen
> Found 2 WholeStageCodegen subtrees.
> == Subtree 1 / 2 ==
> *HashAggregate(keys=[b#19], functions=[], output=[b#19])
> +- Exchange hashpartitioning(b#19, 200)
>+- *HashAggregate(keys=[b#19], functions=[], output=[b#19])
>   +- *Project [_2#16 AS b#19]
>  +- *Filter isnotnull(_2#16)
> +- LocalTableScan [_1#15, _2#16]
> Generated code:
> ...
> /* 124 */   protected void processNext() throws java.io.IOException {
> ...
> /* 132 */ // output the result
> /* 133 */
> /* 134 */ while (agg_mapIter.next()) {
> /* 135 */   wholestagecodegen_numOutputRows.add(1);
> /* 136 */   UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey();
> /* 137 */   UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue();
> /* 138 */
> /* 139 */   boolean agg_isNull4 = agg_aggKey.isNullAt(0);
> /* 140 */   int agg_value4 = agg_isNull4 ? -1 : (agg_aggKey.getInt(0));
> /* 141 */   agg_rowWriter1.zeroOutNullBytes();
> /* 142 */
> // We don't need this NULL check because NULL is filtered out 
> in `$"b" =!=2`
> /* 143 */   if (agg_isNull4) {
> /* 144 */ agg_rowWriter1.setNullAt(0);
> /* 145 */   } else {
> /* 146 */ agg_rowWriter1.write(0, agg_value4);
> /* 147 */   }
> /* 148 */   append(agg_result1);
> /* 149 */
> /* 150 */   if (shouldStop()) return;
> /* 151 */ }
> /* 152 */
> /* 153 */ agg_mapIter.close();
> /* 154 */ if (agg_sorter == null) {
> /* 155 */   agg_hashMap.free();
> /* 156 */ }
> /* 157 */   }
> /* 158 */
> /* 159 */ }
> {code}
> In line 143, we don't need this NULL check because NULL is filtered out 
> by `$"b" =!= 2`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21351) Update nullability based on children's output in optimized logical plan

2019-01-10 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-21351:
-
Fix Version/s: (was: 2.4.0)

> Update nullability based on children's output in optimized logical plan
> ---
>
> Key: SPARK-21351
> URL: https://issues.apache.org/jira/browse/SPARK-21351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
>
> In master, optimized plans do not respect the nullability that `Filter` 
> might change when it contains `IsNotNull`.
> This generates unnecessary code for NULL checks. For example:
> {code}
> scala> val df = Seq((Some(1), Some(2))).toDF("a", "b")
> scala> val bIsNotNull = df.where($"b" =!= 2).select($"b")
> scala> val targetQuery = bIsNotNull.distinct
> scala> targetQuery.queryExecution.optimizedPlan.output(0).nullable
> res5: Boolean = true
> scala> targetQuery.debugCodegen
> Found 2 WholeStageCodegen subtrees.
> == Subtree 1 / 2 ==
> *HashAggregate(keys=[b#19], functions=[], output=[b#19])
> +- Exchange hashpartitioning(b#19, 200)
>+- *HashAggregate(keys=[b#19], functions=[], output=[b#19])
>   +- *Project [_2#16 AS b#19]
>  +- *Filter isnotnull(_2#16)
> +- LocalTableScan [_1#15, _2#16]
> Generated code:
> ...
> /* 124 */   protected void processNext() throws java.io.IOException {
> ...
> /* 132 */ // output the result
> /* 133 */
> /* 134 */ while (agg_mapIter.next()) {
> /* 135 */   wholestagecodegen_numOutputRows.add(1);
> /* 136 */   UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey();
> /* 137 */   UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue();
> /* 138 */
> /* 139 */   boolean agg_isNull4 = agg_aggKey.isNullAt(0);
> /* 140 */   int agg_value4 = agg_isNull4 ? -1 : (agg_aggKey.getInt(0));
> /* 141 */   agg_rowWriter1.zeroOutNullBytes();
> /* 142 */
> // We don't need this NULL check because NULL is filtered out 
> in `$"b" =!=2`
> /* 143 */   if (agg_isNull4) {
> /* 144 */ agg_rowWriter1.setNullAt(0);
> /* 145 */   } else {
> /* 146 */ agg_rowWriter1.write(0, agg_value4);
> /* 147 */   }
> /* 148 */   append(agg_result1);
> /* 149 */
> /* 150 */   if (shouldStop()) return;
> /* 151 */ }
> /* 152 */
> /* 153 */ agg_mapIter.close();
> /* 154 */ if (agg_sorter == null) {
> /* 155 */   agg_hashMap.free();
> /* 156 */ }
> /* 157 */   }
> /* 158 */
> /* 159 */ }
> {code}
> In line 143, we don't need this NULL check because NULL is filtered out 
> by `$"b" =!= 2`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21351) Update nullability based on children's output in optimized logical plan

2019-01-10 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro reopened SPARK-21351:
--

> Update nullability based on children's output in optimized logical plan
> ---
>
> Key: SPARK-21351
> URL: https://issues.apache.org/jira/browse/SPARK-21351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.4.0
>
>
> In master, optimized plans do not respect the nullability that `Filter` 
> might change when it contains `IsNotNull`.
> This generates unnecessary code for NULL checks. For example:
> {code}
> scala> val df = Seq((Some(1), Some(2))).toDF("a", "b")
> scala> val bIsNotNull = df.where($"b" =!= 2).select($"b")
> scala> val targetQuery = bIsNotNull.distinct
> scala> targetQuery.queryExecution.optimizedPlan.output(0).nullable
> res5: Boolean = true
> scala> targetQuery.debugCodegen
> Found 2 WholeStageCodegen subtrees.
> == Subtree 1 / 2 ==
> *HashAggregate(keys=[b#19], functions=[], output=[b#19])
> +- Exchange hashpartitioning(b#19, 200)
>+- *HashAggregate(keys=[b#19], functions=[], output=[b#19])
>   +- *Project [_2#16 AS b#19]
>  +- *Filter isnotnull(_2#16)
> +- LocalTableScan [_1#15, _2#16]
> Generated code:
> ...
> /* 124 */   protected void processNext() throws java.io.IOException {
> ...
> /* 132 */ // output the result
> /* 133 */
> /* 134 */ while (agg_mapIter.next()) {
> /* 135 */   wholestagecodegen_numOutputRows.add(1);
> /* 136 */   UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey();
> /* 137 */   UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue();
> /* 138 */
> /* 139 */   boolean agg_isNull4 = agg_aggKey.isNullAt(0);
> /* 140 */   int agg_value4 = agg_isNull4 ? -1 : (agg_aggKey.getInt(0));
> /* 141 */   agg_rowWriter1.zeroOutNullBytes();
> /* 142 */
> // We don't need this NULL check because NULL is filtered out 
> in `$"b" =!=2`
> /* 143 */   if (agg_isNull4) {
> /* 144 */ agg_rowWriter1.setNullAt(0);
> /* 145 */   } else {
> /* 146 */ agg_rowWriter1.write(0, agg_value4);
> /* 147 */   }
> /* 148 */   append(agg_result1);
> /* 149 */
> /* 150 */   if (shouldStop()) return;
> /* 151 */ }
> /* 152 */
> /* 153 */ agg_mapIter.close();
> /* 154 */ if (agg_sorter == null) {
> /* 155 */   agg_hashMap.free();
> /* 156 */ }
> /* 157 */   }
> /* 158 */
> /* 159 */ }
> {code}
> In line 143, we don't need this NULL check because NULL is filtered out 
> by `$"b" =!= 2`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26587) Deadlock between SparkUI thread and Driver thread

2019-01-10 Thread Vitaliy Savkin (JIRA)
Vitaliy Savkin created SPARK-26587:
--

 Summary: Deadlock between SparkUI thread and Driver thread  
 Key: SPARK-26587
 URL: https://issues.apache.org/jira/browse/SPARK-26587
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
 Environment: EMR 5.9.0
Reporter: Vitaliy Savkin


About once a month (out of ~1000 runs), one of our Spark applications freezes. 
jstack reports a deadlock. Please see locks 0x802c00c0 and 
0x8271bb98 in the stack traces below.
{noformat}
"Driver":
at java.lang.Package.getSystemPackage(Package.java:540)
- waiting to lock <0x802c00c0> (a java.util.HashMap)
at java.lang.ClassLoader.getPackage(ClassLoader.java:1625)
at java.net.URLClassLoader.getAndVerifyPackage(URLClassLoader.java:394)
at java.net.URLClassLoader.definePackageInternal(URLClassLoader.java:420)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:452)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
- locked <0x82789598> (a 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:221)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:210)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
- locked <0x82789540> (a 
org.apache.spark.sql.internal.NonClosableMutableURLClassLoader)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:370)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at javax.xml.parsers.FactoryFinder$1.run(FactoryFinder.java:294)
at java.security.AccessController.doPrivileged(Native Method)
at javax.xml.parsers.FactoryFinder.findServiceProvider(FactoryFinder.java:289)
at javax.xml.parsers.FactoryFinder.find(FactoryFinder.java:267)
at 
javax.xml.parsers.DocumentBuilderFactory.newInstance(DocumentBuilderFactory.java:120)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2516)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2492)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2405)
- locked <0x8271bb98> (a org.apache.hadoop.conf.Configuration)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:981)
at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1031)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2189)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2702)
at 
org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74)
at java.net.URL.getURLStreamHandler(URL.java:1142)
at java.net.URL.<init>(URL.java:599)
at java.net.URL.<init>(URL.java:490)
at java.net.URL.<init>(URL.java:439)
at java.net.JarURLConnection.parseSpecs(JarURLConnection.java:175)
at java.net.JarURLConnection.<init>(JarURLConnection.java:158)
at sun.net.www.protocol.jar.JarURLConnection.<init>(JarURLConnection.java:81)
at sun.net.www.protocol.jar.Handler.openConnection(Handler.java:41)
at java.net.URL.openConnection(URL.java:979)
at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:238)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:216)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:210)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
- locked <0x82789540> (a 
org.apache.spark.sql.internal.NonClosableMutableURLClassLoader)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:262)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:362)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:266)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
- locked <0x8302a120> (a org.apache.spark.sql.hive.HiveExternalCatalog)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:194)
at 
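For what it is worth, a deadlock like this can also be detected from inside the JVM without jstack. The sketch below uses the standard ThreadMXBean API; the object name and how it would be wired into a monitoring thread are illustrative, not part of the report.

{code:scala}
import java.lang.management.ManagementFactory

// Illustrative probe: print any deadlocked threads together with the
// monitors and synchronizers they hold, similar to what jstack reports.
object DeadlockProbe {
  def report(): Unit = {
    val mx = ManagementFactory.getThreadMXBean
    val ids = Option(mx.findDeadlockedThreads()).getOrElse(Array.empty[Long])
    if (ids.isEmpty) {
      println("No deadlocked threads detected")
    } else {
      mx.getThreadInfo(ids, true, true).foreach(info => println(info))
    }
  }
}
{code}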

[jira] [Resolved] (SPARK-26459) remove UpdateNullabilityInAttributeReferences

2019-01-10 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-26459.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23390
https://github.com/apache/spark/pull/23390

> remove UpdateNullabilityInAttributeReferences
> -
>
> Key: SPARK-26459
> URL: https://issues.apache.org/jira/browse/SPARK-26459
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26574) Cloud sql stronge

2019-01-10 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739237#comment-16739237
 ] 

Hyukjin Kwon commented on SPARK-26574:
--

Please avoid setting Critical+, which is usually reserved for committers.

> Cloud sql stronge
> -
>
> Key: SPARK-26574
> URL: https://issues.apache.org/jira/browse/SPARK-26574
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Mesos, SQL
>Affects Versions: 2.3.2
>Reporter: Roufique Hossain
>Priority: Major
>   Original Estimate: 8,509h
>  Remaining Estimate: 8,509h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26574) Cloud sql stronge

2019-01-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26574:
-
Priority: Major  (was: Blocker)

> Cloud sql stronge
> -
>
> Key: SPARK-26574
> URL: https://issues.apache.org/jira/browse/SPARK-26574
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Mesos, SQL
>Affects Versions: 2.3.2
>Reporter: Roufique Hossain
>Priority: Major
>   Original Estimate: 8,509h
>  Remaining Estimate: 8,509h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26574) Cloud sql stronge

2019-01-10 Thread Roufique Hossain (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roufique Hossain updated SPARK-26574:
-
Priority: Blocker  (was: Major)

> Cloud sql stronge
> -
>
> Key: SPARK-26574
> URL: https://issues.apache.org/jira/browse/SPARK-26574
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Mesos, SQL
>Affects Versions: 2.3.2
>Reporter: Roufique Hossain
>Priority: Blocker
>   Original Estimate: 8,509h
>  Remaining Estimate: 8,509h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26583) Add `paranamer` dependency to `core` module

2019-01-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26583.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 3.0.0
   2.4.1

This is resolved via https://github.com/apache/spark/pull/23502

> Add `paranamer` dependency to `core` module
> ---
>
> Key: SPARK-26583
> URL: https://issues.apache.org/jira/browse/SPARK-26583
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> With the Scala 2.12 profile, Spark itself is okay but Spark applications can fail. For 
> example, our documented `SimpleApp` example compiles successfully but fails 
> at runtime because it does not pull in `paranamer 2.8` and hits SPARK-22128.
> https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html
> {code}
> $ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer
> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple ---
> [INFO] my.test:simple:jar:1.0-SNAPSHOT
> [INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile
> [INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile
> [INFO]   \- org.apache.avro:avro:jar:1.8.2:compile
> [INFO]  \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile
> {code}
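As a possible application-side workaround (a sketch only, assuming an sbt build; the Spark-side fix is adding the dependency to the `core` module as this issue describes), the application can pin paranamer 2.8 explicitly so the Avro path does not hit SPARK-22128:

{code:scala}
// build.sbt (sketch; coordinates are the standard ones, versions illustrative)
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0-SNAPSHOT" % "provided"

// Force paranamer 2.8 instead of the 2.7 pulled in transitively via avro 1.8.2.
dependencyOverrides += "com.thoughtworks.paranamer" % "paranamer" % "2.8"
{code}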



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26572) Join on distinct column with monotonically_increasing_id produces wrong output

2019-01-10 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-26572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sören Reichardt updated SPARK-26572:

Description: 
When a table that projects a monotonically_increasing_id column after calling distinct 
is joined with another table, the operators are not executed in the right order. 

Here is a minimal example:
{code:java}
import org.apache.spark.sql.{DataFrame, SparkSession, functions}

object JoinBug extends App {

  // Spark session setup
  val session =  SparkSession.builder().master("local[*]").getOrCreate()
  import session.sqlContext.implicits._
  session.sparkContext.setLogLevel("error")

  // Bug in Spark: "monotonically_increasing_id" is pushed down when it shouldn't be.
  // Push down only happens when the DF containing the "monotonically_increasing_id"
  // expression is on the left side of the join.
  val baseTable = Seq((1), (1)).toDF("idx")
  val distinctWithId = baseTable.distinct.withColumn("id", 
functions.monotonically_increasing_id())
  val monotonicallyOnRight: DataFrame = baseTable.join(distinctWithId, "idx")
  val monotonicallyOnLeft: DataFrame = distinctWithId.join(baseTable, "idx")

  monotonicallyOnLeft.show // Wrong
  monotonicallyOnRight.show // Ok in Spark 2.2.2 - also wrong in Spark 2.4.0

}

{code}
It produces the following output:
{code:java}
Wrong:
+---+------------+
|idx|          id|
+---+------------+
|  1|369367187456|
|  1|369367187457|
+---+------------+

Right:
+---+------------+
|idx|          id|
+---+------------+
|  1|369367187456|
|  1|369367187456|
+---+------------+
{code}
We assume that the join triggers a pushdown of the expression 
(monotonically_increasing_id in this case) so that it is executed before the 
distinct, which produces non-distinct rows with unique ids. In Spark 2.2.2 this 
behavior only appears when the table with the projected expression is on the 
left side of the join; in 2.4.0 both joins produce wrong results.

  was:
When joining a table with projected monotonically_increasing_id column after 
calling distinct with another table the operators do not get executed in the 
right order. 

Here is a minimal example:
{code:java}
import org.apache.spark.sql.{DataFrame, SparkSession, functions}

object JoinBug extends App {

  // Spark session setup
  val session =  SparkSession.builder().master("local[*]").getOrCreate()
  import session.sqlContext.implicits._
  session.sparkContext.setLogLevel("error")

  // Bug in Spark: "monotonically_increasing_id" is pushed down when it shouldn't be.
  // Push down only happens when the DF containing the "monotonically_increasing_id"
  // expression is on the left side of the join.
  val baseTable = Seq((1), (1)).toDF("idx")
  val distinctWithId = baseTable.distinct.withColumn("id", 
functions.monotonically_increasing_id())
  val monotonicallyOnRight: DataFrame = baseTable.join(distinctWithId, "idx")
  val monotonicallyOnLeft: DataFrame = distinctWithId.join(baseTable, "idx")

  monotonicallyOnLeft.show // Wrong
  monotonicallyOnRight.show // Ok in Spark 2.2.2 - also wrong in Spark 2.4.0

}

{code}
It produces the following output:
{code:java}
Wrong:
+---+------------+
|idx|          id|
+---+------------+
|  1|369367187456|
|  1|369367187457|
+---+------------+

Right:
+---+------------+
|idx|          id|
+---+------------+
|  1|369367187456|
|  1|369367187456|
+---+------------+
{code}
We assume that the join operator triggers a pushdown of expressions 
(monotonically_increasing_id in this case) which gets pushed down to be 
executed before distinct. This produces non-distinct rows with unique id's. 
However it seems like this behavior only appears if the table with the 
projected expression is on the left side of the join.
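A possible mitigation on the user side (a sketch only, not a fix for the underlying optimizer behavior; it assumes caching the distinct result is acceptable for the data size) is to materialize the distinct-plus-id DataFrame before joining, so the nondeterministic id is computed exactly once:

{code:scala}
import org.apache.spark.sql.{SparkSession, functions}

object JoinBugWorkaround extends App {
  val session = SparkSession.builder().master("local[*]").getOrCreate()
  import session.implicits._

  val baseTable = Seq(1, 1).toDF("idx")

  // Cache and materialize the ids before the join so they cannot be
  // recomputed (and pushed below the distinct) during join planning.
  val distinctWithId = baseTable.distinct()
    .withColumn("id", functions.monotonically_increasing_id())
    .cache()
  distinctWithId.count()

  distinctWithId.join(baseTable, "idx").show()
  baseTable.join(distinctWithId, "idx").show()
}
{code}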


> Join on distinct column with monotonically_increasing_id produces wrong output
> --
>
> Key: SPARK-26572
> URL: https://issues.apache.org/jira/browse/SPARK-26572
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.4.0
> Environment: Running on Ubuntu 18.04LTS and Intellij 2018.2.5
>Reporter: Sören Reichardt
>Priority: Major
>
> When joining a table with projected monotonically_increasing_id column after 
> calling distinct with another table the operators do not get executed in the 
> right order. 
> Here is a minimal example:
> {code:java}
> import org.apache.spark.sql.{DataFrame, SparkSession, functions}
> object JoinBug extends App {
>   // Spark session setup
>   val session =  SparkSession.builder().master("local[*]").getOrCreate()
>   import session.sqlContext.implicits._
>   session.sparkContext.setLogLevel("error")
>   // Bug in Spark: "monotonically_increasing_id" is pushed down when it 
> shouldn't be. Push down only happens when the

[jira] [Assigned] (SPARK-26576) Broadcast hint not applied to partitioned table

2019-01-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26576:


Assignee: Apache Spark

> Broadcast hint not applied to partitioned table
> ---
>
> Key: SPARK-26576
> URL: https://issues.apache.org/jira/browse/SPARK-26576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: John Zhuge
>Assignee: Apache Spark
>Priority: Major
>
> Broadcast hint is not applied to a partitioned Parquet table. Below, 
> "SortMergeJoin" is chosen incorrectly and "ResolvedHint(broadcast)" is removed 
> in the Optimized Plan.
> {noformat}
> scala> spark.sql("CREATE TABLE jzhuge.parquet_with_part (val STRING) 
> PARTITIONED BY (dateint INT) STORED AS parquet")
> scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
> scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => 
> df.join(broadcast(df), "dateint").explain(true))
> == Parsed Logical Plan ==
> 'Join UsingJoin(Inner,List(dateint))
> :- SubqueryAlias `jzhuge`.`parquet_with_part`
> :  +- Relation[val#28,dateint#29] parquet
> +- ResolvedHint (broadcast)
>+- SubqueryAlias `jzhuge`.`parquet_with_part`
>   +- Relation[val#32,dateint#33] parquet
> == Analyzed Logical Plan ==
> dateint: int, val: string, val: string
> Project [dateint#29, val#28, val#32]
> +- Join Inner, (dateint#29 = dateint#33)
>:- SubqueryAlias `jzhuge`.`parquet_with_part`
>:  +- Relation[val#28,dateint#29] parquet
>+- ResolvedHint (broadcast)
>   +- SubqueryAlias `jzhuge`.`parquet_with_part`
>  +- Relation[val#32,dateint#33] parquet
> == Optimized Logical Plan ==
> Project [dateint#29, val#28, val#32]
> +- Join Inner, (dateint#29 = dateint#33)
>:- Project [val#28, dateint#29]
>:  +- Filter isnotnull(dateint#29)
>: +- Relation[val#28,dateint#29] parquet
>+- Project [val#32, dateint#33]
>   +- Filter isnotnull(dateint#33)
>  +- Relation[val#32,dateint#33] parquet
> == Physical Plan ==
> *(5) Project [dateint#29, val#28, val#32]
> +- *(5) SortMergeJoin [dateint#29], [dateint#33], Inner
>:- *(2) Sort [dateint#29 ASC NULLS FIRST], false, 0
>:  +- Exchange(coordinator id: 55629191) hashpartitioning(dateint#29, 
> 500), coordinator[target post-shuffle partition size: 67108864]
>: +- *(1) FileScan parquet jzhuge.parquet_with_part[val#28,dateint#29] 
> Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[], 
> PartitionCount: 0, PartitionFilters: [isnotnull(dateint#29)], PushedFilters: 
> [], ReadSchema: struct
>+- *(4) Sort [dateint#33 ASC NULLS FIRST], false, 0
>   +- ReusedExchange [val#32, dateint#33], Exchange(coordinator id: 
> 55629191) hashpartitioning(dateint#29, 500), coordinator[target post-shuffle 
> partition size: 67108864]
> {noformat}
> Broadcast hint is applied to a Parquet table without partitions. Below, 
> "BroadcastHashJoin" is chosen as expected.
> {noformat}
> scala> spark.sql("CREATE TABLE jzhuge.parquet_no_part (val STRING, dateint 
> INT) STORED AS parquet")
> scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
> scala> Seq(spark.table("jzhuge.parquet_no_part")).map(df => 
> df.join(broadcast(df), "dateint").explain(true))
> == Parsed Logical Plan ==
> 'Join UsingJoin(Inner,List(dateint))
> :- SubqueryAlias `jzhuge`.`parquet_no_part`
> :  +- Relation[val#44,dateint#45] parquet
> +- ResolvedHint (broadcast)
>+- SubqueryAlias `jzhuge`.`parquet_no_part`
>   +- Relation[val#50,dateint#51] parquet
> == Analyzed Logical Plan ==
> dateint: int, val: string, val: string
> Project [dateint#45, val#44, val#50]
> +- Join Inner, (dateint#45 = dateint#51)
>:- SubqueryAlias `jzhuge`.`parquet_no_part`
>:  +- Relation[val#44,dateint#45] parquet
>+- ResolvedHint (broadcast)
>   +- SubqueryAlias `jzhuge`.`parquet_no_part`
>  +- Relation[val#50,dateint#51] parquet
> == Optimized Logical Plan ==
> Project [dateint#45, val#44, val#50]
> +- Join Inner, (dateint#45 = dateint#51)
>:- Filter isnotnull(dateint#45)
>:  +- Relation[val#44,dateint#45] parquet
>+- ResolvedHint (broadcast)
>   +- Filter isnotnull(dateint#51)
>  +- Relation[val#50,dateint#51] parquet
> == Physical Plan ==
> *(2) Project [dateint#45, val#44, val#50]
> +- *(2) BroadcastHashJoin [dateint#45], [dateint#51], Inner, BuildRight
>:- *(2) Project [val#44, dateint#45]
>:  +- *(2) Filter isnotnull(dateint#45)
>: +- *(2) FileScan parquet jzhuge.parquet_no_part[val#44,dateint#45] 
> Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], 
> PartitionFilters: [], PushedFilters: [IsNotNull(dateint)], ReadSchema: 
> struct
>+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[1, int, 
> true] as 

[jira] [Assigned] (SPARK-26576) Broadcast hint not applied to partitioned table

2019-01-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26576:


Assignee: (was: Apache Spark)

> Broadcast hint not applied to partitioned table
> ---
>
> Key: SPARK-26576
> URL: https://issues.apache.org/jira/browse/SPARK-26576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: John Zhuge
>Priority: Major
>
> Broadcast hint is not applied to a partitioned Parquet table. Below, 
> "SortMergeJoin" is chosen incorrectly and "ResolvedHint(broadcast)" is removed 
> in the Optimized Plan.
> {noformat}
> scala> spark.sql("CREATE TABLE jzhuge.parquet_with_part (val STRING) 
> PARTITIONED BY (dateint INT) STORED AS parquet")
> scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
> scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => 
> df.join(broadcast(df), "dateint").explain(true))
> == Parsed Logical Plan ==
> 'Join UsingJoin(Inner,List(dateint))
> :- SubqueryAlias `jzhuge`.`parquet_with_part`
> :  +- Relation[val#28,dateint#29] parquet
> +- ResolvedHint (broadcast)
>+- SubqueryAlias `jzhuge`.`parquet_with_part`
>   +- Relation[val#32,dateint#33] parquet
> == Analyzed Logical Plan ==
> dateint: int, val: string, val: string
> Project [dateint#29, val#28, val#32]
> +- Join Inner, (dateint#29 = dateint#33)
>:- SubqueryAlias `jzhuge`.`parquet_with_part`
>:  +- Relation[val#28,dateint#29] parquet
>+- ResolvedHint (broadcast)
>   +- SubqueryAlias `jzhuge`.`parquet_with_part`
>  +- Relation[val#32,dateint#33] parquet
> == Optimized Logical Plan ==
> Project [dateint#29, val#28, val#32]
> +- Join Inner, (dateint#29 = dateint#33)
>:- Project [val#28, dateint#29]
>:  +- Filter isnotnull(dateint#29)
>: +- Relation[val#28,dateint#29] parquet
>+- Project [val#32, dateint#33]
>   +- Filter isnotnull(dateint#33)
>  +- Relation[val#32,dateint#33] parquet
> == Physical Plan ==
> *(5) Project [dateint#29, val#28, val#32]
> +- *(5) SortMergeJoin [dateint#29], [dateint#33], Inner
>:- *(2) Sort [dateint#29 ASC NULLS FIRST], false, 0
>:  +- Exchange(coordinator id: 55629191) hashpartitioning(dateint#29, 
> 500), coordinator[target post-shuffle partition size: 67108864]
>: +- *(1) FileScan parquet jzhuge.parquet_with_part[val#28,dateint#29] 
> Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[], 
> PartitionCount: 0, PartitionFilters: [isnotnull(dateint#29)], PushedFilters: 
> [], ReadSchema: struct
>+- *(4) Sort [dateint#33 ASC NULLS FIRST], false, 0
>   +- ReusedExchange [val#32, dateint#33], Exchange(coordinator id: 
> 55629191) hashpartitioning(dateint#29, 500), coordinator[target post-shuffle 
> partition size: 67108864]
> {noformat}
> Broadcast hint is applied to a Parquet table without partitions. Below, 
> "BroadcastHashJoin" is chosen as expected.
> {noformat}
> scala> spark.sql("CREATE TABLE jzhuge.parquet_no_part (val STRING, dateint 
> INT) STORED AS parquet")
> scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
> scala> Seq(spark.table("jzhuge.parquet_no_part")).map(df => 
> df.join(broadcast(df), "dateint").explain(true))
> == Parsed Logical Plan ==
> 'Join UsingJoin(Inner,List(dateint))
> :- SubqueryAlias `jzhuge`.`parquet_no_part`
> :  +- Relation[val#44,dateint#45] parquet
> +- ResolvedHint (broadcast)
>+- SubqueryAlias `jzhuge`.`parquet_no_part`
>   +- Relation[val#50,dateint#51] parquet
> == Analyzed Logical Plan ==
> dateint: int, val: string, val: string
> Project [dateint#45, val#44, val#50]
> +- Join Inner, (dateint#45 = dateint#51)
>:- SubqueryAlias `jzhuge`.`parquet_no_part`
>:  +- Relation[val#44,dateint#45] parquet
>+- ResolvedHint (broadcast)
>   +- SubqueryAlias `jzhuge`.`parquet_no_part`
>  +- Relation[val#50,dateint#51] parquet
> == Optimized Logical Plan ==
> Project [dateint#45, val#44, val#50]
> +- Join Inner, (dateint#45 = dateint#51)
>:- Filter isnotnull(dateint#45)
>:  +- Relation[val#44,dateint#45] parquet
>+- ResolvedHint (broadcast)
>   +- Filter isnotnull(dateint#51)
>  +- Relation[val#50,dateint#51] parquet
> == Physical Plan ==
> *(2) Project [dateint#45, val#44, val#50]
> +- *(2) BroadcastHashJoin [dateint#45], [dateint#51], Inner, BuildRight
>:- *(2) Project [val#44, dateint#45]
>:  +- *(2) Filter isnotnull(dateint#45)
>: +- *(2) FileScan parquet jzhuge.parquet_no_part[val#44,dateint#45] 
> Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], 
> PartitionFilters: [], PushedFilters: [IsNotNull(dateint)], ReadSchema: 
> struct
>+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[1, int, 
> true] as bigint)))
>   +- *(1)
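A quick way to check whether the hint survived planning (a sketch, assuming a spark-shell session in which the tables from this report exist, so `spark` is predefined) is to look for BroadcastHashJoin in the executed plan:

{code:scala}
import org.apache.spark.sql.functions.broadcast

// True when the broadcast hint actually resulted in a broadcast join.
val df = spark.table("jzhuge.parquet_with_part")
val joined = df.join(broadcast(df), "dateint")
val usedBroadcast = joined.queryExecution.executedPlan.toString.contains("BroadcastHashJoin")
println(s"BroadcastHashJoin used: $usedBroadcast")  // false while this bug is present
{code}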