[jira] [Updated] (SPARK-30827) Direct relationship is not documented in configurations in "spark.history.*" namespace

2020-02-13 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-30827:
-
Issue Type: Documentation  (was: Bug)

> Direct relationship is not documented in configurations in "spark.history.*" 
> namespace
> --
>
> Key: SPARK-30827
> URL: https://issues.apache.org/jira/browse/SPARK-30827
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> There are configurations under the "spark.history" namespace whose documentation 
> fails to describe the direct relationships between configurations. This issue 
> tracks the effort to fix that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30586) NPE in LiveRDDDistribution (AppStatusListener)

2020-02-13 Thread Saisai Shao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036734#comment-17036734
 ] 

Saisai Shao commented on SPARK-30586:
-

We also hit the same issue. It seems the code does not null-check the string and 
calls String intern directly, which makes Guava throw an NPE. My first thought is 
to add a null check in {{weakIntern}}. I am still investigating how this could 
happen; it might be due to a lost or out-of-order Spark listener event.
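
For reference, a minimal sketch of the null guard being discussed, assuming a helper 
shaped like LiveEntityHelpers.weakIntern (Spark actually uses a shaded Guava under 
org.spark_project.guava; this is illustration only, not the eventual patch):

{code:scala}
import com.google.common.collect.Interners

object WeakInternSketch {
  private val stringInterner = Interners.newWeakInterner[String]()

  // Guava's Interner.intern rejects null, which is where the reported NPE comes from,
  // so pass null through unchanged instead of interning it.
  def weakIntern(s: String): String =
    if (s == null) null else stringInterner.intern(s)
}
{code}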

> NPE in LiveRDDDistribution (AppStatusListener)
> --
>
> Key: SPARK-30586
> URL: https://issues.apache.org/jira/browse/SPARK-30586
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
> Environment: A Hadoop cluster consisting of Centos 7.4 machines.
>Reporter: Jan Van den bosch
>Priority: Major
>
> We've been noticing a great amount of NullPointerExceptions in our 
> long-running Spark job driver logs:
> {noformat}
> 20/01/17 23:40:12 ERROR AsyncEventQueue: Listener AppStatusListener threw an 
> exception
> java.lang.NullPointerException
> at 
> org.spark_project.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
> at 
> org.spark_project.guava.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507)
> at 
> org.spark_project.guava.collect.Interners$WeakInterner.intern(Interners.java:85)
> at 
> org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:603)
> at 
> org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:486)
> at 
> org.apache.spark.status.LiveRDD$$anonfun$2.apply(LiveEntity.scala:548)
> at 
> org.apache.spark.status.LiveRDD$$anonfun$2.apply(LiveEntity.scala:548)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:139)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:139)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:139)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:548)
> at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:49)
> at 
> org.apache.spark.status.AppStatusListener.org$apache$spark$status$AppStatusListener$$update(AppStatusListener.scala:991)
> at 
> org.apache.spark.status.AppStatusListener.org$apache$spark$status$AppStatusListener$$maybeUpdate(AppStatusListener.scala:997)
> at 
> org.apache.spark.status.AppStatusListener$$anonfun$onExecutorMetricsUpdate$2.apply(AppStatusListener.scala:764)
> at 
> org.apache.spark.status.AppStatusListener$$anonfun$onExecutorMetricsUpdate$2.apply(AppStatusListener.scala:764)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:139)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:139)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:139)
> at 
> org.apache.spark.status.AppStatusListener.org$apache$spark$status$AppStatusListener$$flush(AppStatusListener.scala:788)
> at 
> org.apache.spark.status.AppStatusListener.onExecutorMetricsUpdate(AppStatusListener.scala:764)
> at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:59)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
> at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)

[jira] [Comment Edited] (SPARK-30586) NPE in LiveRDDDistribution (AppStatusListener)

2020-02-13 Thread Saisai Shao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036734#comment-17036734
 ] 

Saisai Shao edited comment on SPARK-30586 at 2/14/20 7:13 AM:
--

We also hit the same issue. It seems the code does not null-check the string and 
calls String intern directly, which makes Guava throw an NPE. My first thought is 
to add a null check in {{weakIntern}}. I am still investigating how this could 
happen; it might be due to a lost or out-of-order Spark listener event.

CC [~vanzin]


was (Author: jerryshao):
We also hit the same issue. It seems the code does not null-check the string and 
calls String intern directly, which makes Guava throw an NPE. My first thought is 
to add a null check in {{weakIntern}}. I am still investigating how this could 
happen; it might be due to a lost or out-of-order Spark listener event.

> NPE in LiveRDDDistribution (AppStatusListener)
> --
>
> Key: SPARK-30586
> URL: https://issues.apache.org/jira/browse/SPARK-30586
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
> Environment: A Hadoop cluster consisting of Centos 7.4 machines.
>Reporter: Jan Van den bosch
>Priority: Major
>
> We've been noticing a great amount of NullPointerExceptions in our 
> long-running Spark job driver logs:
> {noformat}
> 20/01/17 23:40:12 ERROR AsyncEventQueue: Listener AppStatusListener threw an 
> exception
> java.lang.NullPointerException
> at 
> org.spark_project.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
> at 
> org.spark_project.guava.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507)
> at 
> org.spark_project.guava.collect.Interners$WeakInterner.intern(Interners.java:85)
> at 
> org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:603)
> at 
> org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:486)
> at 
> org.apache.spark.status.LiveRDD$$anonfun$2.apply(LiveEntity.scala:548)
> at 
> org.apache.spark.status.LiveRDD$$anonfun$2.apply(LiveEntity.scala:548)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:139)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:139)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:139)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:548)
> at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:49)
> at 
> org.apache.spark.status.AppStatusListener.org$apache$spark$status$AppStatusListener$$update(AppStatusListener.scala:991)
> at 
> org.apache.spark.status.AppStatusListener.org$apache$spark$status$AppStatusListener$$maybeUpdate(AppStatusListener.scala:997)
> at 
> org.apache.spark.status.AppStatusListener$$anonfun$onExecutorMetricsUpdate$2.apply(AppStatusListener.scala:764)
> at 
> org.apache.spark.status.AppStatusListener$$anonfun$onExecutorMetricsUpdate$2.apply(AppStatusListener.scala:764)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:139)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:139)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:139)
> at 
> org.apache.spark.status.AppStatusListener.org$apache$spark$status$AppStatusListener$$flush(AppStatusListener.scala:788)
> at 
> org.apache.spark.status.AppStatusListener.onExecutorMetricsUpdate(AppStatusListener.scala:764)
> at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:59)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
> at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
> at 
> org.apache.spark.scheduler.As

[jira] [Created] (SPARK-30827) Direct relationship is not documented in configurations in "spark.history.*" namespace

2020-02-13 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-30827:


 Summary: Direct relationship is not documented in configurations 
in "spark.history.*" namespace
 Key: SPARK-30827
 URL: https://issues.apache.org/jira/browse/SPARK-30827
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


There are configurations under the "spark.history" namespace whose documentation fails 
to describe the direct relationships between configurations. This issue tracks the 
effort to fix that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30826) LIKE returns wrong result from external table using parquet

2020-02-13 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30826:
--

 Summary: LIKE returns wrong result from external table using 
parquet
 Key: SPARK-30826
 URL: https://issues.apache.org/jira/browse/SPARK-30826
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5, 3.0.0
Reporter: Maxim Gekk


# Write a parquet file with a column in upper case:
{code:scala}
Seq("42").toDF("COL").write.parquet(path)
{code}
# Create an external table on top of the written parquet files with a column in 
lower case
{code:sql}
CREATE TABLE t1 (col STRING)
USING parquet
OPTIONS (path '$path')
{code}
# Read the table using LIKE
{code:sql}
SELECT * FROM t1 WHERE col LIKE '4%'
{code}
> It returns an empty result set, but it should return one row containing 42.
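
For convenience, the three steps above combined into a single spark-shell sketch (the 
path is a placeholder; the behaviour is as reported in the description, not re-verified here):

{code:scala}
// Assumes a spark-shell session, where `spark` and spark.implicits._ are available.
import spark.implicits._

val path = "/tmp/spark-30826"                          // placeholder location
Seq("42").toDF("COL").write.parquet(path)              // parquet column written in upper case

spark.sql(s"CREATE TABLE t1 (col STRING) USING parquet OPTIONS (path '$path')")

// Expected: one row containing 42; reported: an empty result.
spark.sql("SELECT * FROM t1 WHERE col LIKE '4%'").show()
{code}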




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30748) Storage Memory in Spark Web UI means

2020-02-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30748.
--
Resolution: Invalid

Please ask questions on the mailing lists. See 
https://spark.apache.org/community.html

> Storage Memory in Spark Web UI means
> 
>
> Key: SPARK-30748
> URL: https://issues.apache.org/jira/browse/SPARK-30748
> Project: Spark
>  Issue Type: Question
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: islandshinji
>Priority: Minor
>
> Does the denominator of 'Storage Memory' in the Spark Web UI include execution 
> memory?
> In my environment, 'spark.executor.memory' is set to 20g and the denominator of 
> 'Storage Memory' is 11.3g. That seems too large to cover storage memory alone.
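
As a rough, hedged illustration of where a number like that can come from: under the 
unified memory manager, the UI's 'Storage Memory' total is roughly the whole unified 
pool (storage plus execution), i.e. (usable heap - 300MB reserved) * spark.memory.fraction. 
The JVM-overhead percentage below is an assumption, not a measurement from the reporter's cluster:

{code:scala}
// Back-of-envelope sketch of the "Storage Memory" denominator under unified memory management.
val executorHeapMB = 20L * 1024                      // spark.executor.memory = 20g
val usableHeapMB   = (executorHeapMB * 0.94).toLong  // Runtime.maxMemory is below -Xmx (assumed ~6% overhead)
val reservedMB     = 300L                            // reserved system memory
val memoryFraction = 0.6                             // default spark.memory.fraction
val unifiedMB      = ((usableHeapMB - reservedMB) * memoryFraction).toLong

println(f"unified (storage + execution) pool ~ ${unifiedMB / 1024.0}%.1f GB") // ~ 11 GB, near the reported 11.3g
{code}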



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30769) insertInto() with existing column as partition key cause weird partition result

2020-02-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036686#comment-17036686
 ] 

Hyukjin Kwon commented on SPARK-30769:
--

Please avoid setting Critical+ priority, which is reserved for committers. Are you able to 
provide a full and self-contained reproducer?

> insertInto() with existing column as partition key cause weird partition 
> result
> ---
>
> Key: SPARK-30769
> URL: https://issues.apache.org/jira/browse/SPARK-30769
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: EMR 5.29.0 with Spark 2.4.4
>Reporter: Woong Seok Kang
>Priority: Major
>
> {code:java}
> val tableName = s"${config.service}_$saveDatabase.${config.table}_partitioned"
> val writer = TableWriter.getWriter(tableDF.withColumn(config.dateColumn, 
> typedLit[String](date.toString))) 
> if (xsc.tableExistIn(config.service, saveDatabase, 
> s"${config.table}_partitioned")) writer.insertInto(tableName)
> else writer.partitionBy(config.dateColumn).saveAsTable(tableName){code}
> This code checks whether the table exists at the desired path (somewhere in S3 in 
> this case). If the table already exists at that path, a new partition is inserted 
> with the insertInto() function.
> If config.dateColumn does not exist in the table schema, there is no problem (the 
> new column is simply added), but if it already exists in the schema, Spark does 
> not use the given column as the partition key; instead it creates hundreds of 
> partitions. Below is a part of the Spark logs:
> (Note that the partition column is named date_ymd, which already exists in the 
> source table; the original values are date strings like '2020-01-01'.)
> 20/02/10 05:33:01 INFO S3NativeFileSystem2: rename 
> s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=174
>  s3://\{my_path_at_s3}_partitioned_test/date_ymd=174
> 20/02/10 05:33:02 INFO S3NativeFileSystem2: rename 
> s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=62
>  s3://\{my_path_at_s3}_partitioned_test/date_ymd=62
> 20/02/10 05:33:02 INFO S3NativeFileSystem2: rename 
> s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=83
>  s3://\{my_path_at_s3}_partitioned_test/date_ymd=83
> 20/02/10 05:33:03 INFO S3NativeFileSystem2: rename 
> s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=231
>  s3://\{my_path_at_s3}_partitioned_test/date_ymd=231
> 20/02/10 05:33:03 INFO S3NativeFileSystem2: rename 
> s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=268
>  s3://\{my_path_at_s3}_partitioned_test/date_ymd=268
> 20/02/10 05:33:04 INFO S3NativeFileSystem2: rename 
> s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=33
>  s3://\{my_path_at_s3}_partitioned_test/date_ymd=33
> 20/02/10 05:33:05 INFO S3NativeFileSystem2: rename 
> s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=40
>  s3://\{my_path_at_s3}_partitioned_test/date_ymd=40
> rename 
> s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=__HIVE_DEFAULT_PARTITION__
>  s3://\{my_path_at_s3}_partitioned_test/date_ymd=__HIVE_DEFAULT_PARTITION__
> When I use a different partition key that is not in the table schema, such as 
> 'stamp_date', everything goes fine. I'm not sure whether this is a Spark bug; I 
> just wrote the report. (I think it is related to Hive...)
> Thanks for reading!
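
One hedged observation that may explain the misplaced values: unlike saveAsTable, 
insertInto resolves columns by position rather than by name, so if the DataFrame's 
column order does not put the partition column last (where partitionBy placed it when 
the table was created), another column's values can end up as the partition values. 
A sketch of aligning the columns before inserting, reusing the names from the snippet 
above (tableDF, config, date and tableName are the reporter's objects and assumed to exist):

{code:scala}
import org.apache.spark.sql.functions.{col, typedLit}

val withDate = tableDF.withColumn(config.dateColumn, typedLit[String](date.toString))

// insertInto is position-based: move the partition column to the end so it lines up
// with the table layout created by partitionBy(config.dateColumn).saveAsTable(...).
val dataCols = withDate.columns.filterNot(_ == config.dateColumn)
val aligned  = withDate.select((dataCols :+ config.dateColumn).map(col): _*)

aligned.write.insertInto(tableName)
{code}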



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30774) The default checkpointing interval is not as claimed in the comment.

2020-02-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036684#comment-17036684
 ] 

Hyukjin Kwon commented on SPARK-30774:
--

Can you share reproducible code showing that the comment does not match the behaviour?

> The default checkpointing interval is not as claimed in the comment.
> 
>
> Key: SPARK-30774
> URL: https://issues.apache.org/jira/browse/SPARK-30774
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.5
>Reporter: Kyle Krueger
>Priority: Minor
>
> [https://github.com/apache/spark/blob/71737861531180bbda9aec8d241b1428fe91cab2/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala#L199-L203]
> -The checkpoint duration is set to be the window duration, maybe the idea in 
> the old comment wanting to set to the higher of 10s or window-size is no 
> longer relevant.-
> -I propose we either adapt the comment to just say to just say that we set 
> the checkpoint duration to the window size and clean up how that value is 
> set, or we change the code to do as the comment remarks.-
>  
> So, the original statement I made was wrong. This code is still broken, 
> though. Consider the case where the window duration is 3s: the result would be a 
> checkpoint interval of 12s. That doesn't correspond to the rule implied by the 
> comment and is thus unexpected behaviour.
> This code does, however, make the checkpoint interval a multiple of the 
> slide duration, which is safe as far as I know.
>  
>  
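
For illustration, the rounding rule being described (a sketch of the arithmetic, not 
the actual DStream code): the interval is rounded up to the smallest multiple of the 
duration that is at least 10 seconds.

{code:scala}
def checkpointInterval(durationSeconds: Int): Int =
  durationSeconds * math.ceil(10.0 / durationSeconds).toInt

// A 3-second duration yields 12s rather than the max(10s, duration) the comment suggests.
assert(checkpointInterval(3)  == 12)
assert(checkpointInterval(10) == 10)
assert(checkpointInterval(30) == 30)
{code}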



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30769) insertInto() with existing column as partition key cause weird partition result

2020-02-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30769:
-
Priority: Major  (was: Blocker)

> insertInto() with existing column as partition key cause weird partition 
> result
> ---
>
> Key: SPARK-30769
> URL: https://issues.apache.org/jira/browse/SPARK-30769
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: EMR 5.29.0 with Spark 2.4.4
>Reporter: Woong Seok Kang
>Priority: Major
>
> {code:java}
> val tableName = s"${config.service}_$saveDatabase.${config.table}_partitioned"
> val writer = TableWriter.getWriter(tableDF.withColumn(config.dateColumn, 
> typedLit[String](date.toString))) 
> if (xsc.tableExistIn(config.service, saveDatabase, 
> s"${config.table}_partitioned")) writer.insertInto(tableName)
> else writer.partitionBy(config.dateColumn).saveAsTable(tableName){code}
> This code checks whether the table exists at the desired path (somewhere in S3 in 
> this case). If the table already exists at that path, a new partition is inserted 
> with the insertInto() function.
> If config.dateColumn does not exist in the table schema, there is no problem (the 
> new column is simply added), but if it already exists in the schema, Spark does 
> not use the given column as the partition key; instead it creates hundreds of 
> partitions. Below is a part of the Spark logs:
> (Note that the partition column is named date_ymd, which already exists in the 
> source table; the original values are date strings like '2020-01-01'.)
> 20/02/10 05:33:01 INFO S3NativeFileSystem2: rename 
> s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=174
>  s3://\{my_path_at_s3}_partitioned_test/date_ymd=174
> 20/02/10 05:33:02 INFO S3NativeFileSystem2: rename 
> s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=62
>  s3://\{my_path_at_s3}_partitioned_test/date_ymd=62
> 20/02/10 05:33:02 INFO S3NativeFileSystem2: rename 
> s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=83
>  s3://\{my_path_at_s3}_partitioned_test/date_ymd=83
> 20/02/10 05:33:03 INFO S3NativeFileSystem2: rename 
> s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=231
>  s3://\{my_path_at_s3}_partitioned_test/date_ymd=231
> 20/02/10 05:33:03 INFO S3NativeFileSystem2: rename 
> s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=268
>  s3://\{my_path_at_s3}_partitioned_test/date_ymd=268
> 20/02/10 05:33:04 INFO S3NativeFileSystem2: rename 
> s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=33
>  s3://\{my_path_at_s3}_partitioned_test/date_ymd=33
> 20/02/10 05:33:05 INFO S3NativeFileSystem2: rename 
> s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=40
>  s3://\{my_path_at_s3}_partitioned_test/date_ymd=40
> rename 
> s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=__HIVE_DEFAULT_PARTITION__
>  s3://\{my_path_at_s3}_partitioned_test/date_ymd=__HIVE_DEFAULT_PARTITION__
> When I use a different partition key that is not in the table schema, such as 
> 'stamp_date', everything goes fine. I'm not sure whether this is a Spark bug; I 
> just wrote the report. (I think it is related to Hive...)
> Thanks for reading!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30781) Missing SortedMap type in pyspark

2020-02-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30781.
--
Resolution: Won't Fix

> Missing SortedMap type in pyspark
> -
>
> Key: SPARK-30781
> URL: https://issues.apache.org/jira/browse/SPARK-30781
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.5
>Reporter: Mateusz User
>Priority: Major
>  Labels: features
>
> Currently there is only MapType in the PySpark API, which does not preserve the 
> order of a key-value map.
>  
> *SortedMapType* would fill this gap: a map with sorted key-value pairs (like 
> TreeMap in Java).
>  
> For example:
> *SortedMapType* would be very useful when a user wants to persist rows from a 
> DataFrame into MongoDB.
> A row whose column holds values of an ordered map type:
> col : [1 -> 22, 2 -> 16, 3 -> 25]
> will be persisted as the following JSON:
> {   "1": 22,   "2": 16,   "3": 25 }
>  
> instead of MapType, which currently results in:
> {   "2": 16,   "1": 22,   "3": 25 }
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30781) Missing SortedMap type in pyspark

2020-02-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036683#comment-17036683
 ] 

Hyukjin Kwon commented on SPARK-30781:
--

Currently we don't plan to add this, given that we now have many higher-order 
functions to work around this problem,
and given the development overhead of adding a new type - we would have to consider 
how it works in Scala, Java, R, Python and SQL.
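
As one hedged example of the higher-order-function workaround (a sketch, not an 
endorsed design): sort the map entries by key before serializing, so the JSON comes 
out in key order.

{code:scala}
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a spark-shell / notebook session

val df = Seq(Map(2 -> 16, 1 -> 22, 3 -> 25)).toDF("col")

// map_entries + sort_array orders the entries by key; map_from_entries rebuilds a map
// that keeps that order, so to_json should emit {"1":22,"2":16,"3":25}.
val sorted = df.select(map_from_entries(sort_array(map_entries($"col"))).as("col_sorted"))
sorted.select(to_json($"col_sorted")).show(false)
{code}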

> Missing SortedMap type in pyspark
> -
>
> Key: SPARK-30781
> URL: https://issues.apache.org/jira/browse/SPARK-30781
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.5
>Reporter: Mateusz User
>Priority: Major
>  Labels: features
>
> Currently there is only MapType in the PySpark API, which does not preserve the 
> order of a key-value map.
>  
> *SortedMapType* would fill this gap: a map with sorted key-value pairs (like 
> TreeMap in Java).
>  
> For example:
> *SortedMapType* would be very useful when a user wants to persist rows from a 
> DataFrame into MongoDB.
> A row whose column holds values of an ordered map type:
> col : [1 -> 22, 2 -> 16, 3 -> 25]
> will be persisted as the following JSON:
> {   "1": 22,   "2": 16,   "3": 25 }
>  
> instead of MapType, which currently results in:
> {   "2": 16,   "1": 22,   "3": 25 }
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30787) Add Generic Algorithm optimizer feature to spark-ml

2020-02-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036681#comment-17036681
 ] 

Hyukjin Kwon commented on SPARK-30787:
--

Can you raise this on the dev mailing list?

> Add Generic Algorithm optimizer feature to spark-ml
> ---
>
> Key: SPARK-30787
> URL: https://issues.apache.org/jira/browse/SPARK-30787
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.4.5
>Reporter: louischoi
>Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Hi. 
> It seems that Spark does not have a Generic Algorithm optimizer.
> I think that this kind of algorithm fits well in a distributed system like Spark.
> It is aimed at solving problems like the Traveling Salesman Problem, graph 
> partitioning, optimizing network topology, etc.
>  
> Is there some reason that Spark does not include this feature?
>  
> Can I work on this?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30805) Failed to get locally stored broadcast data: broadcast_30

2020-02-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30805.
--
Resolution: Incomplete

> Failed to get locally stored broadcast data: broadcast_30
> -
>
> Key: SPARK-30805
> URL: https://issues.apache.org/jira/browse/SPARK-30805
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: spark 2.4.4
>Reporter: Jiasi
>Priority: Major
>
> the stack trace is below:
>  
> {quote}20/02/08 04:56:30 ERROR Utils: Exception 
> encounteredjava.io.EOFExceptionat 
> java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:3049)at
>  java.io.ObjectInputStream.readFully(ObjectInputStream.java:1084)at 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation$$anonfun$readExternal$1$$anonfun$apply$mcV$sp$11.apply(HashedRelation.scala:216)at
>  
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation$$anonfun$readExternal$1$$anonfun$apply$mcV$sp$11.apply(HashedRelation.scala:216)at
>  
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation.org$apache$spark$sql$execution$joins$UnsafeHashedRelation$$read(HashedRelation.scala:259)at
>  
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation$$anonfun$readExternal$1.apply$mcV$sp(HashedRelation.scala:216)at
>  
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation$$anonfun$readExternal$1.apply(HashedRelation.scala:216)at
>  
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation$$anonfun$readExternal$1.apply(HashedRelation.scala:216)at
>  org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)at 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation.readExternal(HashedRelation.scala:215)at
>  java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:2062)at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2011)at 
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)at 
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)at
>  
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:168)at
>  org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)at
>  
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)at
>  
> org.apache.spark.storage.BlockManager.maybeCacheDiskValuesInMemory(BlockManager.scala:1312)at
>  
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:612)at
>  
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:214)at
>  scala.Option.getOrElse(Option.scala:121)at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)at
>  org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)at
>  
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)at
>  
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)at
>  
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)at
>  org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:119)at
>  
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:259)at
>  
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:102)at
>  
> org.apache.spark.sql.execution.CodegenSupport$class.constructDoConsumeFunction(WholeStageCodegenExec.scala:216)at
>  
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:187)at
>  
> org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:85)at
>  
> org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:206)at
>  
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:189)at
>  
> org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:374)at
>  
> org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:403)at
>  
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:90)at
>  
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)at
>  
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.sc

[jira] [Commented] (SPARK-30805) Failed to get locally stored broadcast data: broadcast_30

2020-02-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036680#comment-17036680
 ] 

Hyukjin Kwon commented on SPARK-30805:
--

Firstly, please avoid setting Critical+ priority, which is reserved for committers. It seems 
impossible to investigate or reproduce this given only the stack trace.
I am leaving this resolved until someone can reproduce it.

> Failed to get locally stored broadcast data: broadcast_30
> -
>
> Key: SPARK-30805
> URL: https://issues.apache.org/jira/browse/SPARK-30805
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: spark 2.4.4
>Reporter: Jiasi
>Priority: Major
>
> the stack trace is below:
>  
> {quote}20/02/08 04:56:30 ERROR Utils: Exception 
> encounteredjava.io.EOFExceptionat 
> java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:3049)at
>  java.io.ObjectInputStream.readFully(ObjectInputStream.java:1084)at 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation$$anonfun$readExternal$1$$anonfun$apply$mcV$sp$11.apply(HashedRelation.scala:216)at
>  
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation$$anonfun$readExternal$1$$anonfun$apply$mcV$sp$11.apply(HashedRelation.scala:216)at
>  
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation.org$apache$spark$sql$execution$joins$UnsafeHashedRelation$$read(HashedRelation.scala:259)at
>  
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation$$anonfun$readExternal$1.apply$mcV$sp(HashedRelation.scala:216)at
>  
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation$$anonfun$readExternal$1.apply(HashedRelation.scala:216)at
>  
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation$$anonfun$readExternal$1.apply(HashedRelation.scala:216)at
>  org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)at 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation.readExternal(HashedRelation.scala:215)at
>  java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:2062)at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2011)at 
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)at 
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)at
>  
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:168)at
>  org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)at
>  
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)at
>  
> org.apache.spark.storage.BlockManager.maybeCacheDiskValuesInMemory(BlockManager.scala:1312)at
>  
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:612)at
>  
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:214)at
>  scala.Option.getOrElse(Option.scala:121)at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)at
>  org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)at
>  
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)at
>  
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)at
>  
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)at
>  org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:119)at
>  
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:259)at
>  
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:102)at
>  
> org.apache.spark.sql.execution.CodegenSupport$class.constructDoConsumeFunction(WholeStageCodegenExec.scala:216)at
>  
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:187)at
>  
> org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:85)at
>  
> org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:206)at
>  
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:189)at
>  
> org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:374)at
>  
> org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:403)at
>  
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.

[jira] [Updated] (SPARK-30805) Failed to get locally stored broadcast data: broadcast_30

2020-02-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30805:
-
Priority: Major  (was: Blocker)

> Failed to get locally stored broadcast data: broadcast_30
> -
>
> Key: SPARK-30805
> URL: https://issues.apache.org/jira/browse/SPARK-30805
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: spark 2.4.4
>Reporter: Jiasi
>Priority: Major
>
> the stack trace is below:
>  
> {quote}20/02/08 04:56:30 ERROR Utils: Exception 
> encounteredjava.io.EOFExceptionat 
> java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:3049)at
>  java.io.ObjectInputStream.readFully(ObjectInputStream.java:1084)at 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation$$anonfun$readExternal$1$$anonfun$apply$mcV$sp$11.apply(HashedRelation.scala:216)at
>  
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation$$anonfun$readExternal$1$$anonfun$apply$mcV$sp$11.apply(HashedRelation.scala:216)at
>  
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation.org$apache$spark$sql$execution$joins$UnsafeHashedRelation$$read(HashedRelation.scala:259)at
>  
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation$$anonfun$readExternal$1.apply$mcV$sp(HashedRelation.scala:216)at
>  
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation$$anonfun$readExternal$1.apply(HashedRelation.scala:216)at
>  
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation$$anonfun$readExternal$1.apply(HashedRelation.scala:216)at
>  org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)at 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation.readExternal(HashedRelation.scala:215)at
>  java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:2062)at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2011)at 
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)at 
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)at
>  
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:168)at
>  org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)at
>  
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)at
>  
> org.apache.spark.storage.BlockManager.maybeCacheDiskValuesInMemory(BlockManager.scala:1312)at
>  
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:612)at
>  
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:214)at
>  scala.Option.getOrElse(Option.scala:121)at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)at
>  org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)at
>  
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)at
>  
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)at
>  
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)at
>  org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:119)at
>  
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:259)at
>  
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:102)at
>  
> org.apache.spark.sql.execution.CodegenSupport$class.constructDoConsumeFunction(WholeStageCodegenExec.scala:216)at
>  
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:187)at
>  
> org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:85)at
>  
> org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:206)at
>  
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:189)at
>  
> org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:374)at
>  
> org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:403)at
>  
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:90)at
>  
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)at
>  
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(Spark

[jira] [Assigned] (SPARK-30801) Subqueries should not be AQE-ed if main query is not

2020-02-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30801:
---

Assignee: Wei Xue

> Subqueries should not be AQE-ed if main query is not
> 
>
> Key: SPARK-30801
> URL: https://issues.apache.org/jira/browse/SPARK-30801
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Major
>
> Currently there are queries that AQE does not support, e.g., queries that contain 
> DPP filters. But if the main query is unsupported while the subquery is supported, 
> the subquery itself will still be AQE-ed, which can lead to performance regressions 
> due to the missed opportunity for subquery reuse.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30801) Subqueries should not be AQE-ed if main query is not

2020-02-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30801.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27554
[https://github.com/apache/spark/pull/27554]

> Subqueries should not be AQE-ed if main query is not
> 
>
> Key: SPARK-30801
> URL: https://issues.apache.org/jira/browse/SPARK-30801
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently there are queries that AQE does not support, e.g., queries that contain 
> DPP filters. But if the main query is unsupported while the subquery is supported, 
> the subquery itself will still be AQE-ed, which can lead to performance regressions 
> due to the missed opportunity for subquery reuse.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30799) "spark_catalog.t" should not be resolved to temp view

2020-02-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-30799:

Summary: "spark_catalog.t" should not be resolved to temp view  (was: 
CatalogAndIdentifier shouldn't return wrong namespace)

> "spark_catalog.t" should not be resolved to temp view
> -
>
> Key: SPARK-30799
> URL: https://issues.apache.org/jira/browse/SPARK-30799
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30825) Add documents information for window function.

2020-02-13 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-30825:
---
Summary: Add documents information for window function.  (was: Add since 
information for window function.)

> Add documents information for window function.
> --
>
> Key: SPARK-30825
> URL: https://issues.apache.org/jira/browse/SPARK-30825
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> None of the window functions have a start ('since') version documented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30825) Add since information for window function.

2020-02-13 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036634#comment-17036634
 ] 

jiaan.geng commented on SPARK-30825:


I'm working on it.

> Add since information for window function.
> --
>
> Key: SPARK-30825
> URL: https://issues.apache.org/jira/browse/SPARK-30825
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> None of the window functions have a start ('since') version documented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30825) Add since information for window function.

2020-02-13 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-30825:
--

 Summary: Add since information for window function.
 Key: SPARK-30825
 URL: https://issues.apache.org/jira/browse/SPARK-30825
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: jiaan.geng


None of the window functions have a start ('since') version documented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30824) Support submit sql content only contains comments.

2020-02-13 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-30824:
--

 Summary: Support submit sql content only contains comments.
 Key: SPARK-30824
 URL: https://issues.apache.org/jira/browse/SPARK-30824
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: jiaan.geng


Spark SQL cannot accept SQL input that contains only comments.

PostgreSQL accepts comment-only input.

We may need to resolve this issue.
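
For concreteness, this is what comment-only input means here (a hypothetical snippet; 
per the description, Spark SQL currently rejects it with a parse error while PostgreSQL 
accepts it):

{code:scala}
// Submitting input that contains nothing but a comment.
spark.sql("-- this statement is only a comment")
{code}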



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30667) Support simple all gather in barrier task context

2020-02-13 Thread Xiangrui Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-30667.
---
Resolution: Fixed

> Support simple all gather in barrier task context
> -
>
> Key: SPARK-30667
> URL: https://issues.apache.org/jira/browse/SPARK-30667
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Sarth Frey
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks 
> can see all IP addresses from BarrierTaskContext. It would be simpler to 
> integrate with distributed frameworks like TensorFlow DistributionStrategy if 
> we provide all gather that can let tasks share additional information with 
> others, e.g., an available port.
> Note that with all gather, tasks share their IP addresses as well.
> {code}
> port = ... # get an available port
> ports = context.all_gather(port) # get all available ports, ordered by task ID
> ...  # set up distributed training service
> {code}
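
A slightly fuller sketch of the usage above in Scala, assuming the 
allGather(message: String): Array[String] method this ticket proposes on 
BarrierTaskContext (rdd is any RDD run in barrier mode; the free-port lookup is 
illustrative only):

{code:scala}
import java.net.ServerSocket
import org.apache.spark.BarrierTaskContext

rdd.barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  // Grab an ephemeral port; note it could be taken again before the service binds to it.
  val port = { val s = new ServerSocket(0); try s.getLocalPort finally s.close() }

  // Every task contributes its port and receives all ports, ordered by task ID.
  val ports = ctx.allGather(port.toString).map(_.toInt)

  // ... set up the distributed training service using `ports` ...
  iter
}
{code}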



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30667) Support simple all gather in barrier task context

2020-02-13 Thread Xiangrui Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-30667:
-

Assignee: Sarth Frey

> Support simple all gather in barrier task context
> -
>
> Key: SPARK-30667
> URL: https://issues.apache.org/jira/browse/SPARK-30667
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Sarth Frey
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks 
> can see all IP addresses from BarrierTaskContext. It would be simpler to 
> integrate with distributed frameworks like TensorFlow DistributionStrategy if 
> we provide all gather that can let tasks share additional information with 
> others, e.g., an available port.
> Note that with all gather, tasks share their IP addresses as well.
> {code}
> port = ... # get an available port
> ports = context.all_gather(port) # get all available ports, ordered by task ID
> ...  # set up distributed training service
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30667) Support simple all gather in barrier task context

2020-02-13 Thread Xiangrui Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-30667:
--
Target Version/s: 3.0.0  (was: 3.1.0)

> Support simple all gather in barrier task context
> -
>
> Key: SPARK-30667
> URL: https://issues.apache.org/jira/browse/SPARK-30667
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks 
> can see all IP addresses from BarrierTaskContext. It would be simpler to 
> integrate with distributed frameworks like TensorFlow DistributionStrategy if 
> we provide all gather that can let tasks share additional information with 
> others, e.g., an available port.
> Note that with all gather, tasks share their IP addresses as well.
> {code}
> port = ... # get an available port
> ports = context.all_gather(port) # get all available ports, ordered by task ID
> ...  # set up distributed training service
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30667) Support simple all gather in barrier task context

2020-02-13 Thread Xiangrui Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-30667:
--
Fix Version/s: 3.0.0

> Support simple all gather in barrier task context
> -
>
> Key: SPARK-30667
> URL: https://issues.apache.org/jira/browse/SPARK-30667
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks 
> can see all IP addresses from BarrierTaskContext. It would be simpler to 
> integrate with distributed frameworks like TensorFlow DistributionStrategy if 
> we provide all gather that can let tasks share additional information with 
> others, e.g., an available port.
> Note that with all gather, tasks share their IP addresses as well.
> {code}
> port = ... # get an available port
> ports = context.all_gather(port) # get all available ports, ordered by task ID
> ...  # set up distributed training service
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30798) Scope Session.active in QueryExecution

2020-02-13 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-30798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-30798.
---
Fix Version/s: 3.0.0
 Assignee: Ali Afroozeh
   Resolution: Fixed

> Scope Session.active in QueryExecution
> --
>
> Key: SPARK-30798
> URL: https://issues.apache.org/jira/browse/SPARK-30798
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Assignee: Ali Afroozeh
>Priority: Major
> Fix For: 3.0.0
>
>
> SparkSession.active is a thread local variable that points to the current 
> thread's spark session. It is important to note that the SQLConf.get method 
> depends on SparkSession.active. In the current implementation it is possible 
> that SparkSession.active points to a different session which causes various 
> problems. Most of these problems arise because part of the query processing 
> is done using the configurations of a different session. For example, when 
> creating a data frame using a new session, i.e., session.sql("..."), part of 
> the data frame is constructed using the currently active spark session, which 
> can be a different session from the one used later for processing the query.
> This PR scopes SparkSession.active to prevent the above-mentioned problems. A 
> new method, withActive is introduced on SparkSession that restores the 
> previous spark session after the block of code is executed.
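
A minimal sketch of the withActive pattern described above (not the actual Spark 
implementation), using the public active-session accessors:

{code:scala}
import org.apache.spark.sql.SparkSession

// Run `block` with `session` as the active SparkSession, then restore whatever was active before.
def withActive[T](session: SparkSession)(block: => T): T = {
  val previous = SparkSession.getActiveSession
  SparkSession.setActiveSession(session)
  try block
  finally previous match {
    case Some(s) => SparkSession.setActiveSession(s)
    case None    => SparkSession.clearActiveSession()
  }
}
{code}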



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30823) %PYTHONPATH% not set in python/docs/make2.bat, resulting in failed/wrong documentation builds

2020-02-13 Thread David Toneian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Toneian updated SPARK-30823:
--
Description: 
When building the PySpark documentation on Windows, by changing directory to 
{{python/docs}} and running {{make.bat}} (which runs {{make2.bat}}), the 
majority of the documentation may not be built if {{pyspark}} is not in the 
default {{%PYTHONPATH%}}. Sphinx then reports that {{pyspark}} (and possibly 
dependencies) cannot be imported.

If {{pyspark}} is in the default {{%PYTHONPATH%}}, I suppose it is that version 
of {{pyspark}} – as opposed to the version found above the {{python/docs}} 
directory – that is considered when building the documentation, which may 
result in documentation that does not correspond to the development version one 
is trying to build.

{{python/docs/Makefile}} avoids this issue by setting
 ??export PYTHONPATH=$(realpath ..):$(realpath ../lib/py4j-0.10.8.1-src.zip)??
 on line 10, but {{make2.bat}} does no such thing. The fix consists of adding
 ??set PYTHONPATH=..;..\lib\py4j-0.10.8.1-src.zip??
 to {{make2.bat}}.

See [GitHub PR #27569|https://github.com/apache/spark/pull/27569].

  was:
When building the PySpark documentation on Windows, by changing directory to 
{{python/docs}} and running {{make.bat}} (which runs {{make2.bat}}), the 
majority of the documentation may not be built if {{pyspark}} is not in the 
default {{%PYTHONPATH%}}. Sphinx then reports that {{pyspark}} (and possibly 
dependencies) cannot be imported.

If {{pyspark}} is in the default {{%PYTHONPATH%}}, I suppose it is that version 
of {{pyspark}} – as opposed to the version found above the {{python/docs}} 
directory – that is considered when building the documentation, which may 
result in documentation that does not correspond to the development version one 
is trying to build.

{{python/docs/Makefile}} avoids this issue by setting
 ??export PYTHONPATH=$(realpath ..):$(realpath ../lib/py4j-0.10.8.1-src.zip)??
 on line 10, but {{make2.bat}} does no such thing. The fix consists of adding
 ??set PYTHONPATH=..;..\lib\py4j-0.10.8.1-src.zip??
 to {{make2.bat}}.

I will open a GitHub PR shortly.


> %PYTHONPATH% not set in python/docs/make2.bat, resulting in failed/wrong 
> documentation builds
> -
>
> Key: SPARK-30823
> URL: https://issues.apache.org/jira/browse/SPARK-30823
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark, Windows
>Affects Versions: 2.4.5
> Environment: Tested on Windows 10.
>Reporter: David Toneian
>Priority: Minor
>
> When building the PySpark documentation on Windows, by changing directory to 
> {{python/docs}} and running {{make.bat}} (which runs {{make2.bat}}), the 
> majority of the documentation may not be built if {{pyspark}} is not in the 
> default {{%PYTHONPATH%}}. Sphinx then reports that {{pyspark}} (and possibly 
> dependencies) cannot be imported.
> If {{pyspark}} is in the default {{%PYTHONPATH%}}, I suppose it is that 
> version of {{pyspark}} – as opposed to the version found above the 
> {{python/docs}} directory – that is considered when building the 
> documentation, which may result in documentation that does not correspond to 
> the development version one is trying to build.
> {{python/docs/Makefile}} avoids this issue by setting
>  ??export PYTHONPATH=$(realpath ..):$(realpath ../lib/py4j-0.10.8.1-src.zip)??
>  on line 10, but {{make2.bat}} does no such thing. The fix consists of adding
>  ??set PYTHONPATH=..;..\lib\py4j-0.10.8.1-src.zip??
>  to {{make2.bat}}.
> See [GitHub PR #27569|https://github.com/apache/spark/pull/27569].
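
One quick way to check which {{pyspark}} a docs build would actually pick up 
(a generic Python check, not part of the Spark build scripts):

{code:python}
# Run from the python/docs directory before building: if the printed path does
# not point at the sibling development tree (../pyspark), the docs would be
# built against whatever pyspark happens to be on the default %PYTHONPATH%.
import pyspark
print(pyspark.__file__)
{code}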



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30823) %PYTHONPATH% not set in python/docs/make2.bat, resulting in failed/wrong documentation builds

2020-02-13 Thread David Toneian (Jira)
David Toneian created SPARK-30823:
-

 Summary: %PYTHONPATH% not set in python/docs/make2.bat, resulting 
in failed/wrong documentation builds
 Key: SPARK-30823
 URL: https://issues.apache.org/jira/browse/SPARK-30823
 Project: Spark
  Issue Type: Bug
  Components: Documentation, PySpark, Windows
Affects Versions: 2.4.5
 Environment: Tested on Windows 10.
Reporter: David Toneian


When building the PySpark documentation on Windows, by changing directory to 
{{python/docs}} and running {{make.bat}} (which runs {{make2.bat}}), the 
majority of the documentation may not be built if {{pyspark}} is not in the 
default {{%PYTHONPATH%}}. Sphinx then reports that {{pyspark}} (and possibly 
dependencies) cannot be imported.

If {{pyspark}} is in the default {{%PYTHONPATH%}}, I suppose it is that version 
of {{pyspark}} – as opposed to the version found above the {{python/docs}} 
directory – that is considered when building the documentation, which may 
result in documentation that does not correspond to the development version one 
is trying to build.

{{python/docs/Makefile}} avoids this issue by setting
 ??export PYTHONPATH=$(realpath ..):$(realpath ../lib/py4j-0.10.8.1-src.zip)??
 on line 10, but {{make2.bat}} does no such thing. The fix consists of adding
 ??set PYTHONPATH=..;..\lib\py4j-0.10.8.1-src.zip??
 to {{make2.bat}}.

I will open a GitHub PR shortly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30821) Executor pods with multiple containers will not be rescheduled unless all containers fail

2020-02-13 Thread Kevin Hogeland (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Hogeland updated SPARK-30821:
---
Description: Since the restart policy of launched pods is Never, additional 
handling is required for pods that may have sidecar containers. The executor 
should be considered failed if any containers have terminated and have a 
non-zero exit code, but Spark currently only checks the pod phase. The pod 
phase will remain "running" as long as _any_ containers are still running. Kubernetes 
sidecar support in 1.18/1.19 does not address this situation, as sidecar 
containers are excluded from pod phase calculation.  (was: Since the restart 
policy of launched pods is Never, additional handling is required for pods that 
may have sidecar containers that need to restart on failure. Kubernetes sidecar 
support in 1.18/1.19 does _not_ address this situation (unlike 
[SPARK-28887|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-28887]),
 as sidecar containers are excluded from pod phase calculation.

The pod snapshot should be considered "PodFailed" if the restart policy is 
Never and any container has a non-zero exit code.

(This is arguably a duplicate of SPARK-28887, but that issue is specifically 
for when the executor process fails))

> Executor pods with multiple containers will not be rescheduled unless all 
> containers fail
> -
>
> Key: SPARK-30821
> URL: https://issues.apache.org/jira/browse/SPARK-30821
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Kevin Hogeland
>Priority: Major
>
> Since the restart policy of launched pods is Never, additional handling is 
> required for pods that may have sidecar containers. The executor should be 
> considered failed if any containers have terminated and have a non-zero exit 
> code, but Spark currently only checks the pod phase. The pod phase will 
> remain "running" as long as _any_ containers are still running. Kubernetes sidecar 
> support in 1.18/1.19 does not address this situation, as sidecar containers 
> are excluded from pod phase calculation.
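
An illustrative sketch of the proposed check (plain Python over the standard 
Kubernetes pod fields, not Spark's actual code): with restartPolicy Never, a 
terminated container with a non-zero exit code should mark the executor as 
failed even while the pod phase still reads "Running".

{code:python}
def executor_has_failed(pod):
    """pod: a dict shaped like the Kubernetes Pod API object."""
    status = pod.get("status", {})
    if pod.get("spec", {}).get("restartPolicy") != "Never":
        return status.get("phase") == "Failed"
    for cs in status.get("containerStatuses", []):
        terminated = cs.get("state", {}).get("terminated")
        if terminated and terminated.get("exitCode", 0) != 0:
            return True              # any failed container fails the executor
    return status.get("phase") == "Failed"
{code}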



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30821) Executor pods with multiple containers will not be rescheduled unless all containers fail

2020-02-13 Thread Kevin Hogeland (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Hogeland updated SPARK-30821:
---
Summary: Executor pods with multiple containers will not be rescheduled 
unless all containers fail  (was: Sidecar containers in executor/driver may 
fail silently)

> Executor pods with multiple containers will not be rescheduled unless all 
> containers fail
> -
>
> Key: SPARK-30821
> URL: https://issues.apache.org/jira/browse/SPARK-30821
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Kevin Hogeland
>Priority: Major
>
> Since the restart policy of launched pods is Never, additional handling is 
> required for pods that may have sidecar containers that need to restart on 
> failure. Kubernetes sidecar support in 1.18/1.19 does _not_ address this 
> situation (unlike 
> [SPARK-28887|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-28887]),
>  as sidecar containers are excluded from pod phase calculation.
> The pod snapshot should be considered "PodFailed" if the restart policy is 
> Never and any container has a non-zero exit code.
> (This is arguably a duplicate of SPARK-28887, but that issue is specifically 
> for when the executor process fails)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30821) Sidecar containers in executor/driver may fail silently

2020-02-13 Thread Kevin Hogeland (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Hogeland updated SPARK-30821:
---
Description: 
Since the restart policy of launched pods is Never, additional handling is 
required for pods that may have sidecar containers that need to restart on 
failure. Kubernetes sidecar support in 1.18/1.19 does _not_ address this 
situation (unlike 
[SPARK-28887|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-28887]),
 as sidecar containers are excluded from pod phase calculation.

The pod snapshot should be considered "PodFailed" if the restart policy is 
Never and any container has a non-zero exit code.

(This is arguably a duplicate of SPARK-28887, but that issue is specifically 
for when the executor process fails)

  was:
Since the restart policy of launched pods is Never, additional handling is 
required for pods that may have sidecar containers that need to restart on 
failure. Kubernetes sidecar support in 1.18/1.19 does _not_ address this 
situation (unlike 
[SPARK-28887|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-28887]),
 as sidecar containers are excluded from pod phase calculation.

The pod snapshot should be considered "PodFailed" if the restart policy is 
Never and any container has a non-zero exit code.


> Sidecar containers in executor/driver may fail silently
> ---
>
> Key: SPARK-30821
> URL: https://issues.apache.org/jira/browse/SPARK-30821
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Kevin Hogeland
>Priority: Major
>
> Since the restart policy of launched pods is Never, additional handling is 
> required for pods that may have sidecar containers that need to restart on 
> failure. Kubernetes sidecar support in 1.18/1.19 does _not_ address this 
> situation (unlike 
> [SPARK-28887|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-28887]),
>  as sidecar containers are excluded from pod phase calculation.
> The pod snapshot should be considered "PodFailed" if the restart policy is 
> Never and any container has a non-zero exit code.
> (This is arguably a duplicate of SPARK-28887, but that issue is specifically 
> for when the executor process fails)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30822) Pyspark queries fail if terminated with a semicolon

2020-02-13 Thread Samuel Setegne (Jira)
Samuel Setegne created SPARK-30822:
--

 Summary: Pyspark queries fail if terminated with a semicolon
 Key: SPARK-30822
 URL: https://issues.apache.org/jira/browse/SPARK-30822
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Samuel Setegne


When a user submits a directly executable SQL statement terminated with a 
semicolon, they receive an `org.apache.spark.sql.catalyst.parser.ParseException` 
of `mismatched input ";"`. SQL-92 describes a direct SQL statement as a directly 
executable statement followed by a semicolon, and the majority of SQL 
implementations either require the semicolon as a statement terminator or make 
it optional (that is, they do not raise an exception when it is included, 
seemingly in recognition that it is common behavior).
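
A minimal reproduction sketch (on an affected build):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("SELECT 1").show()    # works
spark.sql("SELECT 1;").show()   # currently fails: ParseException, mismatched input ';'
{code}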



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30821) Sidecar containers in executor/driver may fail silently

2020-02-13 Thread Kevin Hogeland (Jira)
Kevin Hogeland created SPARK-30821:
--

 Summary: Sidecar containers in executor/driver may fail silently
 Key: SPARK-30821
 URL: https://issues.apache.org/jira/browse/SPARK-30821
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.1.0
Reporter: Kevin Hogeland


Since the restart policy of launched pods is Never, additional handling is 
required for pods that may have sidecar containers that need to restart on 
failure. Kubernetes sidecar support in 1.18/1.19 does _not_ address this 
situation (unlike 
[SPARK-28887|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-28887]),
 as sidecar containers are excluded from pod phase calculation.

The pod snapshot should be considered "PodFailed" if the restart policy is 
Never and any container has a non-zero exit code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30816) Fix dev-run-integration-tests.sh to ignore empty param correctly

2020-02-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30816.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27566
[https://github.com/apache/spark/pull/27566]

> Fix dev-run-integration-tests.sh to ignore empty param correctly
> 
>
> Key: SPARK-30816
> URL: https://issues.apache.org/jira/browse/SPARK-30816
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Trivial
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30816) Fix dev-run-integration-tests.sh to ignore empty param correctly

2020-02-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30816:
-

Assignee: Dongjoon Hyun

> Fix dev-run-integration-tests.sh to ignore empty param correctly
> 
>
> Key: SPARK-30816
> URL: https://issues.apache.org/jira/browse/SPARK-30816
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30820) Add FMClassifier to SparkR

2020-02-13 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-30820:
--

 Summary: Add FMClassifier to SparkR
 Key: SPARK-30820
 URL: https://issues.apache.org/jira/browse/SPARK-30820
 Project: Spark
  Issue Type: Sub-task
  Components: ML, SparkR
Affects Versions: 3.0.0, 3.1.0
Reporter: Maciej Szymkiewicz


Spark should provide a wrapper for {{o.a.s.ml.classification.FMClassifier}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30819) Add FMRegressor wrapper to SparkR

2020-02-13 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-30819:
--

 Summary: Add FMRegressor wrapper to SparkR
 Key: SPARK-30819
 URL: https://issues.apache.org/jira/browse/SPARK-30819
 Project: Spark
  Issue Type: Sub-task
  Components: ML, SparkR
Affects Versions: 3.0.0, 3.1.0
Reporter: Maciej Szymkiewicz


Spark should provide a wrapper for {{o.a.s.ml.regression.FMRegressor}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30807) Support JDK11 in K8S integration test

2020-02-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30807.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27559
[https://github.com/apache/spark/pull/27559]

> Support JDK11 in K8S integration test
> -
>
> Key: SPARK-30807
> URL: https://issues.apache.org/jira/browse/SPARK-30807
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30818) Add LinearRegression wrapper to SparkR

2020-02-13 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-30818:
--

 Summary: Add LinearRegression wrapper to SparkR
 Key: SPARK-30818
 URL: https://issues.apache.org/jira/browse/SPARK-30818
 Project: Spark
  Issue Type: Sub-task
  Components: ML, SparkR
Affects Versions: 3.0.0, 3.1.0
Reporter: Maciej Szymkiewicz


Spark should provide a wrapper for {{o.a.s.ml.regression.LinearRegression}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30817) SparkR ML algorithms parity

2020-02-13 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-30817:
--

 Summary: SparkR ML algorithms parity 
 Key: SPARK-30817
 URL: https://issues.apache.org/jira/browse/SPARK-30817
 Project: Spark
  Issue Type: Umbrella
  Components: ML, SparkR
Affects Versions: 3.0.0, 3.1.0
Reporter: Maciej Szymkiewicz


As of 3.0 the following algorithms are missing from SparkR

* {{LinearRegression}} 
* {{FMRegressor}} (Added to ML in 3.0)
* {{FMClassifier}} (Added to ML in 3.0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30751) Combine the skewed readers into one in AQE skew join optimizations

2020-02-13 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-30751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-30751.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> Combine the skewed readers into one in AQE skew join optimizations
> --
>
> Key: SPARK-30751
> URL: https://issues.apache.org/jira/browse/SPARK-30751
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> Assume we have N partitions based on the original join keys, and for a 
> specific partition id {{Pi}} (i = 1 to N), we slice the left partition into 
> {{Li}} sub-partitions (L = 1 if no skew; L > 1 if skewed), the right 
> partition into {{Mi}} sub-partitions (M = 1 if no skew; M > 1 if skewed). 
> With the current approach, we’ll end up with a sum of {{Li}} * {{Mi}} (i = 1 
> to N where Li > 1 or Mi > 1) plus one (for the rest of the partitions without 
> skew) joins. *This can be a serious performance concern as the size of the 
> query plan now depends on the number and size of skewed partitions.*
> Now instead of generating so many joins we can create a “repeated” reader for 
> either side of the join so that:
>  # for the left side, with each partition id Pi and any given slice {{Sj}} in 
> {{Pi}} (j = 1 to Li), it generates {{Mi}} repeated partitions with respective 
> join keys as {{PiSjT1}}, {{PiSjT2}}, …, {{PiSjTm}}
>  # for the right side, with each partition id Pi and any given slice {{Tk}} 
> in {{Pi}} (k = 1 to Mi), it generates {{Li}} repeated partitions with 
> respective join keys as {{PiS1Tk}}, {{PiS2Tk}}, …, {{PiSlTk}}
> That way, we can have one SMJ for all the partitions and only one type of 
> special reader.
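
A back-of-the-envelope sketch of why the per-slice approach blows up (the 
slice counts below are made up for illustration):

{code:python}
# (Li, Mi) slice counts for the skewed partitions only; everything else is
# covered by the single "rest of the partitions" join.
skewed = [(4, 3), (5, 2), (6, 1)]

old_joins = sum(l * m for l, m in skewed) + 1   # 12 + 10 + 6 + 1 = 29 joins
new_joins = 1                                   # one SMJ over repeated readers
print(old_joins, new_joins)
{code}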



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30751) Combine the skewed readers into one in AQE skew join optimizations

2020-02-13 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-30751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell reassigned SPARK-30751:
-

Assignee: Wenchen Fan

> Combine the skewed readers into one in AQE skew join optimizations
> --
>
> Key: SPARK-30751
> URL: https://issues.apache.org/jira/browse/SPARK-30751
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wenchen Fan
>Priority: Major
>
> Assume we have N partitions based on the original join keys, and for a 
> specific partition id {{Pi}} (i = 1 to N), we slice the left partition into 
> {{Li}} sub-partitions (L = 1 if no skew; L > 1 if skewed), the right 
> partition into {{Mi}} sub-partitions (M = 1 if no skew; M > 1 if skewed). 
> With the current approach, we’ll end up with a sum of {{Li}} * {{Mi}} (i = 1 
> to N where Li > 1 or Mi > 1) plus one (for the rest of the partitions without 
> skew) joins. *This can be a serious performance concern as the size of the 
> query plan now depends on the number and size of skewed partitions.*
> Now instead of generating so many joins we can create a “repeated” reader for 
> either side of the join so that:
>  # for the left side, with each partition id Pi and any given slice {{Sj}} in 
> {{Pi}} (j = 1 to Li), it generates {{Mi}} repeated partitions with respective 
> join keys as {{PiSjT1}}, {{PiSjT2}}, …, {{PiSjTm}}
>  # for the right side, with each partition id Pi and any given slice {{Tk}} 
> in {{Pi}} (k = 1 to Mi), it generates {{Li}} repeated partitions with 
> respective join keys as {{PiS1Tk}}, {{PiS2Tk}}, …, {{PiSlTk}}
> That way, we can have one SMJ for all the partitions and only one type of 
> special reader.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30814) Add Columns references should be able to resolve each other

2020-02-13 Thread Terry Kim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036436#comment-17036436
 ] 

Terry Kim commented on SPARK-30814:
---

Yea, I can work on this.

> Add Columns references should be able to resolve each other
> ---
>
> Key: SPARK-30814
> URL: https://issues.apache.org/jira/browse/SPARK-30814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Major
>
> In ResolveAlterTableChanges, we have checks that make sure that positional 
> arguments exist and are normalized around case sensitivity for ALTER TABLE 
> ADD COLUMNS. However, we missed the case where a column in ADD COLUMNS can 
> depend on the position of a column that is just being added.
> For example for the schema:
> {code:java}
> root:
>   - a: string
>   - b: long
>  {code}
>  
> The following should work:
> {code:java}
> ALTER TABLE ... ADD COLUMNS (x int AFTER a, y int AFTER x) {code}
> Currently, the above statement will throw an error saying that AFTER x cannot 
> be resolved, because x doesn't exist yet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30816) Fix dev-run-integration-tests.sh to ignore empty param correctly

2020-02-13 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-30816:
-

 Summary: Fix dev-run-integration-tests.sh to ignore empty param 
correctly
 Key: SPARK-30816
 URL: https://issues.apache.org/jira/browse/SPARK-30816
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, Tests
Affects Versions: 2.4.5, 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30703) Add a documentation page for ANSI mode

2020-02-13 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-30703.

Resolution: Fixed

This issue is resolved by https://github.com/apache/spark/pull/27489

> Add a documentation page for ANSI mode
> --
>
> Key: SPARK-30703
> URL: https://issues.apache.org/jira/browse/SPARK-30703
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Takeshi Yamamuro
>Priority: Major
>
> ANSI mode is introduced in Spark 3.0. We need to clearly document the 
> behavior difference when spark.sql.ansi.enabled is on and off. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30815) Function to format timestamp with time zone

2020-02-13 Thread Enrico Minack (Jira)
Enrico Minack created SPARK-30815:
-

 Summary: Function to format timestamp with time zone
 Key: SPARK-30815
 URL: https://issues.apache.org/jira/browse/SPARK-30815
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Enrico Minack


Whenever timestamps are turned into strings (`Column.cast(StringType)`, 
`date_format(timestamp, format)`, `Dataset.show()`) the default time zone is 
used to format the string. This default time zone is either the java default 
zone `user.timezone`, the Spark default zone `spark.sql.session.timeZone` or 
the default DataFrameWriter zone `timeZone`. Currently there is no way to 
format a single column in a different timezone.
{code:java}
scala> spark.conf.set("spark.sql.session.timeZone", "Europe/London")
scala> spark.range(10). \
 select(concat(lit("2020-02-01 0"), $"id", lit(":00:00")).cast(TimestampType).as("time")). \
 select(
   $"time",
   date_format($"time", "yyyy-MM-dd HH:mm:ss Z").as("local"),
   date_format_tz($"time", "yyyy-MM-dd HH:mm:ss Z", "Europe/Berlin").as("Berlin")
 ). \
 show(false)
+-------------------+---------------------+--------------------------+
|time               |local                |Berlin                    |
+-------------------+---------------------+--------------------------+
|2020-02-01 00:00:00|2020-02-01 00:00:00 Z|2020-02-01 01:00:00 +01:00|
|2020-02-01 01:00:00|2020-02-01 01:00:00 Z|2020-02-01 02:00:00 +01:00|
|2020-02-01 02:00:00|2020-02-01 02:00:00 Z|2020-02-01 03:00:00 +01:00|
|2020-02-01 03:00:00|2020-02-01 03:00:00 Z|2020-02-01 04:00:00 +01:00|
|2020-02-01 04:00:00|2020-02-01 04:00:00 Z|2020-02-01 05:00:00 +01:00|
|2020-02-01 05:00:00|2020-02-01 05:00:00 Z|2020-02-01 06:00:00 +01:00|
|2020-02-01 06:00:00|2020-02-01 06:00:00 Z|2020-02-01 07:00:00 +01:00|
|2020-02-01 07:00:00|2020-02-01 07:00:00 Z|2020-02-01 08:00:00 +01:00|
|2020-02-01 08:00:00|2020-02-01 08:00:00 Z|2020-02-01 09:00:00 +01:00|
|2020-02-01 09:00:00|2020-02-01 09:00:00 Z|2020-02-01 10:00:00 +01:00|
+-------------------+---------------------+--------------------------+
{code}
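
For what it's worth, a hedged workaround sketch under existing PySpark APIs: 
chaining {{to_utc_timestamp}} and {{from_utc_timestamp}} shifts the wall-clock 
time from the session zone to the target zone before formatting. It only moves 
the wall clock, so it cannot render the {{+01:00}} offset that the proposed 
{{date_format_tz}} would produce.

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "Europe/London")

df = (spark.createDataFrame([("2020-02-01 00:00:00",)], ["s"])
      .select(F.to_timestamp("s").alias("time")))

df.select(
    F.date_format("time", "yyyy-MM-dd HH:mm:ss").alias("local"),
    F.date_format(
        # reinterpret the London wall clock as a Berlin wall clock
        F.from_utc_timestamp(F.to_utc_timestamp("time", "Europe/London"),
                             "Europe/Berlin"),
        "yyyy-MM-dd HH:mm:ss").alias("berlin_wall_clock"),
).show(truncate=False)
{code}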



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30703) Add a documentation page for ANSI mode

2020-02-13 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-30703:
--

Assignee: Takeshi Yamamuro

> Add a documentation page for ANSI mode
> --
>
> Key: SPARK-30703
> URL: https://issues.apache.org/jira/browse/SPARK-30703
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Takeshi Yamamuro
>Priority: Major
>
> ANSI mode is introduced in Spark 3.0. We need to clearly document the 
> behavior difference when spark.sql.ansi.enabled is on and off. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30814) Add Columns references should be able to resolve each other

2020-02-13 Thread Burak Yavuz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036418#comment-17036418
 ] 

Burak Yavuz commented on SPARK-30814:
-

cc [~cloud_fan] [~imback82], can we prioritize this over REPLACE COLUMNS if 
possible?

> Add Columns references should be able to resolve each other
> ---
>
> Key: SPARK-30814
> URL: https://issues.apache.org/jira/browse/SPARK-30814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Major
>
> In ResolveAlterTableChanges, we have checks that make sure that positional 
> arguments exist and are normalized around case sensitivity for ALTER TABLE 
> ADD COLUMNS. However, we missed the case where a column in ADD COLUMNS can 
> depend on the position of a column that is just being added.
> For example for the schema:
> {code:java}
> root:
>   - a: string
>   - b: long
>  {code}
>  
> The following should work:
> {code:java}
> ALTER TABLE ... ADD COLUMNS (x int AFTER a, y int AFTER x) {code}
> Currently, the above statement will throw an error saying that AFTER x cannot 
> be resolved, because x doesn't exist yet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30814) Add Columns references should be able to resolve each other

2020-02-13 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-30814:
---

 Summary: Add Columns references should be able to resolve each 
other
 Key: SPARK-30814
 URL: https://issues.apache.org/jira/browse/SPARK-30814
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


In ResolveAlterTableChanges, we have checks that make sure that positional 
arguments exist and are normalized around case sensitivity for ALTER TABLE ADD 
COLUMNS. However, we missed the case where a column in ADD COLUMNS can depend 
on the position of a column that is just being added.

For example for the schema:
{code:java}
root:
  - a: string
  - b: long
 {code}
 

The following should work:
{code:java}
ALTER TABLE ... ADD COLUMNS (x int AFTER a, y int AFTER x) {code}
Currently, the above statement will throw an error saying that AFTER x cannot 
be resolved, because x doesn't exist yet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30811) CTE that refers to non-existent table with same name causes StackOverflowError

2020-02-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30811:
--
Affects Version/s: 2.0.2

> CTE that refers to non-existent table with same name causes StackOverflowError
> --
>
> Key: SPARK-30811
> URL: https://issues.apache.org/jira/browse/SPARK-30811
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>
> The following query causes a StackOverflowError:
> {noformat}
> WITH t AS (SELECT 1 FROM nonexist.t) SELECT * FROM t
> {noformat}
> This only happens when the CTE refers to a non-existent table with the same 
> name and a database qualifier. This is caused by a couple of things:
>  * {{CTESubstitution}} runs analysis on the CTE, but this does not throw an 
> exception because the table has a database qualifier. The reason we don't 
> fail here is that we re-attempt to resolve the relation in a later rule.
>  * {{CTESubstitution}} replace logic does not check if the table it is 
> replacing has a database; it shouldn't replace the relation if it does. So 
> now we will happily replace {{nonexist.t}} with {{t}}.
>  * {{CTESubstitution}} transforms down, this means it will keep replacing 
> {{t}} with itself, creating an infinite recursion.
> This is not an issue for master/3.0.
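
A minimal PySpark sketch of the reproduction described above (assumes an 
affected 2.x build, where the call reportedly fails with a StackOverflowError 
instead of a normal analysis error):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# The CTE name "t" matches the non-existent, database-qualified table it reads
# from, which is what triggers the infinite CTESubstitution recursion on 2.x.
spark.sql("WITH t AS (SELECT 1 FROM nonexist.t) SELECT * FROM t").show()
{code}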



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30811) CTE that refers to non-existent table with same name causes StackOverflowError

2020-02-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30811:
--
Affects Version/s: 2.1.3

> CTE that refers to non-existent table with same name causes StackOverflowError
> --
>
> Key: SPARK-30811
> URL: https://issues.apache.org/jira/browse/SPARK-30811
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.5
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>
> The following query causes a StackOverflowError:
> {noformat}
> WITH t AS (SELECT 1 FROM nonexist.t) SELECT * FROM t
> {noformat}
> This only happens when the CTE refers to a non-existent table with the same 
> name and a database qualifier. This is caused by a couple of things:
>  * {{CTESubstitution}} runs analysis on the CTE, but this does not throw an 
> exception because the table has a database qualifier. The reason we don't 
> fail here is that we re-attempt to resolve the relation in a later rule.
>  * {{CTESubstitution}} replace logic does not check if the table it is 
> replacing has a database; it shouldn't replace the relation if it does. So 
> now we will happily replace {{nonexist.t}} with {{t}}.
>  * {{CTESubstitution}} transforms down, this means it will keep replacing 
> {{t}} with itself, creating an infinite recursion.
> This is not an issue for master/3.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30811) CTE that refers to non-existent table with same name causes StackOverflowError

2020-02-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30811:
--
Affects Version/s: 2.2.3

> CTE that refers to non-existent table with same name causes StackOverflowError
> --
>
> Key: SPARK-30811
> URL: https://issues.apache.org/jira/browse/SPARK-30811
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.5
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>
> The following query causes a StackOverflowError:
> {noformat}
> WITH t AS (SELECT 1 FROM nonexist.t) SELECT * FROM t
> {noformat}
> This only happens when the CTE refers to a non-existent table with the same 
> name and a database qualifier. This is caused by a couple of things:
>  * {{CTESubstitution}} runs analysis on the CTE, but this does not throw an 
> exception because the table has a database qualifier. The reason we don't 
> fail here is that we re-attempt to resolve the relation in a later rule.
>  * {{CTESubstitution}} replace logic does not check if the table it is 
> replacing has a database; it shouldn't replace the relation if it does. So 
> now we will happily replace {{nonexist.t}} with {{t}}.
>  * {{CTESubstitution}} transforms down, this means it will keep replacing 
> {{t}} with itself, creating an infinite recursion.
> This is not an issue for master/3.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30811) CTE that refers to non-existent table with same name causes StackOverflowError

2020-02-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30811:
--
Affects Version/s: 2.3.4

> CTE that refers to non-existent table with same name causes StackOverflowError
> --
>
> Key: SPARK-30811
> URL: https://issues.apache.org/jira/browse/SPARK-30811
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.5
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>
> The following query causes a StackOverflowError:
> {noformat}
> WITH t AS (SELECT 1 FROM nonexist.t) SELECT * FROM t
> {noformat}
> This only happens when the CTE refers to a non-existent table with the same 
> name and a database qualifier. This is caused by a couple of things:
>  * {{CTESubstitution}} runs analysis on the CTE, but this does not throw an 
> exception because the table has a database qualifier. The reason we don't 
> fail here is that we re-attempt to resolve the relation in a later rule.
>  * {{CTESubstitution}} replace logic does not check if the table it is 
> replacing has a database; it shouldn't replace the relation if it does. So 
> now we will happily replace {{nonexist.t}} with {{t}}.
>  * {{CTESubstitution}} transforms down, this means it will keep replacing 
> {{t}} with itself, creating an infinite recursion.
> This is not an issue for master/3.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30762) Add dtype="float32" support to vector_to_array UDF

2020-02-13 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu resolved SPARK-30762.

Target Version/s: 3.0.0, 3.1.0
  Resolution: Done

Resolved by https://github.com/apache/spark/pull/27522

> Add dtype="float32" support to vector_to_array UDF
> --
>
> Key: SPARK-30762
> URL: https://issues.apache.org/jira/browse/SPARK-30762
> Project: Spark
>  Issue Type: Story
>  Components: MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Liang Zhang
>Assignee: Liang Zhang
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Previous PR: 
> [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py]
> In the previous PR, we introduced a UDF to convert a column of MLlib Vectors 
> to a column of lists in python (Seq in scala). Currently, all the floating 
> numbers in a vector are converted to Double in scala. In this issue, we will 
> add a parameter in the python function {{vector_to_array(col)}} that allows 
> converting to Float (32bits) in scala, which would be mapped to a numpy array 
> of dtype=float32.
>  
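
A minimal sketch of the feature described above (assumes Spark 3.0+, where 
pyspark.ml.functions.vector_to_array accepts a dtype argument):

{code:python}
from pyspark.ml.functions import vector_to_array
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(Vectors.dense([1.0, 2.0, 3.0]),)], ["features"])

df.select(
    vector_to_array("features").alias("as_float64"),                 # default
    vector_to_array("features", dtype="float32").alias("as_float32"),
).show(truncate=False)
{code}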



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30762) Add dtype="float32" support to vector_to_array UDF

2020-02-13 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-30762:
---
Fix Version/s: (was: 3.1.0)
   (was: 3.0.0)

> Add dtype="float32" support to vector_to_array UDF
> --
>
> Key: SPARK-30762
> URL: https://issues.apache.org/jira/browse/SPARK-30762
> Project: Spark
>  Issue Type: Story
>  Components: MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Liang Zhang
>Assignee: Liang Zhang
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Previous PR: 
> [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py]
> In the previous PR, we introduced a UDF to convert a column of MLlib Vectors 
> to a column of lists in python (Seq in scala). Currently, all the floating 
> numbers in a vector are converted to Double in scala. In this issue, we will 
> add a parameter in the python function {{vector_to_array(col)}} that allows 
> converting to Float (32bits) in scala, which would be mapped to a numpy array 
> of dtype=float32.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30762) Add dtype="float32" support to vector_to_array UDF

2020-02-13 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-30762:
---
Fix Version/s: 3.1.0
   3.0.0

> Add dtype="float32" support to vector_to_array UDF
> --
>
> Key: SPARK-30762
> URL: https://issues.apache.org/jira/browse/SPARK-30762
> Project: Spark
>  Issue Type: Story
>  Components: MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Liang Zhang
>Assignee: Liang Zhang
>Priority: Major
> Fix For: 3.0.0, 3.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Previous PR: 
> [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py]
> In the previous PR, we introduced a UDF to convert a column of MLlib Vectors 
> to a column of lists in python (Seq in scala). Currently, all the floating 
> numbers in a vector are converted to Double in scala. In this issue, we will 
> add a parameter in the python function {{vector_to_array(col)}} that allows 
> converting to Float (32bits) in scala, which would be mapped to a numpy array 
> of dtype=float32.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28119) Cannot read environment variable inside custom property file

2020-02-13 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-28119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036325#comment-17036325
 ] 

MINET J-Sébastien edited comment on SPARK-28119 at 2/13/20 3:57 PM:


Seems that nobody cares ;) Anyway I found a workaround by shading *and relocating* 
the newer apache commons-configuration version (1.10 instead of 1.6) directly in 
my delivered jar

{code:xml}
 ...
<dependency>
  <groupId>commons-configuration</groupId>
  <artifactId>commons-configuration</artifactId>
  <version>1.10</version>
  <exclusions>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
    <exclusion>
      <groupId>commons-lang</groupId>
      <artifactId>commons-lang</artifactId>
    </exclusion>
  </exclusions>
</dependency>
 ...
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.commons.configuration</pattern>
            <shadedPattern>my.company.commons.configuration</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
 ...
{code}


was (Author: jsfm):
Seems that nobody cares ;) Anyway I found a workaround by shading the newer 
apache commons-configuration version (1.10 instead of 1.6) directly in my 
delivered jar

{code:xml}
 ...
<dependency>
  <groupId>commons-configuration</groupId>
  <artifactId>commons-configuration</artifactId>
  <version>1.10</version>
  <exclusions>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
    <exclusion>
      <groupId>commons-lang</groupId>
      <artifactId>commons-lang</artifactId>
    </exclusion>
  </exclusions>
</dependency>
 ...
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.commons.configuration</pattern>
            <shadedPattern>my.company.commons.configuration</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
 ...
{code}

> Cannot read environment variable inside custom property file
> 
>
> Key: SPARK-28119
> URL: https://issues.apache.org/jira/browse/SPARK-28119
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.3
> Environment: Linux ubuntu 16.10, spark standalone 2.4.3
>Reporter: MINET J-Sébastien
>Priority: Minor
>
> Spark is compiled with commons-configuration version 1.6 due to hadoop-client 
> library dependency
> {code:java}
> [INFO] | +- org.apache.hadoop:hadoop-client:jar:2.6.5:provided
> [INFO] | | +- org.apache.hadoop:hadoop-common:jar:2.6.5:provided
> [INFO] | | | +- commons-cli:commons-cli:jar:1.2:provided
> [INFO] | | | +- xmlenc:xmlenc:jar:0.52:provided
> [INFO] | | | +- commons-httpclient:commons-httpclient:jar:3.1:provided
> [INFO] | | | +- commons-io:commons-io:jar:2.4:provided
> [INFO] | | | +- commons-collections:commons-collections:jar:3.2.2:provided
> [INFO] | | | +- commons-configuration:commons-configuration:jar:1.6:provided
> {code}
> Here is my code
> {code:java}
> import org.apache.commons.configuration.ConfigurationException;
> import org.apache.commons.configuration.PropertiesConfiguration;
> import org.apache.spark.sql.SparkSession;
> public class SparkPropertyTest {
>public static void main(String... args) throws ConfigurationException {
>  SparkSession sp = SparkSession.builder().getOrCreate();
>  PropertiesConfiguration config = new PropertiesConfiguration();
>  String file = sp.sparkContext().getConf().get("spark.files");
>  sp.log().warn("Using property file {}", file);
>  config.load(file);
>  sp.log().warn(config.getString("env.path"));
>   }
> }
> {code}
> Here is the content added to *log4j.properties*
> {code:java}
> env.path=${env:PATH}
> {code}
> If I launch spark job with following vm options
> {code:java}
> -Dspark.master=local[2] -Dspark.files=src/main/resources/log4j.properties
> {code}
> I get the result where the environment variable is printed as is
> {code:java}
> 2019-06-20 07:09:03 WARN NativeCodeLoader:62 - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 2019-06-20 07:09:05 WARN SparkSession:11 - Using property file 
> src/main/resources/log4j.properties
> 2019-06-20 07:09:05 WARN SparkSession:13 - ${env:P

[jira] [Comment Edited] (SPARK-28119) Cannot read environment variable inside custom property file

2020-02-13 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-28119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036325#comment-17036325
 ] 

MINET J-Sébastien edited comment on SPARK-28119 at 2/13/20 3:56 PM:


Seems that nobody cares ;) Anyway I found a workaround by shading the newer 
apache commons-configuration version (1.10 instead of 1.6) directly in my 
delivered jar

{code:xml}
 ...
<dependency>
  <groupId>commons-configuration</groupId>
  <artifactId>commons-configuration</artifactId>
  <version>1.10</version>
  <exclusions>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
    <exclusion>
      <groupId>commons-lang</groupId>
      <artifactId>commons-lang</artifactId>
    </exclusion>
  </exclusions>
</dependency>
 ...
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.commons.configuration</pattern>
            <shadedPattern>my.company.commons.configuration</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
 ...
{code}


was (Author: jsfm):
Seems that nobody cares ;) Anyway I found a workaround by shading the newer 
apache commons-configuration version (1.10 instead of 1.6) directly in my 
delivered jar

{code:xml}
 ...
<dependency>
  <groupId>commons-configuration</groupId>
  <artifactId>commons-configuration</artifactId>
  <version>1.10</version>
  <exclusions>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
    <exclusion>
      <groupId>commons-lang</groupId>
      <artifactId>commons-lang</artifactId>
    </exclusion>
  </exclusions>
</dependency>
 ...
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.commons.configuration</pattern>
            <shadedPattern>my.company.commons.configuration</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
 ...
{code}

> Cannot read environment variable inside custom property file
> 
>
> Key: SPARK-28119
> URL: https://issues.apache.org/jira/browse/SPARK-28119
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.3
> Environment: Linux ubuntu 16.10, spark standalone 2.4.3
>Reporter: MINET J-Sébastien
>Priority: Minor
>
> Spark is compiled with commons-configuration version 1.6 due to hadoop-client 
> library dependency
> {code:java}
> [INFO] | +- org.apache.hadoop:hadoop-client:jar:2.6.5:provided
> [INFO] | | +- org.apache.hadoop:hadoop-common:jar:2.6.5:provided
> [INFO] | | | +- commons-cli:commons-cli:jar:1.2:provided
> [INFO] | | | +- xmlenc:xmlenc:jar:0.52:provided
> [INFO] | | | +- commons-httpclient:commons-httpclient:jar:3.1:provided
> [INFO] | | | +- commons-io:commons-io:jar:2.4:provided
> [INFO] | | | +- commons-collections:commons-collections:jar:3.2.2:provided
> [INFO] | | | +- commons-configuration:commons-configuration:jar:1.6:provided
> {code}
> Here is my code
> {code:java}
> import org.apache.commons.configuration.ConfigurationException;
> import org.apache.commons.configuration.PropertiesConfiguration;
> import org.apache.spark.sql.SparkSession;
> public class SparkPropertyTest {
>public static void main(String... args) throws ConfigurationException {
>  SparkSession sp = SparkSession.builder().getOrCreate();
>  PropertiesConfiguration config = new PropertiesConfiguration();
>  String file = sp.sparkContext().getConf().get("spark.files");
>  sp.log().warn("Using property file {}", file);
>  config.load(file);
>  sp.log().warn(config.getString("env.path"));
>   }
> }
> {code}
> Here is the content added to *log4j.properties*
> {code:java}
> env.path=${env:PATH}
> {code}
> If I launch spark job with following vm options
> {code:java}
> -Dspark.master=local[2] -Dspark.files=src/main/resources/log4j.properties
> {code}
> I get the result where the environment variable is printed as is
> {code:java}
> 2019-06-20 07:09:03 WARN NativeCodeLoader:62 - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 2019-06-20 07:09:05 WARN SparkSession:11 - Using property file 
> src/main/resources/log4j.properties
> 2019-06-20 07:09:05 WARN SparkSession:1

[jira] [Commented] (SPARK-28119) Cannot read environment variable inside custom property file

2020-02-13 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-28119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036325#comment-17036325
 ] 

MINET J-Sébastien commented on SPARK-28119:
---

Seems that nobody cares ;) Anyway I found a workaround by shading the newer 
apache commons-configuration version (1.10 instead of 1.6) directly in my 
delivered jar

{code:xml}
 ...
<dependency>
  <groupId>commons-configuration</groupId>
  <artifactId>commons-configuration</artifactId>
  <version>1.10</version>
  <exclusions>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
    <exclusion>
      <groupId>commons-lang</groupId>
      <artifactId>commons-lang</artifactId>
    </exclusion>
  </exclusions>
</dependency>
 ...
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.commons.configuration</pattern>
            <shadedPattern>my.company.commons.configuration</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
 ...
{code}

> Cannot read environment variable inside custom property file
> 
>
> Key: SPARK-28119
> URL: https://issues.apache.org/jira/browse/SPARK-28119
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.3
> Environment: Linux ubuntu 16.10, spark standalone 2.4.3
>Reporter: MINET J-Sébastien
>Priority: Minor
>
> Spark is compiled with commons-configuration version 1.6 due to hadoop-client 
> library dependency
> {code:java}
> [INFO] | +- org.apache.hadoop:hadoop-client:jar:2.6.5:provided
> [INFO] | | +- org.apache.hadoop:hadoop-common:jar:2.6.5:provided
> [INFO] | | | +- commons-cli:commons-cli:jar:1.2:provided
> [INFO] | | | +- xmlenc:xmlenc:jar:0.52:provided
> [INFO] | | | +- commons-httpclient:commons-httpclient:jar:3.1:provided
> [INFO] | | | +- commons-io:commons-io:jar:2.4:provided
> [INFO] | | | +- commons-collections:commons-collections:jar:3.2.2:provided
> [INFO] | | | +- commons-configuration:commons-configuration:jar:1.6:provided
> {code}
> Here is my code
> {code:java}
> import org.apache.commons.configuration.ConfigurationException;
> import org.apache.commons.configuration.PropertiesConfiguration;
> import org.apache.spark.sql.SparkSession;
> public class SparkPropertyTest {
>public static void main(String... args) throws ConfigurationException {
>  SparkSession sp = SparkSession.builder().getOrCreate();
>  PropertiesConfiguration config = new PropertiesConfiguration();
>  String file = sp.sparkContext().getConf().get("spark.files");
>  sp.log().warn("Using property file {}", file);
>  config.load(file);
>  sp.log().warn(config.getString("env.path"));
>   }
> }
> {code}
> Here is the content added to *log4j.properties*
> {code:java}
> env.path=${env:PATH}
> {code}
> If I launch spark job with following vm options
> {code:java}
> -Dspark.master=local[2] -Dspark.files=src/main/resources/log4j.properties
> {code}
> I get the result where the environment variable is printed as is
> {code:java}
> 2019-06-20 07:09:03 WARN NativeCodeLoader:62 - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 2019-06-20 07:09:05 WARN SparkSession:11 - Using property file 
> src/main/resources/log4j.properties
> 2019-06-20 07:09:05 WARN SparkSession:13 - ${env:PATH}
> {code}
> Now I update my pom.xml
> {code:xml}
> <dependency>
>   <groupId>commons-configuration</groupId>
>   <artifactId>commons-configuration</artifactId>
>   <version>1.10</version>
> </dependency>
> {code}
> So the new result is
> {code:java}
> 2019-06-20 07:09:40 WARN NativeCodeLoader:62 - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 2019-06-20 07:09:42 WARN SparkSession:11 - Using property file 
> src/main/resources/log4j.properties
> 2019-06-20 07:09:42 WARN SparkSession:13 - 
> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
> {code}
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30813) Matrices.sprand mistakes in comments

2020-02-13 Thread Xiaochang Wu (Jira)
Xiaochang Wu created SPARK-30813:


 Summary: Matrices.sprand mistakes in comments 
 Key: SPARK-30813
 URL: https://issues.apache.org/jira/browse/SPARK-30813
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 3.0.0
Reporter: Xiaochang Wu


/**
 * Generate a `SparseMatrix` consisting of `i.i.d.` *gaussian random* numbers.

>> *should be "uniform random" here*

* @param numRows number of rows of the matrix
 * @param numCols number of columns of the matrix
 * @param density the desired density for the matrix
 * @param rng a random number generator
 * @return `Matrix` with size `numRows` x `numCols` and values in U(0, 1)
 */
 @Since("2.0.0")
 def sprand(numRows: Int, numCols: Int, density: Double, rng: Random): Matrix =
 SparseMatrix.sprand(numRows, numCols, density, rng)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29231) Constraints should be inferred from cast equality constraint

2020-02-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29231.
-
Fix Version/s: 3.1.0
 Assignee: Yuming Wang
   Resolution: Fixed

> Constraints should be inferred from cast equality constraint
> 
>
> Key: SPARK-29231
> URL: https://issues.apache.org/jira/browse/SPARK-29231
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> How to reproduce:
> {code:scala}
> scala> spark.sql("create table t1(c11 int, c12 decimal) ")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("create table t2(c21 bigint, c22 decimal) ")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("select t1.*, t2.* from t1 left join t2 on t1.c11=t2.c21 
> where t1.c11=1").explain
> == Physical Plan ==
> SortMergeJoin [cast(c11#0 as bigint)], [c21#2L], LeftOuter
> :- *(2) Sort [cast(c11#0 as bigint) ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(cast(c11#0 as bigint), 200), true, [id=#30]
> : +- *(1) Filter (isnotnull(c11#0) AND (c11#0 = 1))
> :+- Scan hive default.t1 [c11#0, c12#1], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c11#0, 
> c12#1], Statistics(sizeInBytes=8.0 EiB)
> +- *(4) Sort [c21#2L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(c21#2L, 200), true, [id=#37]
>   +- *(3) Filter isnotnull(c21#2L)
>  +- Scan hive default.t2 [c21#2L, c22#3], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c21#2L, 
> c22#3], Statistics(sizeInBytes=8.0 EiB)
> {code}
> PostgreSQL supports this feature:
> {code:sql}
> postgres=# create table t1(c11 int4, c12 decimal);
> CREATE TABLE
> postgres=# create table t2(c21 int8, c22 decimal);
> CREATE TABLE
> postgres=# explain select t1.*, t2.* from t1 left join t2 on t1.c11=t2.c21 
> where t1.c11=1;
>QUERY PLAN
> 
>  Nested Loop Left Join  (cost=0.00..51.43 rows=36 width=76)
>Join Filter: (t1.c11 = t2.c21)
>->  Seq Scan on t1  (cost=0.00..25.88 rows=6 width=36)
>  Filter: (c11 = 1)
>->  Materialize  (cost=0.00..25.03 rows=6 width=40)
>  ->  Seq Scan on t2  (cost=0.00..25.00 rows=6 width=40)
>Filter: (c21 = 1)
> (7 rows)
> {code}
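
For clarity, a hedged sketch of the expected behaviour (comments only; this is
not actual optimizer output):

{code:scala}
// With constraint inference through Cast, the filter t1.c11 = 1 combined with
// the join condition cast(t1.c11 as bigint) = t2.c21 should let Spark derive
// t2.c21 = 1 and push it into the scan of t2, analogous to the
// "Filter: (c21 = 1)" line in the PostgreSQL plan above.
spark.sql(
  """SELECT t1.*, t2.*
    |FROM t1 LEFT JOIN t2 ON t1.c11 = t2.c21
    |WHERE t1.c11 = 1
    |""".stripMargin).explain(true)
{code}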



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30812) Revise boolean config name according to new config naming policy

2020-02-13 Thread wuyi (Jira)
wuyi created SPARK-30812:


 Summary: Revise boolean config name according to new config naming 
policy
 Key: SPARK-30812
 URL: https://issues.apache.org/jira/browse/SPARK-30812
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.0.0
Reporter: wuyi


config naming policy:
http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-naming-policy-of-Spark-configs-td28875.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30758) Spark SQL can't display bracketed comments well in generated golden files

2020-02-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30758.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27481
[https://github.com/apache/spark/pull/27481]

> Spark SQL can't display bracketed comments well in generated golden files
> -
>
> Key: SPARK-30758
> URL: https://issues.apache.org/jira/browse/SPARK-30758
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.0.0
>
>
> Although Spark SQL supports bracketed comments, {{SQLQueryTestSuite}} does not 
> handle them well, so the generated golden files cannot display bracketed 
> comments correctly.
> We can read the output of comments.sql
> [https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/results/postgreSQL/comments.sql.out]
> Such as:
>  
> {code:java}
> -- !query
> /* This is an example of SQL which should not execute:
>  * select 'multi-line'
> -- !query schema
> struct<>
> -- !query output
> org.apache.spark.sql.catalyst.parser.ParseException
> mismatched input '/' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 
> 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 
> 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 
> 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 
> 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 
> 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, 
> pos 0)
>
> == SQL ==
> /* This is an example of SQL which should not execute:
> ^^^
>  * select 'multi-line'
>
> -- !query
> */
> SELECT 'after multi-line' AS fifth
> -- !query schema
> struct<>
> -- !query output
> org.apache.spark.sql.catalyst.parser.ParseException
> extraneous input '*/' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 
> 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 
> 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 
> 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 
> 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 
> 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, 
> pos 0)
>
> == SQL ==
> */
> ^^^
> SELECT 'after multi-line' AS fifth
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30758) Spark SQL can't display bracketed comments well in generated golden files

2020-02-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30758:
---

Assignee: jiaan.geng

> Spark SQL can't display bracketed comments well in generated golden files
> -
>
> Key: SPARK-30758
> URL: https://issues.apache.org/jira/browse/SPARK-30758
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> Although Spark SQL supports bracketed comments, {{SQLQueryTestSuite}} does not 
> handle them well, so the generated golden files cannot display bracketed 
> comments correctly.
> We can read the output of comments.sql
> [https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/results/postgreSQL/comments.sql.out]
> Such as:
>  
> {code:java}
> -- !query
> /* This is an example of SQL which should not execute:
>  * select 'multi-line'
> -- !query schema
> struct<>
> -- !query output
> org.apache.spark.sql.catalyst.parser.ParseException
> mismatched input '/' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 
> 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 
> 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 
> 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 
> 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 
> 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, 
> pos 0)
>
> == SQL ==
> /* This is an example of SQL which should not execute:
> ^^^
>  * select 'multi-line'
>
> -- !query
> */
> SELECT 'after multi-line' AS fifth
> -- !query schema
> struct<>
> -- !query output
> org.apache.spark.sql.catalyst.parser.ParseException
> extraneous input '*/' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 
> 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 
> 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 
> 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 
> 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 
> 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, 
> pos 0)
>
> == SQL ==
> */
> ^^^
> SELECT 'after multi-line' AS fifth
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30811) CTE that refers to non-existent table with same name causes StackOverflowError

2020-02-13 Thread Jira
Herman van Hövell created SPARK-30811:
-

 Summary: CTE that refers to non-existent table with same name 
causes StackOverflowError
 Key: SPARK-30811
 URL: https://issues.apache.org/jira/browse/SPARK-30811
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5
Reporter: Herman van Hövell


The following query causes a StackOverflowError:
{noformat}
WITH t AS (SELECT 1 FROM nonexist.t) SELECT * FROM t
{noformat}

This only happens when the CTE refers to a non-existent table with the same 
name and a database qualifier. This is caused by a couple of things:
 * {{CTESubstitution}} runs analysis on the CTE, but this does not throw an 
exception because the table has a database qualifier. The reason we don't fail 
here is that we re-attempt to resolve the relation in a later rule.
 * The {{CTESubstitution}} replace logic does not check whether the table it is 
replacing has a database qualifier; it should not replace the relation if it 
does. As a result we happily replace {{nonexist.t}} with {{t}}.
 * {{CTESubstitution}} transforms down, which means it keeps replacing {{t}} 
with itself, creating an infinite recursion.

This is not an issue for master/3.0.
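
A minimal reproduction sketch for a 2.4.x shell (assuming no database named
{{nonexist}} exists); the comments restate the mechanism described above:

{code:scala}
// The CTE is named t and its body refers to the qualified, non-existent
// relation nonexist.t. CTESubstitution substitutes the CTE for that relation
// without checking the database qualifier and, because it transforms down,
// keeps substituting t into itself until the analyzer overflows the stack.
spark.sql("WITH t AS (SELECT 1 FROM nonexist.t) SELECT * FROM t").show()
{code}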



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30811) CTE that refers to non-existent table with same name causes StackOverflowError

2020-02-13 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-30811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell reassigned SPARK-30811:
-

Assignee: Herman van Hövell

> CTE that refers to non-existent table with same name causes StackOverflowError
> --
>
> Key: SPARK-30811
> URL: https://issues.apache.org/jira/browse/SPARK-30811
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>
> The following query causes a StackOverflowError:
> {noformat}
> WITH t AS (SELECT 1 FROM nonexist.t) SELECT * FROM t
> {noformat}
> This only happens when the CTE refers to a non-existent table with the same 
> name and a database qualifier. This is caused by a couple of things:
>  * {{CTESubstitution}} runs analysis on the CTE, but this does not throw an 
> exception because the table has a database qualifier. The reason we don't 
> fail here is that we re-attempt to resolve the relation in a later rule.
>  * The {{CTESubstitution}} replace logic does not check whether the table it 
> is replacing has a database qualifier; it should not replace the relation if 
> it does. As a result we happily replace {{nonexist.t}} with {{t}}.
>  * {{CTESubstitution}} transforms down, which means it keeps replacing 
> {{t}} with itself, creating an infinite recursion.
> This is not an issue for master/3.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer

2020-02-13 Thread Izek Greenfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036203#comment-17036203
 ] 

Izek Greenfield commented on SPARK-16387:
-

Hi [~dongjoon]

thanks for the quick response. We found that on Oracle the added double quotes 
cause errors in our other servers that try to read that table.

If the column name is `t_id` and it is created as "t_id", then running 
`select T_ID from table` gives an error from Oracle, because quoted identifiers 
are case-sensitive there.

The docs on the method say that it escapes only reserved words, but it actually 
escapes all column names... so I think it could be changed to really escape 
only reserved words,

like I do in the link I put in the previous comment.
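
A rough sketch of that idea (the dialect object and abbreviated reserved-word
list are hypothetical, not the change in the linked code): override
{{quoteIdentifier}} so that only names which collide with reserved words are
quoted, leaving names like {{t_id}} unquoted for case-insensitive readers.

{code:scala}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

object QuoteOnlyReservedWords extends JdbcDialect {
  // Hypothetical, abbreviated reserved-word list for illustration only.
  private val reserved = Set("order", "select", "from", "where", "group", "table")

  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  // Quote a column name only when it collides with a reserved word.
  override def quoteIdentifier(colName: String): String =
    if (reserved.contains(colName.toLowerCase)) s""""$colName"""" else colName
}

// Register before writing: JdbcDialects.registerDialect(QuoteOnlyReservedWords)
{code}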

> Reserved SQL words are not escaped by JDBC writer
> -
>
> Key: SPARK-16387
> URL: https://issues.apache.org/jira/browse/SPARK-16387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Lev
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.0.0
>
>
> Here is a code (imports are omitted)
> object Main extends App {
>   val sqlSession = SparkSession.builder().config(new SparkConf().
> setAppName("Sql Test").set("spark.app.id", "SQLTest").
> set("spark.master", "local[2]").
> set("spark.ui.enabled", "false")
> .setJars(Seq("/mysql/mysql-connector-java-5.1.38.jar" ))
>   ).getOrCreate()
>   import sqlSession.implicits._
>   val localprops = new Properties
>   localprops.put("user", "")
>   localprops.put("password", "")
>   val df = sqlSession.createDataset(Seq("a","b","c")).toDF("order")
>   val writer = df.write
>   .mode(SaveMode.Append)
>   writer
>   .jdbc("jdbc:mysql://localhost:3306/test3", s"jira_test", localprops)
> }
> End error is :
> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error 
> in your SQL syntax; check the manual that corresponds to your MySQL server 
> version for the right syntax to use near 'order TEXT )' at line 1
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> Clearly the reserved word {{order}} has to be quoted



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30613) support hive style REPLACE COLUMN syntax

2020-02-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30613.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27482
[https://github.com/apache/spark/pull/27482]

> support hive style REPLACE COLUMN syntax
> 
>
> Key: SPARK-30613
> URL: https://issues.apache.org/jira/browse/SPARK-30613
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.1.0
>
>
> We already support the Hive-style CHANGE COLUMN syntax; I think it's better 
> to also support the Hive-style REPLACE COLUMN syntax. Please refer to the doc: 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30613) support hive style REPLACE COLUMN syntax

2020-02-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30613:
---

Assignee: Terry Kim

> support hive style REPLACE COLUMN syntax
> 
>
> Key: SPARK-30613
> URL: https://issues.apache.org/jira/browse/SPARK-30613
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Terry Kim
>Priority: Major
>
> We already support the Hive-style CHANGE COLUMN syntax; I think it's better 
> to also support the Hive-style REPLACE COLUMN syntax. Please refer to the doc: 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30528) Potential performance regression with DPP subquery duplication

2020-02-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30528.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27551
[https://github.com/apache/spark/pull/27551]

> Potential performance regression with DPP subquery duplication
> --
>
> Key: SPARK-30528
> URL: https://issues.apache.org/jira/browse/SPARK-30528
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.0.0
>Reporter: Mayur Bhosale
>Assignee: Wei Xue
>Priority: Major
>  Labels: performance
> Fix For: 3.0.0
>
> Attachments: cases.png, dup_subquery.png, plan.png
>
>
> In DPP, the heuristic that decides whether DPP is going to benefit relies on 
> the sizes of the tables in the right subtree of the join. This might not be a 
> correct estimate, especially when detailed column-level stats are not available.
> {code:java}
> // the pruning overhead is the total size in bytes of all scan relations
> val overhead = 
> otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
> filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
> {code}
> Also, DPP executes the entire right side of the join as a subquery, which is 
> why multiple scans happen for the tables in the right subtree of the join. 
> This can cause issues when the join is not a Broadcast Hash Join (BHJ) and 
> the subquery result is not reused. Also, I couldn't figure out why the 
> results from the subquery get reused only for BHJ.
>  
> Consider a query,
> {code:java}
> SELECT * 
> FROM   store_sales_partitioned 
>JOIN (SELECT * 
>  FROM   store_returns_partitioned, 
> date_dim 
>  WHERE  sr_returned_date_sk = d_date_sk) ret_date 
>  ON ss_sold_date_sk = d_date_sk 
> WHERE  d_fy_quarter_seq > 0 
> {code}
> DPP will kick in for both joins. (Please check the image plan.png attached 
> below for the plan)
> Some of the observations -
>  * Based on heuristics, DPP would go ahead with pruning if the cost of 
> scanning the tables in the right subtree of the join is less than the benefit 
> due to pruning; the cost matters because multiple scans will be needed for an 
> SMJ. But the heuristic simply checks whether the benefit offsets the cost of 
> the extra scans and does not take into consideration other operations, such 
> as joins, in the right subtree, which can be quite expensive. This issue is 
> particularly prominent when detailed column-level stats are not available. In 
> the example above, the decision that pruningHasBenefit was made on the basis 
> of the sizes of the tables store_returns_partitioned and date_dim, but did 
> not take into consideration the join between them that happens before the 
> join with the store_sales_partitioned table.
>  * Multiple scans are needed when the join is an SMJ, as the exchanges are 
> not reused. This is because an Aggregate gets added on top of the right 
> subtree, which is executed as a subquery in order to prune only the required 
> columns. Here, scanning all the columns (as the right subtree of the join 
> would) and reusing the same exchange might be more helpful, as it avoids 
> duplicate scans.
> This was just a representative example, but in-general for cases such as in 
> the image cases.png below, DPP can cause performance issues.
>  
> Also, for the cases when there are multiple DPP compatible join conditions in 
> the same join, the entire right subtree of the join would be executed as a 
> subquery that many times. Consider an example,
> {code:java}
> SELECT * 
> FROM   partitionedtable 
>    JOIN nonpartitionedtable 
>  ON partcol1 = col1 
> AND partcol2 = col2 
> WHERE  nonpartitionedtable.id > 0 
> {code}
> Here the right subtree of the join (the scan of table nonpartitionedtable) 
> would be executed twice as a subquery, once for each join condition. These 
> two subqueries should be combined and executed only once, as they are almost 
> the same apart from the columns that they prune. Check the image 
> dup_subquery.png attached below for the details.
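
For reference, the Spark 3.0 knobs that control this trade-off (a sketch, not a
tuning recommendation; the values shown are illustrative):

{code:scala}
// DPP on/off switch.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
// When true (the default), the pruning filter is only inserted if the
// broadcast exchange of a BHJ can be reused; setting it to false allows the
// separate pruning subquery discussed above, with the duplicate-scan risk.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly", "false")
// Ratio used by the pruningHasBenefit heuristic when column stats are missing.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio", "0.5")
{code}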



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30528) Potential performance regression with DPP subquery duplication

2020-02-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30528:
---

Assignee: Wei Xue

> Potential performance regression with DPP subquery duplication
> --
>
> Key: SPARK-30528
> URL: https://issues.apache.org/jira/browse/SPARK-30528
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.0.0
>Reporter: Mayur Bhosale
>Assignee: Wei Xue
>Priority: Major
>  Labels: performance
> Attachments: cases.png, dup_subquery.png, plan.png
>
>
> In DPP, the heuristic that decides whether DPP is going to benefit relies on 
> the sizes of the tables in the right subtree of the join. This might not be a 
> correct estimate, especially when detailed column-level stats are not available.
> {code:java}
> // the pruning overhead is the total size in bytes of all scan relations
> val overhead = 
> otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
> filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
> {code}
> Also, DPP executes the entire right side of the join as a subquery, which is 
> why multiple scans happen for the tables in the right subtree of the join. 
> This can cause issues when the join is not a Broadcast Hash Join (BHJ) and 
> the subquery result is not reused. Also, I couldn't figure out why the 
> results from the subquery get reused only for BHJ.
>  
> Consider a query,
> {code:java}
> SELECT * 
> FROM   store_sales_partitioned 
>JOIN (SELECT * 
>  FROM   store_returns_partitioned, 
> date_dim 
>  WHERE  sr_returned_date_sk = d_date_sk) ret_date 
>  ON ss_sold_date_sk = d_date_sk 
> WHERE  d_fy_quarter_seq > 0 
> {code}
> DPP will kick in for both joins. (Please check the image plan.png attached 
> below for the plan)
> Some of the observations -
>  * Based on heuristics, DPP would go ahead with pruning if the cost of 
> scanning the tables in the right subtree of the join is less than the benefit 
> due to pruning; the cost matters because multiple scans will be needed for an 
> SMJ. But the heuristic simply checks whether the benefit offsets the cost of 
> the extra scans and does not take into consideration other operations, such 
> as joins, in the right subtree, which can be quite expensive. This issue is 
> particularly prominent when detailed column-level stats are not available. In 
> the example above, the decision that pruningHasBenefit was made on the basis 
> of the sizes of the tables store_returns_partitioned and date_dim, but did 
> not take into consideration the join between them that happens before the 
> join with the store_sales_partitioned table.
>  * Multiple scans are needed when the join is an SMJ, as the exchanges are 
> not reused. This is because an Aggregate gets added on top of the right 
> subtree, which is executed as a subquery in order to prune only the required 
> columns. Here, scanning all the columns (as the right subtree of the join 
> would) and reusing the same exchange might be more helpful, as it avoids 
> duplicate scans.
> This was just a representative example, but in-general for cases such as in 
> the image cases.png below, DPP can cause performance issues.
>  
> Also, for the cases when there are multiple DPP compatible join conditions in 
> the same join, the entire right subtree of the join would be executed as a 
> subquery that many times. Consider an example,
> {code:java}
> SELECT * 
> FROM   partitionedtable 
>    JOIN nonpartitionedtable 
>  ON partcol1 = col1 
> AND partcol2 = col2 
> WHERE  nonpartitionedtable.id > 0 
> {code}
> Here the right subtree of the join (the scan of table nonpartitionedtable) 
> would be executed twice as a subquery, once for each join condition. These 
> two subqueries should be combined and executed only once, as they are almost 
> the same apart from the columns that they prune. Check the image 
> dup_subquery.png attached below for the details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30810) Allows to parse a Dataset having different column from 'value' in csv(dataset) API

2020-02-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30810:
-
Summary: Allows to parse a Dataset having different column from 'value' in 
csv(dataset) API  (was: Allows to parse a Dataset having different column from 
'value')

> Allows to parse a Dataset having different column from 'value' in 
> csv(dataset) API
> --
>
> Key: SPARK-30810
> URL: https://issues.apache.org/jira/browse/SPARK-30810
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> val ds = spark.range(10).selectExpr("'a, b, c' AS text").as[String]
> spark.read.csv(ds).show()
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given input 
> columns: [text];;
> 'Filter (length(trim('value, None)) > 0)
> +- Project [a, b, c AS text#2]
>+- Range (0, 10, step=1, splits=Some(2))
> {code}
> It fails to create a CSV parsed DataFrame from a String Dataset.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30810) Allows to parse a Dataset having different column from 'value'

2020-02-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30810:
-
Issue Type: Bug  (was: Improvement)

> Allows to parse a Dataset having different column from 'value'
> --
>
> Key: SPARK-30810
> URL: https://issues.apache.org/jira/browse/SPARK-30810
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> val ds = spark.range(10).selectExpr("'a, b, c' AS text").as[String]
> spark.read.csv(ds).show()
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given input 
> columns: [text];;
> 'Filter (length(trim('value, None)) > 0)
> +- Project [a, b, c AS text#2]
>+- Range (0, 10, step=1, splits=Some(2))
> {code}
> It fails to create a CSV parsed DataFrame from a String Dataset.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30810) Allows to parse a Dataset having different column from 'value'

2020-02-13 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-30810:


 Summary: Allows to parse a Dataset having different column from 
'value'
 Key: SPARK-30810
 URL: https://issues.apache.org/jira/browse/SPARK-30810
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


{code}
val ds = spark.range(10).selectExpr("'a, b, c' AS text").as[String]
spark.read.csv(ds).show()
{code}

{code}
org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given input 
columns: [text];;
'Filter (length(trim('value, None)) > 0)
+- Project [a, b, c AS text#2]
   +- Range (0, 10, step=1, splits=Some(2))
{code}

It fails to create a CSV parsed DataFrame from a String Dataset.
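
A possible workaround sketch until this is fixed (assuming the failure is only
due to the column name): rename the single string column to {{value}} before
handing the Dataset to the reader.

{code:scala}
val ds = spark.range(10).selectExpr("'a, b, c' AS text").as[String]
// Renaming the column to "value" lets csv(dataset) resolve the column it expects.
spark.read.csv(ds.toDF("value").as[String]).show()
{code}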




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30809) Review and fix issues in SQL API docs

2020-02-13 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-30809:

Summary: Review and fix issues in SQL API docs  (was: Review and fix issues 
in SQL and Core API docs)

> Review and fix issues in SQL API docs
> -
>
> Key: SPARK-30809
> URL: https://issues.apache.org/jira/browse/SPARK-30809
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30809) Review and fix issues in SQL API docs

2020-02-13 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-30809:

Component/s: (was: Spark Core)

> Review and fix issues in SQL API docs
> -
>
> Key: SPARK-30809
> URL: https://issues.apache.org/jira/browse/SPARK-30809
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30809) Review and fix issues in SQL and Core API docs

2020-02-13 Thread Yuanjian Li (Jira)
Yuanjian Li created SPARK-30809:
---

 Summary: Review and fix issues in SQL and Core API docs
 Key: SPARK-30809
 URL: https://issues.apache.org/jira/browse/SPARK-30809
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Affects Versions: 3.0.0
Reporter: Yuanjian Li






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30808) Thrift server returns wrong timestamps/dates strings before 1582

2020-02-13 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30808:
--

 Summary: Thrift server returns wrong timestamps/dates strings 
before 1582
 Key: SPARK-30808
 URL: https://issues.apache.org/jira/browse/SPARK-30808
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Set the environment variable:
{code:bash}
export TZ="America/Los_Angeles"
./bin/spark-sql -S
{code}
{code:sql}
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone  America/Los_Angeles
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:07:02
{code}

The expected result must be *1001-01-01 00:00:00*.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30743) Use JRE instead of JDK in K8S integration test

2020-02-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30743:
--
Affects Version/s: (was: 3.1.0)
   3.0.0

> Use JRE instead of JDK in K8S integration test
> --
>
> Key: SPARK-30743
> URL: https://issues.apache.org/jira/browse/SPARK-30743
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.1.0
>
>
> This will save some resources and make sure we only need a JRE at runtime 
> and for testing.
> - 
> https://lists.apache.org/thread.html/3145150b711d7806a86bcd3ab43e18bcd0e4892ab5f11600689ba087%40%3Cdev.spark.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30743) Use JRE instead of JDK in K8S integration test

2020-02-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30743:
--
Fix Version/s: (was: 3.1.0)
   3.0.0

> Use JRE instead of JDK in K8S integration test
> --
>
> Key: SPARK-30743
> URL: https://issues.apache.org/jira/browse/SPARK-30743
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>
> This will save some resources and make sure we only need a JRE at runtime 
> and for testing.
> - 
> https://lists.apache.org/thread.html/3145150b711d7806a86bcd3ab43e18bcd0e4892ab5f11600689ba087%40%3Cdev.spark.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30743) Use JRE instead of JDK in K8S integration test

2020-02-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30743:
--
Parent: SPARK-29194
Issue Type: Sub-task  (was: Improvement)

> Use JRE instead of JDK in K8S integration test
> --
>
> Key: SPARK-30743
> URL: https://issues.apache.org/jira/browse/SPARK-30743
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.1.0
>
>
> This will save some resources and make sure we only need a JRE at runtime 
> and for testing.
> - 
> https://lists.apache.org/thread.html/3145150b711d7806a86bcd3ab43e18bcd0e4892ab5f11600689ba087%40%3Cdev.spark.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30807) Support JDK11 in K8S integration test

2020-02-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30807:
-

Assignee: Dongjoon Hyun

> Support JDK11 in K8S integration test
> -
>
> Key: SPARK-30807
> URL: https://issues.apache.org/jira/browse/SPARK-30807
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30807) Support JDK11 in K8S integration test

2020-02-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30807:
--
Summary: Support JDK11 in K8S integration test  (was: Support JDK11 test in 
K8S integration test)

> Support JDK11 in K8S integration test
> -
>
> Key: SPARK-30807
> URL: https://issues.apache.org/jira/browse/SPARK-30807
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30807) Support JDK11 test in K8S integration test

2020-02-13 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-30807:
-

 Summary: Support JDK11 test in K8S integration test
 Key: SPARK-30807
 URL: https://issues.apache.org/jira/browse/SPARK-30807
 Project: Spark
  Issue Type: Sub-task
  Components: Kubernetes, Tests
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org