[jira] [Resolved] (SPARK-35937) Extracting date field from timestamp should work in ANSI mode

2021-06-29 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-35937.

Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33138
[https://github.com/apache/spark/pull/33138]

> Extracting date field from timestamp should work in ANSI mode
> -
>
> Key: SPARK-35937
> URL: https://issues.apache.org/jira/browse/SPARK-35937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
>  Add a new ANSI type coercion rule: when getting a date field from a 
> Timestamp column, cast the column as Date type.
>  This is Spark's hack to make the implementation simple. In the default type 
> coercion rules, the implicit cast rule does the work. However, the ANSI 
> implicit cast rule doesn't allow converting Timestamp type to Date type, so 
> we need this additional rule to make sure the date field extraction from 
> Timestamp columns works.
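A minimal PySpark sketch of the intended behavior, assuming Spark 3.2 with ANSI mode; the choice of the year() field and the timestamp literal are illustrative assumptions, not taken from the pull request:

{code:python}
# Sketch only: with spark.sql.ansi.enabled=true, extracting a date field from a
# timestamp should behave as if the timestamp were first cast to DATE.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ansi-date-field-sketch").getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")

# Direct extraction from a timestamp literal (what the new coercion rule enables).
spark.sql("SELECT year(TIMESTAMP '2021-06-29 12:00:00') AS y").show()

# Equivalent form with the cast written out explicitly.
spark.sql("SELECT year(CAST(TIMESTAMP '2021-06-29 12:00:00' AS DATE)) AS y").show()
{code}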



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-35915) Kafka doesn't recover from data loss

2021-06-29 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371817#comment-17371817
 ] 

Jungtaek Lim edited comment on SPARK-35915 at 6/30/21, 5:27 AM:


OK, the code comment was referring to the old behavior, and it may still be 
better to follow the comment. That looks to be something I missed when 
introducing the consumer pool.

So we don't expect anything other than the pool to call close(), but I admit 
it's easy to make a mistake. (I meant in the Spark codebase.) We should 
probably change something to advise callers not to do it.


was (Author: kabhwan):
OK, the code comment was referring to the old behavior, and it may still be 
better to follow the comment. That looks to be something I missed when 
introducing the consumer pool.

So we don't expect anything other than the pool to call close(), but I admit 
it's easy to make a mistake. We should probably change something to advise 
callers not to do it.

> Kafka doesn't recover from data loss
> 
>
> Key: SPARK-35915
> URL: https://issues.apache.org/jira/browse/SPARK-35915
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.1
>Reporter: Yuval Yellin
>Priority: Major
>
> I configured a structured streaming source for Kafka with 
> failOnDataLoss=false.
> I get this error when checkpoint offsets are not found:
>  
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 7 in stage 5.0 failed 1 times, most recent failure: Lost task 7.0 in 
> stage 5.0 (TID 113) ( executor driver): java.lang.IllegalStateException: This 
> consumer has already been closed.
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquireAndEnsureOpen(KafkaConsumer.java:2439)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seekToBeginning(KafkaConsumer.java:1656)
>   at 
> org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumer.getAvailableOffsetRange(KafkaDataConsumer.scala:108)
>   at 
> org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.getEarliestAvailableOffsetBetween(KafkaDataConsumer.scala:385)
>   at 
> org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:332)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:604)
>   at 
> org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.get(KafkaDataConsumer.scala:287)
>   at 
> org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.next(KafkaBatchPartitionReader.scala:63)
>   at 
> org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
>   at 
> org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> {code}
>  
> The issue seems to me to be related to the OffsetOutOfRangeException handling 
> (around line 323 in KafkaDataConsumer): 
>  
> {code:java}
> case e: OffsetOutOfRangeException =>
>   // When there is some error thrown, it's better to use a new consumer to drop
>   // all cached states in the old consumer. We don't need to worry about the
>   // performance because this is not a common path.
>   releaseConsumer()
>   fetchedData.reset()
>   reportDataLoss(topicPartition, groupId, failOnDataLoss,
>     s"Cannot fetch offset $toFetchOffset", e)
>   toFetchOffset = getEarliestAvailableOffsetBetween(consumer, toFetchOffset, untilOffset)
> }
> {code}
> It seems like releaseConsumer will destroy the consumer, which is later used ...
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35915) Kafka doesn't recover from data loss

2021-06-29 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371817#comment-17371817
 ] 

Jungtaek Lim commented on SPARK-35915:
--

OK, the code comment was referring to the old behavior, and it may still be 
better to follow the comment. That looks to be something I missed when 
introducing the consumer pool.

So we don't expect anything other than the pool to call close(), but I admit 
it's easy to make a mistake. We should probably change something to advise 
callers not to do it.

> Kafka doesn't recover from data loss
> 
>
> Key: SPARK-35915
> URL: https://issues.apache.org/jira/browse/SPARK-35915
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.1
>Reporter: Yuval Yellin
>Priority: Major
>
> I configured a structured streaming source for Kafka with 
> failOnDataLoss=false.
> I get this error when checkpoint offsets are not found:
>  
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 7 in stage 5.0 failed 1 times, most recent failure: Lost task 7.0 in 
> stage 5.0 (TID 113) ( executor driver): java.lang.IllegalStateException: This 
> consumer has already been closed.
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquireAndEnsureOpen(KafkaConsumer.java:2439)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seekToBeginning(KafkaConsumer.java:1656)
>   at 
> org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumer.getAvailableOffsetRange(KafkaDataConsumer.scala:108)
>   at 
> org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.getEarliestAvailableOffsetBetween(KafkaDataConsumer.scala:385)
>   at 
> org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:332)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:604)
>   at 
> org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.get(KafkaDataConsumer.scala:287)
>   at 
> org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.next(KafkaBatchPartitionReader.scala:63)
>   at 
> org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
>   at 
> org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> {code}
>  
> The issue seems to me to be related to the OffsetOutOfRangeException handling 
> (around line 323 in KafkaDataConsumer): 
>  
> {code:java}
> case e: OffsetOutOfRangeException =>
>   // When there is some error thrown, it's better to use a new consumer to drop
>   // all cached states in the old consumer. We don't need to worry about the
>   // performance because this is not a common path.
>   releaseConsumer()
>   fetchedData.reset()
>   reportDataLoss(topicPartition, groupId, failOnDataLoss,
>     s"Cannot fetch offset $toFetchOffset", e)
>   toFetchOffset = getEarliestAvailableOffsetBetween(consumer, toFetchOffset, untilOffset)
> }
> {code}
> It seems like releaseConsumer will destroy the consumer, which is later used ...
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35946) Respect Py4J server if InheritableThread API

2021-06-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-35946:
-

Assignee: Hyukjin Kwon

> Respect Py4J server if InheritableThread API
> 
>
> Key: SPARK-35946
> URL: https://issues.apache.org/jira/browse/SPARK-35946
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently we set the environment variables at the client side of Py4J 
> (python/pyspark/util.py). If the Py4J gateway is created somewhere else 
> (e.g., Zeppelin), it could introduce a breakage at:
> {code}
> from pyspark import SparkContext
> jvm = SparkContext._jvm
> thread_connection = jvm._gateway_client.get_thread_connection()
> # ^ the MLlibMLflowIntegrationSuite test suite failed at this line
> # `AttributeError: 'GatewayClient' object has no attribute 
> 'get_thread_connection'`
> {code}
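A minimal defensive sketch in PySpark around the snippet above; the hasattr() guard and the None fallback are illustrative assumptions, not the fix adopted in the linked pull request:

{code:python}
# Sketch: guard the per-thread connection lookup so that externally created
# Py4J gateways (e.g. Zeppelin) whose GatewayClient lacks
# get_thread_connection() do not raise AttributeError.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
client = sc._jvm._gateway_client

if hasattr(client, "get_thread_connection"):
    thread_connection = client.get_thread_connection()
else:
    thread_connection = None  # this gateway does not track per-thread connections
{code}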



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35946) Respect Py4J server if InheritableThread API

2021-06-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35946.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33147
[https://github.com/apache/spark/pull/33147]

> Respect Py4J server if InheritableThread API
> 
>
> Key: SPARK-35946
> URL: https://issues.apache.org/jira/browse/SPARK-35946
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently we set the environment variables at the client side of Py4J 
> (python/pyspark/util.py). If the Py4J gateway is created somewhere else 
> (e.g., Zeppelin), it could introduce a breakage at:
> {code}
> from pyspark import SparkContext
> jvm = SparkContext._jvm
> thread_connection = jvm._gateway_client.get_thread_connection()
> # ^ the MLlibMLflowIntegrationSuite test suite failed at this line
> # `AttributeError: 'GatewayClient' object has no attribute 
> 'get_thread_connection'`
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35829) Clean up evaluates subexpressions and add more flexibility to evaluate particular subexpression

2021-06-29 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-35829.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32980
[https://github.com/apache/spark/pull/32980]

> Clean up evaluates subexpressions and add more flexibility to evaluate 
> particular subexpression
> ---
>
> Key: SPARK-35829
> URL: https://issues.apache.org/jira/browse/SPARK-35829
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently `subexpressionEliminationForWholeStageCodegen` returns the generated 
> code of subexpressions. The caller simply puts the code into its code block. 
> We need more flexible evaluation here. For example, for the Filter operator's 
> subexpression evaluation, we may need to evaluate a particular subexpression 
> for one predicate. The current approach cannot satisfy that requirement. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35896) [SS] Include more granular metrics for stateful operators in StreamingQueryProgress

2021-06-29 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-35896:


Assignee: Venki Korukanti

> [SS] Include more granular metrics for stateful operators in 
> StreamingQueryProgress
> ---
>
> Key: SPARK-35896
> URL: https://issues.apache.org/jira/browse/SPARK-35896
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.2
>Reporter: Venki Korukanti
>Assignee: Venki Korukanti
>Priority: Major
>
> Currently the streaming progress is missing a few important stateful operator 
> metrics in {{StateOperatorProgress}}. Each stateful operator consists of 
> multiple steps. Ex: {{flatMapGroupsWithState}} has two major steps: 1) 
> processing the input and 2) timeout processing to remove entries from the 
> state which have expired. The main motivation is to track down the time it 
> took for each individual step (such as timeout processing, watermark 
> processing etc) and how much data is processed to pinpoint the bottlenecks 
> and compare for reasoning why some microbatches are slow compared to others 
> in the same job.
> Below are the final metrics common to all stateful operators (the ones in 
> _*bold-italic*_ are newly proposed). These metrics are in 
> {{StateOperatorProgress}}, which is part of {{StreamingQueryProgress}}; a 
> short lookup sketch follows the list.
>  * _*operatorName*_ - State operator name. Can help us identify any operator 
> specific slowness and state store usage patterns. Ex. "dedupe" (derived using 
> {{StateStoreWriter.shortName}})
>  * _numRowsTotal_ - number of rows in the state store across all tasks in a 
> stage where the operator has executed.
>  * _numRowsUpdated_ - number of rows added to or updated in the store
>  * _*allUpdatesTimeMs*_ - time taken to add new rows or update existing state 
> store rows across all tasks in a stage where the operator has executed.
>  * _*numRowsRemoved*_ - number of rows deleted from state store as part of 
> the state cleanup mechanism across all tasks in a stage where the operator 
> has executed. This number helps measure the state store deletions and impact 
> on checkpoint commit and other latencies.
>  * _*allRemovalsTimeMs*_ - time taken to remove rows from the state store as 
> part of state cleanup (this also includes iterating through the entire state 
> store to find which rows to delete) across all tasks in a stage where the 
> operator has executed. If we see jobs spending significant time here, it may 
> justify a better layout in the state store so that only the required rows are 
> read, rather than the entire state store as is done currently.
>  * _*commitTimeMs*_ - time taken to commit the state store changes to 
> external storage for checkpointing. This is cumulative across all tasks in a 
> stage where this operator has executed.
>  * _*numShufflePartitions*_ - number of shuffle partitions this state 
> operator is part of. Currently, metrics like times are aggregated across 
> all tasks in a stage where the operator has executed. Having the number of 
> shuffle partitions (which corresponds to the number of tasks) helps us find 
> the average task contribution to the metric.
>  * _*numStateStores*_ - number of state stores in the operator across all 
> tasks in the stage. Some stateful operators have more than one state store 
> (e.g. stream-stream join). Tracking this number helps us find correlations 
> between state store instances and microbatch latency.
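A minimal PySpark lookup sketch for the metrics above. It assumes only a running streaming query handle (here called query, a hypothetical name) and prints whatever fields the installed Spark version actually reports:

{code:python}
def print_state_operator_metrics(query):
    """Print the StateOperatorProgress metrics from the query's last progress."""
    progress = query.lastProgress  # dict form of StreamingQueryProgress, or None
    if progress is None:
        return
    for op in progress.get("stateOperators", []):
        # Each entry is a plain dict; the exact set of keys depends on the Spark version.
        for name, value in sorted(op.items()):
            print("{} = {}".format(name, value))
{code}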



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35896) [SS] Include more granular metrics for stateful operators in StreamingQueryProgress

2021-06-29 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-35896.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33091
[https://github.com/apache/spark/pull/33091]

> [SS] Include more granular metrics for stateful operators in 
> StreamingQueryProgress
> ---
>
> Key: SPARK-35896
> URL: https://issues.apache.org/jira/browse/SPARK-35896
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.2
>Reporter: Venki Korukanti
>Assignee: Venki Korukanti
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently the streaming progress is missing a few important stateful operator 
> metrics in {{StateOperatorProgress}}. Each stateful operator consists of 
> multiple steps. Ex: {{flatMapGroupsWithState}} has two major steps: 1) 
> processing the input and 2) timeout processing to remove entries from the 
> state which have expired. The main motivation is to track down the time it 
> took for each individual step (such as timeout processing, watermark 
> processing etc) and how much data is processed to pinpoint the bottlenecks 
> and compare for reasoning why some microbatches are slow compared to others 
> in the same job.
> Below are the final metrics common to all stateful operators (the ones in 
> _*bold-italic*_ are newly proposed). These metrics are in 
> {{StateOperatorProgress}}, which is part of {{StreamingQueryProgress}}.
>  * _*operatorName*_ - State operator name. Can help us identify any operator 
> specific slowness and state store usage patterns. Ex. "dedupe" (derived using 
> {{StateStoreWriter.shortName}})
>  * _numRowsTotal_ - number of rows in the state store across all tasks in a 
> stage where the operator has executed.
>  * _numRowsUpdated_ - number of rows added to or updated in the store
>  * _*allUpdatesTimeMs*_ - time taken to add new rows or update existing state 
> store rows across all tasks in a stage where the operator has executed.
>  * _*numRowsRemoved*_ - number of rows deleted from state store as part of 
> the state cleanup mechanism across all tasks in a stage where the operator 
> has executed. This number helps measure the state store deletions and impact 
> on checkpoint commit and other latencies.
>  * _*allRemovalsTimeMs*_ - time taken to remove rows from the state store as 
> part of state cleanup (this also includes iterating through the entire state 
> store to find which rows to delete) across all tasks in a stage where the 
> operator has executed. If we see jobs spending significant time here, it may 
> justify a better layout in the state store so that only the required rows are 
> read, rather than the entire state store as is done currently.
>  * _*commitTimeMs*_ - time taken to commit the state store changes to 
> external storage for checkpointing. This is cumulative across all tasks in a 
> stage where this operator has executed.
>  * _*numShufflePartitions*_ - number of shuffle partitions this state 
> operator is part of. Currently, metrics like times are aggregated across 
> all tasks in a stage where the operator has executed. Having the number of 
> shuffle partitions (which corresponds to the number of tasks) helps us find 
> the average task contribution to the metric.
>  * _*numStateStores*_ - number of state stores in the operator across all 
> tasks in the stage. Some stateful operators have more than one state store 
> (e.g. stream-stream join). Tracking this number helps us find correlations 
> between state store instances and microbatch latency.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33298) Introduce new API to FileCommitProtocol allow flexible file naming

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371795#comment-17371795
 ] 

Apache Spark commented on SPARK-33298:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/33148

> Introduce new API to FileCommitProtocol allow flexible file naming
> --
>
> Key: SPARK-33298
> URL: https://issues.apache.org/jira/browse/SPARK-33298
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.2.0
>
>
> This Jira is to propose a new version for `FileCommitProtocol` 
> ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala]
>  ), e.g. `FileCommitProtocolV2`.
> The motivation is that we currently have two requirements to change the 
> FileCommitProtocol API:
> (1) Support writing Hive ORC/Parquet bucketed tables 
> ([https://github.com/apache/spark/pull/30003]): we need to add a new parameter 
> `prefix` to the methods `newTaskTempFile` and `newTaskTempFileAbsPath`, to allow 
> Spark to write Hive/Presto-compatible bucketed files.
> (2) Fix commit collisions in dynamic partition overwrite mode 
> ([https://github.com/apache/spark/pull/29000]): we need to add a new method 
> `getStagingDir` to allow customizing the dynamic partition staging directory to 
> avoid commit collisions.
>  
> The reason to propose FileCommitProtocolV2 instead of changing 
> `FileCommitProtocol` directly is that the API is effectively public: we allow 
> customized commit protocol subclasses to be used at run time 
> ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala#L146]
>  ). So if we change the API (e.g. adding a method, or changing an existing 
> method signature), it will break external subclasses of the commit protocol. 
> And we are aware of some external subclasses for better object store support, 
> according to [~cloud_fan].
>  
> One proposal for `FileCommitProtocolV2` can be:
> {code:java}
> abstract class FileCommitProtocolV2 {
>   // `options` to replace `ext`, where we can put more string-string parameters
>   def newTaskTempFile(taskContext: TaskAttemptContext, dir: Option[String],
>     options: Map[String, String]): String
>
>   // `options` to replace `ext`, where we can put more string-string parameters
>   def newTaskTempFileAbsPath(taskContext: TaskAttemptContext,
>     absoluteDir: String, options: Map[String, String]): String
>
>   // other new methods, e.g. getStagingDir
>   def getStagingDir(path: String, jobId: String): Path
>
>   // rest of FileCommitProtocol methods
>   ...
> }
> {code}
>  
> The FileCommitProtocolV2.instantiate() logic will first try to find a subclass 
> of `FileCommitProtocolV2`; if there is none, it will look for a subclass of 
> `FileCommitProtocol`, so the current version of `FileCommitProtocol` is still 
> supported.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33298) Introduce new API to FileCommitProtocol allow flexible file naming

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371794#comment-17371794
 ] 

Apache Spark commented on SPARK-33298:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/33148

> Introduce new API to FileCommitProtocol allow flexible file naming
> --
>
> Key: SPARK-33298
> URL: https://issues.apache.org/jira/browse/SPARK-33298
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.2.0
>
>
> This Jira is to propose a new version for `FileCommitProtocol` 
> ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala]
>  ), e.g. `FileCommitProtocolV2`.
> The motivation is that we currently have two requirements to change the 
> FileCommitProtocol API:
> (1) Support writing Hive ORC/Parquet bucketed tables 
> ([https://github.com/apache/spark/pull/30003]): we need to add a new parameter 
> `prefix` to the methods `newTaskTempFile` and `newTaskTempFileAbsPath`, to allow 
> Spark to write Hive/Presto-compatible bucketed files.
> (2) Fix commit collisions in dynamic partition overwrite mode 
> ([https://github.com/apache/spark/pull/29000]): we need to add a new method 
> `getStagingDir` to allow customizing the dynamic partition staging directory to 
> avoid commit collisions.
>  
> The reason to propose FileCommitProtocolV2 instead of changing 
> `FileCommitProtocol` directly is that the API is effectively public: we allow 
> customized commit protocol subclasses to be used at run time 
> ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala#L146]
>  ). So if we change the API (e.g. adding a method, or changing an existing 
> method signature), it will break external subclasses of the commit protocol. 
> And we are aware of some external subclasses for better object store support, 
> according to [~cloud_fan].
>  
> One proposal for `FileCommitProtocolV2` can be:
> {code:java}
> abstract class FileCommitProtocolV2 {
>   // `options` to replace `ext`, where we can put more string-string parameters
>   def newTaskTempFile(taskContext: TaskAttemptContext, dir: Option[String],
>     options: Map[String, String]): String
>
>   // `options` to replace `ext`, where we can put more string-string parameters
>   def newTaskTempFileAbsPath(taskContext: TaskAttemptContext,
>     absoluteDir: String, options: Map[String, String]): String
>
>   // other new methods, e.g. getStagingDir
>   def getStagingDir(path: String, jobId: String): Path
>
>   // rest of FileCommitProtocol methods
>   ...
> }
> {code}
>  
> The FileCommitProtocolV2.instantiate() logic will first try to find a subclass 
> of `FileCommitProtocolV2`; if there is none, it will look for a subclass of 
> `FileCommitProtocol`, so the current version of `FileCommitProtocol` is still 
> supported.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35946) Respect Py4J server if InheritableThread API

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371791#comment-17371791
 ] 

Apache Spark commented on SPARK-35946:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/33147

> Respect Py4J server if InheritableThread API
> 
>
> Key: SPARK-35946
> URL: https://issues.apache.org/jira/browse/SPARK-35946
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently we set the environment variables at the client side of Py4J 
> (python/pyspark/util.py). If the Py4J gateway is created somewhere else 
> (e.g., Zeppelin), it could introduce a breakage at:
> {code}
> from pyspark import SparkContext
> jvm = SparkContext._jvm
> thread_connection = jvm._gateway_client.get_thread_connection()
> # ^ the MLlibMLflowIntegrationSuite test suite failed at this line
> # `AttributeError: 'GatewayClient' object has no attribute 
> 'get_thread_connection'`
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35946) Respect Py4J server if InheritableThread API

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35946:


Assignee: Apache Spark

> Respect Py4J server if InheritableThread API
> 
>
> Key: SPARK-35946
> URL: https://issues.apache.org/jira/browse/SPARK-35946
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Currently we set the environment variables at the client side of Py4J 
> (python/pyspark/util.py). If the Py4J gateway is created somewhere else 
> (e.g., Zeppelin), it could introduce a breakage at:
> {code}
> from pyspark import SparkContext
> jvm = SparkContext._jvm
> thread_connection = jvm._gateway_client.get_thread_connection()
> # ^ the MLlibMLflowIntegrationSuite test suite failed at this line
> # `AttributeError: 'GatewayClient' object has no attribute 
> 'get_thread_connection'`
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35946) Respect Py4J server if InheritableThread API

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371790#comment-17371790
 ] 

Apache Spark commented on SPARK-35946:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/33147

> Respect Py4J server if InheritableThread API
> 
>
> Key: SPARK-35946
> URL: https://issues.apache.org/jira/browse/SPARK-35946
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently we set the environment variables at the client side of Py4J 
> (python/pyspark/util.py). If the Py4J gateway is created somewhere else 
> (e.g., Zeppelin), it could introduce a breakage at:
> {code}
> from pyspark import SparkContext
> jvm = SparkContext._jvm
> thread_connection = jvm._gateway_client.get_thread_connection()
> # ^ the MLlibMLflowIntegrationSuite test suite failed at this line
> # `AttributeError: 'GatewayClient' object has no attribute 
> 'get_thread_connection'`
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35946) Respect Py4J server if InheritableThread API

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35946:


Assignee: (was: Apache Spark)

> Respect Py4J server if InheritableThread API
> 
>
> Key: SPARK-35946
> URL: https://issues.apache.org/jira/browse/SPARK-35946
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently we set the environment variables at the client side of Py4J 
> (python/pyspark/util.py). If the Py4J gateway is created somewhere else 
> (e.g., Zeppelin), it could introduce a breakage at:
> {code}
> from pyspark import SparkContext
> jvm = SparkContext._jvm
> thread_connection = jvm._gateway_client.get_thread_connection()
> # ^ the MLlibMLflowIntegrationSuite test suite failed at this line
> # `AttributeError: 'GatewayClient' object has no attribute 
> 'get_thread_connection'`
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35946) Respect Py4J server if InheritableThread API

2021-06-29 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-35946:


 Summary: Respect Py4J server if InheritableThread API
 Key: SPARK-35946
 URL: https://issues.apache.org/jira/browse/SPARK-35946
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon


Currently we set the environment variables at the client side of Py4J 
(python/pyspark/util.py). If the Py4J gateway is created somewhere else (e.g., 
Zeppelin), it could introduce a breakage at:

{code}
from pyspark import SparkContext
jvm = SparkContext._jvm
thread_connection = jvm._gateway_client.get_thread_connection()
# ^ the MLlibMLflowIntegrationSuite test suite failed at this line
# `AttributeError: 'GatewayClient' object has no attribute 
'get_thread_connection'`
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35912) [SQL] JSON read behavior is different depending on the cache setting when nullable is false.

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371787#comment-17371787
 ] 

Apache Spark commented on SPARK-35912:
--

User 'cfmcgrady' has created a pull request for this issue:
https://github.com/apache/spark/pull/33146

> [SQL] JSON read behavior is different depending on the cache setting when 
> nullable is false.
> 
>
> Key: SPARK-35912
> URL: https://issues.apache.org/jira/browse/SPARK-35912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Heedo Lee
>Priority: Minor
>
> Below is the reproduced code.
>  
> {code:java}
> import org.apache.spark.sql.Encoders
>  
> case class TestSchema(x: Int, y: Int)
> case class BaseSchema(value: TestSchema)
>  
> val schema = Encoders.product[BaseSchema].schema
> val testDS = Seq("""{"value":{"x":1}}""", """{"value":{"x":2}}""").toDS
> val jsonDS = spark.read.schema(schema).json(testDS)
> jsonDS.show
> +---------+
> |    value|
> +---------+
> |{1, null}|
> |{2, null}|
> +---------+
> jsonDS.cache.show
> +------+
> | value|
> +------+
> |{1, 0}|
> |{2, 0}|
> +------+
> {code}
>  
> The above result occurs when a schema is created with a nested StructType and 
> nullable of StructField is false.
>  
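A PySpark restatement of the repro, as a sketch; building the nested schema by hand with nullable=False is assumed to match what Encoders.product derives for the Int fields:

{code:python}
# Sketch: the key ingredient is a nested StructType whose inner fields have
# nullable=False; the JSON input leaves "y" missing in both records.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.appName("json-nullable-cache-sketch").getOrCreate()

schema = StructType([
    StructField("value", StructType([
        StructField("x", IntegerType(), nullable=False),
        StructField("y", IntegerType(), nullable=False),
    ]), nullable=False),
])

data = spark.sparkContext.parallelize(['{"value":{"x":1}}', '{"value":{"x":2}}'])
json_df = spark.read.schema(schema).json(data)

json_df.show()          # reported above: y shows as null
json_df.cache().show()  # reported above: after caching, y shows as 0
{code}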



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35912) [SQL] JSON read behavior is different depending on the cache setting when nullable is false.

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35912:


Assignee: (was: Apache Spark)

> [SQL] JSON read behavior is different depending on the cache setting when 
> nullable is false.
> 
>
> Key: SPARK-35912
> URL: https://issues.apache.org/jira/browse/SPARK-35912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Heedo Lee
>Priority: Minor
>
> Below is the reproduced code.
>  
> {code:java}
> import org.apache.spark.sql.Encoders
>  
> case class TestSchema(x: Int, y: Int)
> case class BaseSchema(value: TestSchema)
>  
> val schema = Encoders.product[BaseSchema].schema
> val testDS = Seq("""{"value":{"x":1}}""", """{"value":{"x":2}}""").toDS
> val jsonDS = spark.read.schema(schema).json(testDS)
> jsonDS.show
> +---------+
> |    value|
> +---------+
> |{1, null}|
> |{2, null}|
> +---------+
> jsonDS.cache.show
> +------+
> | value|
> +------+
> |{1, 0}|
> |{2, 0}|
> +------+
> {code}
>  
> The above result occurs when a schema is created with a nested StructType and 
> nullable of StructField is false.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35912) [SQL] JSON read behavior is different depending on the cache setting when nullable is false.

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35912:


Assignee: Apache Spark

> [SQL] JSON read behavior is different depending on the cache setting when 
> nullable is false.
> 
>
> Key: SPARK-35912
> URL: https://issues.apache.org/jira/browse/SPARK-35912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Heedo Lee
>Assignee: Apache Spark
>Priority: Minor
>
> Below is the reproduced code.
>  
> {code:java}
> import org.apache.spark.sql.Encoders
>  
> case class TestSchema(x: Int, y: Int)
> case class BaseSchema(value: TestSchema)
>  
> val schema = Encoders.product[BaseSchema].schema
> val testDS = Seq("""{"value":{"x":1}}""", """{"value":{"x":2}}""").toDS
> val jsonDS = spark.read.schema(schema).json(testDS)
> jsonDS.show
> +---------+
> |    value|
> +---------+
> |{1, null}|
> |{2, null}|
> +---------+
> jsonDS.cache.show
> +------+
> | value|
> +------+
> |{1, 0}|
> |{2, 0}|
> +------+
> {code}
>  
> The above result occurs when a schema is created with a nested StructType and 
> nullable of StructField is false.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35912) [SQL] JSON read behavior is different depending on the cache setting when nullable is false.

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371786#comment-17371786
 ] 

Apache Spark commented on SPARK-35912:
--

User 'cfmcgrady' has created a pull request for this issue:
https://github.com/apache/spark/pull/33146

> [SQL] JSON read behavior is different depending on the cache setting when 
> nullable is false.
> 
>
> Key: SPARK-35912
> URL: https://issues.apache.org/jira/browse/SPARK-35912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Heedo Lee
>Priority: Minor
>
> Below is the reproduced code.
>  
> {code:java}
> import org.apache.spark.sql.Encoders
>  
> case class TestSchema(x: Int, y: Int)
> case class BaseSchema(value: TestSchema)
>  
> val schema = Encoders.product[BaseSchema].schema
> val testDS = Seq("""{"value":{"x":1}}""", """{"value":{"x":2}}""").toDS
> val jsonDS = spark.read.schema(schema).json(testDS)
> jsonDS.show
> +---------+
> |    value|
> +---------+
> |{1, null}|
> |{2, null}|
> +---------+
> jsonDS.cache.show
> +------+
> | value|
> +------+
> |{1, 0}|
> |{2, 0}|
> +------+
> {code}
>  
> The above result occurs when a schema is created with a nested StructType and 
> nullable of StructField is false.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24540) Support for multiple character delimiter in Spark CSV read

2021-06-29 Thread Chandra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371780#comment-17371780
 ] 

Chandra commented on SPARK-24540:
-

My requirement is to process files which have a multi-character row delimiter 
and a multi-character column delimiter.

I tried multiple options but ended up with a few issues.

 

File sample:

127'~'127433'~''~''~'2'~'ICR'~'STDLONG'~'NR'~'NR'~'1997-06-25 
14:47:37'~''~'NR'~''~'1997-06-25 14:47:37'~'BBB'~''~'Stable'~''~''~''~'Not 
Rated'~'CreditWatch/Outlook'~'OL'~''~''~''~'#@#@#152'~'308044'~''~''~'2'~'ICR'~'FCLONG'~'NR'~'NR'~'1997-12-05
 14:23:33'~'NM'~'NR'~'1997-12-05 14:23:33'~'1997-12-05 14:23:33'~'B+'~'Watch 
Pos'~'NM'~''~''~''~'Not 
Rated'~'CreditWatch/Outlook'~'OL'~''~''~''~'#@#@#155'~'308044'~''~''~'2'~'ICR'~'STDLONG'~'NR'~'NR'~'1997-12-05
 14:23:34'~'NM'~'NR'~'1997-12-05 14:23:34'~'1997-12-05 14:23:34'~'B+'~'Watch 
Pos'~'NM'~''~''~''~'Not 
Rated'~'CreditWatch/Outlook'~'OL'~''~''~''~'#@#@#282'~'127812'~''~''~'2'~'ICR'~'FCLONG'~'NR'~'NR'~'1998-11-06
 14:45:54'~'NM'~'NR'~'1998-11-06 14:45:54'~'1998-11-06 14:45:54'~'B+'~'Watch 
Pos'~'NM'~''~''~''~'Not 
Rated'~'CreditWatch/Outlook'~'OL'~''~''~''~'#@#@#287'~'127812'~''~''~'2'~'ICR'~'STDLONG'~'NR'~'NR'~'1998-11-06
 14:45:54'~'NM'~'NR'~'1998-11-06 14:45:54'~'1998-11-06 14:45:54'~'B+'~'Watch 
Pos'~'NM'~''~''~''~'Not 
Rated'~'CreditWatch/Outlook'~'OL'~''~''~''~'#@#@#294'~'100899'~''~''~'2'~'ICR'~'FCLONG'~'NR'~'NR'~'1996-08-01
 17:58:09'~'NM'~'NR'~'1996-08-01 17:58:09'~'1996-08-01 17:58:09'~'BB-'~'Watch 
Neg'~'NM'~''~''~''~'Not 
Rated'~'CreditWatch/Outlook'~'OL'~''~''~''~'#@#@#303'~'100899'~''~''~'2'~'ICR'~'STDLONG'~'NR'~'NR'~'1996-08-01
 17:58:09'~'NM'~'NR'~'1996-08-01 17:58:09'~'1996-08-01 17:58:09'~'BB-'~'Watch 
Neg'~'NM'~''~''~''~'Not 
Rated'~'CreditWatch/Outlook'~'OL'~''~''~''~'#@#@#927'~'104464'~''~''~'2'~'ICR'~'STDLONG'~'NR'~'NR'~'1997-05-13
 14:45:30'~''~'NR'~''~'1997-05-13 14:45:30'~'A'~''~'Stable'~''~''~''~'Not 
Rated'~'CreditWatch/Outlook'~'OL'~''~''~''~'#@#@#    

 

Row delimiter: #@#@#     Column delimiter: '~'

Code:

df2 = spark.read.load("spRatingData_sample.txt",
                      format="csv",
                      sep="'~'",
                      lineSep="#@#@#")
print("two.csv rowcount: {}".format(df2.count()))

 

ERROR:

: java.lang.IllegalArgumentException: Delimiter cannot be more than one 
character: '~'
at 
org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:118)
at 
org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:87)
at 
org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:45)
at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:58)
at 
org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:183)
at scala.Option.orElse(Option.scala:447)
at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:180)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 4, in 
File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 166, in load
return self._df(self._jreader.load(path))
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
1257, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: "Delimiter cannot be more than one 
character: '~'"
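A minimal workaround sketch for Spark versions whose CSV reader accepts only a single-character delimiter. The delimiters and file name are taken from the report above; the record/column handling is an assumption since the column names are not known:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("multi-char-delimiter-sketch").getOrCreate()

# Read each physical line as plain text, then split records and columns manually.
raw = spark.read.text("spRatingData_sample.txt")
records = (raw
           .select(F.explode(F.split("value", "#@#@#")).alias("record"))
           .filter(F.length(F.trim("record")) > 0))
# "'~'" contains no regex metacharacters, so it can be used directly as a pattern.
rows = records.select(F.split("record", "'~'").alias("fields"))
print("row count: {}".format(rows.count()))
{code}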

> Support for multiple character delimiter in Spark CSV read
> --
>
> Key: SPARK-24540
> URL: https://issues.apache.org/jira/browse/SPARK-24540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Ashwin K
>Assignee: Jeff Evans
>   

[jira] [Updated] (SPARK-35945) Unable to parse multi character row and column delimited files using Spark

2021-06-29 Thread Chandra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandra updated SPARK-35945:

Description: 
My requirement is to process files which have a multi-character row delimiter 
and a multi-character column delimiter.

I tried multiple options but ended up with a few issues.

 

File sample:

127'~'127433'~''~''~'2'~'ICR'~'STDLONG'~'NR'~'NR'~'1997-06-25 
14:47:37'~''~'NR'~''~'1997-06-25 14:47:37'~'BBB'~''~'Stable'~''~''~''~'Not 
Rated'~'CreditWatch/Outlook'~'OL'~''~''~''~'#@#@#152'~'308044'~''~''~'2'~'ICR'~'FCLONG'~'NR'~'NR'~'1997-12-05
 14:23:33'~'NM'~'NR'~'1997-12-05 14:23:33'~'1997-12-05 14:23:33'~'B+'~'Watch 
Pos'~'NM'~''~''~''~'Not 
Rated'~'CreditWatch/Outlook'~'OL'~''~''~''~'#@#@#155'~'308044'~''~''~'2'~'ICR'~'STDLONG'~'NR'~'NR'~'1997-12-05
 14:23:34'~'NM'~'NR'~'1997-12-05 14:23:34'~'1997-12-05 14:23:34'~'B+'~'Watch 
Pos'~'NM'~''~''~''~'Not 
Rated'~'CreditWatch/Outlook'~'OL'~''~''~''~'#@#[~infrabot]

 

Row delimiter: #@#@#     Column delimiter: '~'

Code:

df2 = spark.read.load("spRatingData_sample.txt",
                      format="csv",
                      sep="'~'",
                      lineSep="#@#@#")
print("two.csv rowcount: {}".format(df2.count()))

 

ERROR:

: java.lang.IllegalArgumentException: Delimiter cannot be more than one 
character: '~'
 at 
org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:118)
 at 
org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:87)
 at 
org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:45)
 at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:58)
 at 
org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:183)
 at scala.Option.orElse(Option.scala:447)
 at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:180)
 at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
 at py4j.Gateway.invoke(Gateway.java:282)
 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
 at py4j.commands.CallCommand.execute(CallCommand.java:79)
 at py4j.GatewayConnection.run(GatewayConnection.java:238)
 at java.lang.Thread.run(Thread.java:748)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
 File "", line 4, in 
 File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 166, in load
 return self._df(self._jreader.load(path))
 File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", 
line 1257, in __call__
 File "/usr/lib/spark/python/pyspark/sql/utils.py", line 79, in deco
 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
 pyspark.sql.utils.IllegalArgumentException: "Delimiter cannot be more than one 
character: '~'"

  was:
My requirement is to process files which have a multi-character row delimiter 
and a multi-character column delimiter.

I tried multiple options but ended up with a few issues.

 

File sample:

127'~'127433'~''~''~'2'~'ICR'~'STDLONG'~'NR'~'NR'~'1997-06-25 
14:47:37'~''~'NR'~''~'1997-06-25 14:47:37'~'BBB'~''~'Stable'~''~''~''~'Not 
Rated'~'CreditWatch/Outlook'~'OL'~''~''~''~'#@#@#152'~'308044'~''~''~'2'~'ICR'~'FCLONG'~'NR'~'NR'~'1997-12-05
 14:23:33'~'NM'~'NR'~'1997-12-05 14:23:33'~'1997-12-05 14:23:33'~'B+'~'Watch 
Pos'~'NM'~''~''~''~'Not 
Rated'~'CreditWatch/Outlook'~'OL'~''~''~''~'#@#@#155'~'308044'~''~''~'2'~'ICR'~'STDLONG'~'NR'~'NR'~'1997-12-05
 14:23:34'~'NM'~'NR'~'1997-12-05 14:23:34'~'1997-12-05 14:23:34'~'B+'~'Watch 
Pos'~'NM'~''~''~''~'Not 
Rated'~'CreditWatch/Outlook'~'OL'~''~''~''~'#@#[~infrabot]

 

Code:

df2 = spark.read.load("spRatingData_sample.txt",
 format="csv", 
 sep="'~'",
 lineSep="#@#@#")
print("two.csv rowcount: {}".format(df2.count()))

 

ERROR:

: java.lang.IllegalArgumentException: Delimiter cannot be more than one 
character: '~'
 at 
org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:118)
 at 
org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:87)
 at 
org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:45)
 at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:58)
 at 

[jira] [Created] (SPARK-35945) Unable to parse multi character row and column delimited files using Spark

2021-06-29 Thread Chandra (Jira)
Chandra created SPARK-35945:
---

 Summary: Unable to parse  multi character row and column delimited 
files using Spark
 Key: SPARK-35945
 URL: https://issues.apache.org/jira/browse/SPARK-35945
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.4.4
 Environment: development
Reporter: Chandra


My requirement is to process files which have a multi-character row delimiter 
and a multi-character column delimiter.

I tried multiple options but ended up with a few issues.

 

File sample:

127'~'127433'~''~''~'2'~'ICR'~'STDLONG'~'NR'~'NR'~'1997-06-25 
14:47:37'~''~'NR'~''~'1997-06-25 14:47:37'~'BBB'~''~'Stable'~''~''~''~'Not 
Rated'~'CreditWatch/Outlook'~'OL'~''~''~''~'#@#@#152'~'308044'~''~''~'2'~'ICR'~'FCLONG'~'NR'~'NR'~'1997-12-05
 14:23:33'~'NM'~'NR'~'1997-12-05 14:23:33'~'1997-12-05 14:23:33'~'B+'~'Watch 
Pos'~'NM'~''~''~''~'Not 
Rated'~'CreditWatch/Outlook'~'OL'~''~''~''~'#@#@#155'~'308044'~''~''~'2'~'ICR'~'STDLONG'~'NR'~'NR'~'1997-12-05
 14:23:34'~'NM'~'NR'~'1997-12-05 14:23:34'~'1997-12-05 14:23:34'~'B+'~'Watch 
Pos'~'NM'~''~''~''~'Not 
Rated'~'CreditWatch/Outlook'~'OL'~''~''~''~'#@#[~infrabot]

 

Code:

df2 = spark.read.load("spRatingData_sample.txt",
                      format="csv",
                      sep="'~'",
                      lineSep="#@#@#")
print("two.csv rowcount: {}".format(df2.count()))

 

ERROR:

: java.lang.IllegalArgumentException: Delimiter cannot be more than one 
character: '~'
 at 
org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:118)
 at 
org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:87)
 at 
org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:45)
 at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:58)
 at 
org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:183)
 at scala.Option.orElse(Option.scala:447)
 at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:180)
 at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
 at py4j.Gateway.invoke(Gateway.java:282)
 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
 at py4j.commands.CallCommand.execute(CallCommand.java:79)
 at py4j.GatewayConnection.run(GatewayConnection.java:238)
 at java.lang.Thread.run(Thread.java:748)


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
 File "", line 4, in 
 File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 166, in load
 return self._df(self._jreader.load(path))
 File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", 
line 1257, in __call__
 File "/usr/lib/spark/python/pyspark/sql/utils.py", line 79, in deco
 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: "Delimiter cannot be more than one 
character: '~'"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35943) Introduce Axis type alias.

2021-06-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35943.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33144
[https://github.com/apache/spark/pull/33144]

> Introduce Axis type alias.
> --
>
> Key: SPARK-35943
> URL: https://issues.apache.org/jira/browse/SPARK-35943
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34549) Upgrade aws kinesis to 1.14.0 and java sdk 1.11.844

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34549:


Assignee: (was: Apache Spark)

> Upgrade aws kinesis to 1.14.0 and java sdk 1.11.844
> ---
>
> Key: SPARK-34549
> URL: https://issues.apache.org/jira/browse/SPARK-34549
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Upgrade the AWS Kinesis client and the Java SDK to catch up with the minimum 
> requirement for IAM roles for service accounts: 
> https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-minimum-sdk.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34549) Upgrade aws kinesis to 1.14.0 and java sdk 1.11.844

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34549:


Assignee: Apache Spark

> Upgrade aws kinesis to 1.14.0 and java sdk 1.11.844
> ---
>
> Key: SPARK-34549
> URL: https://issues.apache.org/jira/browse/SPARK-34549
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> Upgrade the AWS Kinesis client and the Java SDK to catch up with the minimum 
> requirement for IAM roles for service accounts: 
> https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-minimum-sdk.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35943) Introduce Axis type alias.

2021-06-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35943:


Assignee: Takuya Ueshin

> Introduce Axis type alias.
> --
>
> Key: SPARK-35943
> URL: https://issues.apache.org/jira/browse/SPARK-35943
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-34549) Upgrade aws kinesis to 1.14.0 and java sdk 1.11.844

2021-06-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-34549:
--
  Assignee: (was: L. C. Hsieh)

Reverted at 
https://github.com/apache/spark/commit/7ad682aaa169983f60c0e48547825ed2b255777e

> Upgrade aws kinesis to 1.14.0 and java sdk 1.11.844
> ---
>
> Key: SPARK-34549
> URL: https://issues.apache.org/jira/browse/SPARK-34549
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> Upgrade the AWS Kinesis client and the AWS Java SDK to catch up with the minimum
> SDK requirement for IAM roles for service accounts: 
> https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-minimum-sdk.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34549) Upgrade aws kinesis to 1.14.0 and java sdk 1.11.844

2021-06-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34549:
-
Fix Version/s: (was: 3.2.0)

> Upgrade aws kinesis to 1.14.0 and java sdk 1.11.844
> ---
>
> Key: SPARK-34549
> URL: https://issues.apache.org/jira/browse/SPARK-34549
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Upgrade the AWS Kinesis client and the AWS Java SDK to catch up with the minimum
> SDK requirement for IAM roles for service accounts: 
> https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-minimum-sdk.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35925) Support DayTimeIntervalType in width-bucket function

2021-06-29 Thread PengLei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PengLei updated SPARK-35925:

Fix Version/s: (was: 3.2.0)

> Support DayTimeIntervalType in width-bucket function
> 
>
> Key: SPARK-35925
> URL: https://issues.apache.org/jira/browse/SPARK-35925
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: PengLei
>Priority: Major
>
> Currently, width_bucket supports the argument types [DoubleType, DoubleType, 
> DoubleType, LongType].
> We hope to also support [DayTimeIntervalType, DayTimeIntervalType, 
> DayTimeIntervalType, LongType] (see the sketch below).
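
For reference, a minimal PySpark sketch (a hedged illustration, not part of the
ticket): the first query uses the numeric form of width_bucket that already
exists, while the commented query is only the hoped-for day-time interval form
this ticket proposes and does not run yet.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Numeric operands are bucketed into 5 equal-width buckets (works today).
spark.sql("SELECT width_bucket(5.3, 0.2, 10.6, 5)").show()

# Hoped-for day-time interval form (hypothetical, not yet supported):
# spark.sql("SELECT width_bucket(INTERVAL '3' DAY, INTERVAL '0' DAY, "
#           "INTERVAL '10' DAY, 5)").show()
{code}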



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35914) Driver can't distribute task to executor because NullPointerException

2021-06-29 Thread Helt Long (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Helt Long updated SPARK-35914:
--
Environment: 
hadoop 2.6.0-cdh5.7.1

Spark 3.0.1, 3.1.1, 3.1.2

  was:
CDH 5.7.1: Hadoop 2.6.5

Spark 3.0.1, 3.1.1, 3.1.2


> Driver can't distribute task to executor because NullPointerException
> -
>
> Key: SPARK-35914
> URL: https://issues.apache.org/jira/browse/SPARK-35914
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.1, 3.1.2
> Environment: hadoop 2.6.0-cdh5.7.1
> Spark 3.0.1, 3.1.1, 3.1.2
>Reporter: Helt Long
>Priority: Major
> Attachments: stuck log.png, webui stuck.png
>
>
> When I use Spark 3 to submit a job to a YARN cluster, I run into a problem. Once 
> in a while the driver can't distribute any tasks to any executors, so the stage 
> and the whole job get stuck. Checking the driver log, I found a 
> NullPointerException. It looks like a Netty problem, and I can confirm the problem 
> only exists in Spark 3; it never happened when I used Spark 2.
>  
> {code:java}
> // Error message
> 21/06/28 14:42:43 INFO TaskSetManager: Starting task 2592.0 in stage 1.0 (TID 
> 3494) (worker39.hadoop, executor 84, partition 2592, RACK_LOCAL, 5006 bytes) 
> taskResourceAssignments Map()
> 21/06/28 14:42:43 INFO TaskSetManager: Finished task 4155.0 in stage 1.0 (TID 
> 3367) in 36670 ms on worker39.hadoop (executor 84) (3278/4249)
> 21/06/28 14:42:43 INFO TaskSetManager: Finished task 2283.0 in stage 1.0 (TID 
> 3422) in 22371 ms on worker15.hadoop (executor 109) (3279/4249)
> 21/06/28 14:42:43 ERROR Inbox: Ignoring error
> java.lang.NullPointerException
>   at java.lang.String.length(String.java:623)
>   at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:420)
>   at java.lang.StringBuilder.append(StringBuilder.java:136)
>   at 
> org.apache.spark.scheduler.TaskSetManager.$anonfun$resourceOffer$5(TaskSetManager.scala:483)
>   at org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
>   at org.apache.spark.internal.Logging.logInfo$(Logging.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSetManager.logInfo(TaskSetManager.scala:54)
>   at 
> org.apache.spark.scheduler.TaskSetManager.$anonfun$resourceOffer$2(TaskSetManager.scala:484)
>   at scala.Option.map(Option.scala:230)
>   at 
> org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:444)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$2(TaskSchedulerImpl.scala:397)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$2$adapted(TaskSchedulerImpl.scala:392)
>   at scala.Option.foreach(Option.scala:407)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$1(TaskSchedulerImpl.scala:392)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:383)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$20(TaskSchedulerImpl.scala:581)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$20$adapted(TaskSchedulerImpl.scala:576)
>   at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$16(TaskSchedulerImpl.scala:576)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$16$adapted(TaskSchedulerImpl.scala:547)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:547)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.$anonfun$makeOffers$5(CoarseGrainedSchedulerBackend.scala:340)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$$withLock(CoarseGrainedSchedulerBackend.scala:904)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:332)
>   at 
> 

[jira] [Commented] (SPARK-35914) Driver can't distribute task to executor because NullPointerException

2021-06-29 Thread Helt Long (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371758#comment-17371758
 ] 

Helt Long commented on SPARK-35914:
---

Go ahead, [~code_kr_dev_s], but this problem is difficult to reproduce; it 
occurs randomly. If you need any help, let me know!

> Driver can't distribute task to executor because NullPointerException
> -
>
> Key: SPARK-35914
> URL: https://issues.apache.org/jira/browse/SPARK-35914
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.1, 3.1.2
> Environment: CDH 5.7.1: Hadoop 2.6.5
> Spark 3.0.1, 3.1.1, 3.1.2
>Reporter: Helt Long
>Priority: Major
> Attachments: stuck log.png, webui stuck.png
>
>
> When I use Spark 3 to submit a job to a YARN cluster, I run into a problem. Once 
> in a while the driver can't distribute any tasks to any executors, so the stage 
> and the whole job get stuck. Checking the driver log, I found a 
> NullPointerException. It looks like a Netty problem, and I can confirm the problem 
> only exists in Spark 3; it never happened when I used Spark 2.
>  
> {code:java}
> // Error message
> 21/06/28 14:42:43 INFO TaskSetManager: Starting task 2592.0 in stage 1.0 (TID 
> 3494) (worker39.hadoop, executor 84, partition 2592, RACK_LOCAL, 5006 bytes) 
> taskResourceAssignments Map()
> 21/06/28 14:42:43 INFO TaskSetManager: Finished task 4155.0 in stage 1.0 (TID 
> 3367) in 36670 ms on worker39.hadoop (executor 84) (3278/4249)
> 21/06/28 14:42:43 INFO TaskSetManager: Finished task 2283.0 in stage 1.0 (TID 
> 3422) in 22371 ms on worker15.hadoop (executor 109) (3279/4249)
> 21/06/28 14:42:43 ERROR Inbox: Ignoring error
> java.lang.NullPointerException
>   at java.lang.String.length(String.java:623)
>   at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:420)
>   at java.lang.StringBuilder.append(StringBuilder.java:136)
>   at 
> org.apache.spark.scheduler.TaskSetManager.$anonfun$resourceOffer$5(TaskSetManager.scala:483)
>   at org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
>   at org.apache.spark.internal.Logging.logInfo$(Logging.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSetManager.logInfo(TaskSetManager.scala:54)
>   at 
> org.apache.spark.scheduler.TaskSetManager.$anonfun$resourceOffer$2(TaskSetManager.scala:484)
>   at scala.Option.map(Option.scala:230)
>   at 
> org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:444)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$2(TaskSchedulerImpl.scala:397)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$2$adapted(TaskSchedulerImpl.scala:392)
>   at scala.Option.foreach(Option.scala:407)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$1(TaskSchedulerImpl.scala:392)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:383)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$20(TaskSchedulerImpl.scala:581)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$20$adapted(TaskSchedulerImpl.scala:576)
>   at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$16(TaskSchedulerImpl.scala:576)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$16$adapted(TaskSchedulerImpl.scala:547)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:547)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.$anonfun$makeOffers$5(CoarseGrainedSchedulerBackend.scala:340)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$$withLock(CoarseGrainedSchedulerBackend.scala:904)
>   at 
> 

[jira] [Commented] (SPARK-35926) Support YearMonthIntervalType in width-bucket function

2021-06-29 Thread PengLei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371756#comment-17371756
 ] 

PengLei commented on SPARK-35926:
-

[~code_kr_dev_s] Sorry, I will do it after the 3.2 release.

> Support YearMonthIntervalType in width-bucket function
> --
>
> Key: SPARK-35926
> URL: https://issues.apache.org/jira/browse/SPARK-35926
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: PengLei
>Priority: Major
>
> Currently, width_bucket supports the argument types [DoubleType, DoubleType, 
> DoubleType, LongType].
> We hope to also support [YearMonthIntervalType, YearMonthIntervalType, 
> YearMonthIntervalType, LongType] (see the sketch below).
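
As with the day-time case, a hedged sketch: the first query is the numeric
width_bucket call that already exists, and the commented query is only the
hoped-for year-month interval form this ticket asks for.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Numeric operands are bucketed into 5 equal-width buckets (works today).
spark.sql("SELECT width_bucket(5.3, 0.2, 10.6, 5)").show()

# Hoped-for year-month interval form (hypothetical, not yet supported):
# spark.sql("SELECT width_bucket(INTERVAL '3' MONTH, INTERVAL '0' MONTH, "
#           "INTERVAL '2' YEAR, 24)").show()
{code}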



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34549) Upgrade aws kinesis to 1.14.0 and java sdk 1.11.844

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371755#comment-17371755
 ] 

Apache Spark commented on SPARK-34549:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33145

> Upgrade aws kinesis to 1.14.0 and java sdk 1.11.844
> ---
>
> Key: SPARK-34549
> URL: https://issues.apache.org/jira/browse/SPARK-34549
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> Upgrade the AWS Kinesis client and the AWS Java SDK to catch up with the minimum
> SDK requirement for IAM roles for service accounts: 
> https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-minimum-sdk.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34549) Upgrade aws kinesis to 1.14.0 and java sdk 1.11.844

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371754#comment-17371754
 ] 

Apache Spark commented on SPARK-34549:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33145

> Upgrade aws kinesis to 1.14.0 and java sdk 1.11.844
> ---
>
> Key: SPARK-34549
> URL: https://issues.apache.org/jira/browse/SPARK-34549
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> Upgrade the AWS Kinesis client and the AWS Java SDK to catch up with the minimum
> SDK requirement for IAM roles for service accounts: 
> https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-minimum-sdk.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35873) Cleanup the version logic from the pandas API on Spark

2021-06-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35873:


Assignee: Haejoon Lee

> Cleanup the version logic from the pandas API on Spark
> --
>
> Key: SPARK-35873
> URL: https://issues.apache.org/jira/browse/SPARK-35873
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> There is version-checking logic in the pyspark.pandas __init__.py script, but 
> we don't need it anymore since the project has been ported into PySpark, and we 
> should follow PySpark's own version check logic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35873) Cleanup the version logic from the pandas API on Spark

2021-06-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35873.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33128
[https://github.com/apache/spark/pull/33128]

> Cleanup the version logic from the pandas API on Spark
> --
>
> Key: SPARK-35873
> URL: https://issues.apache.org/jira/browse/SPARK-35873
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.2.0
>
>
> There is version-checking logic in the pyspark.pandas __init__.py script, but 
> we don't need it anymore since the project has been ported into PySpark, and we 
> should follow PySpark's own version check logic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35784) Implementation for RocksDB instance

2021-06-29 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-35784.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32928
[https://github.com/apache/spark/pull/32928]

> Implementation for RocksDB instance
> ---
>
> Key: SPARK-35784
> URL: https://issues.apache.org/jira/browse/SPARK-35784
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.2.0
>
>
> The implementation of the RocksDB instance used by the RocksDB state store. 
> It acts as a handler for the RocksDB instance and the RocksDBFileManager.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35784) Implementation for RocksDB instance

2021-06-29 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-35784:
---

Assignee: Yuanjian Li

> Implementation for RocksDB instance
> ---
>
> Key: SPARK-35784
> URL: https://issues.apache.org/jira/browse/SPARK-35784
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>
> The implementation of the RocksDB instance used by the RocksDB state store. 
> It acts as a handler for the RocksDB instance and the RocksDBFileManager.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35721) Path level discover for python unittests

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35721:


Assignee: (was: Apache Spark)

> Path level discover for python unittests
> 
>
> Key: SPARK-35721
> URL: https://issues.apache.org/jira/browse/SPARK-35721
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Major
>
> Currently we have to list Python test cases manually whenever we add a new 
> test case. Sometimes we forget to add a test case to the module list, and then 
> that test case is never executed.
> For example:
>  * pyspark-core pyspark.tests.test_pin_thread
> Thus we need some auto-discovery mechanism that finds all test cases rather 
> than requiring every case to be specified by hand.
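
A minimal sketch of the auto-discovery idea described above, using only the
standard library; the start directory and file pattern are assumptions for
illustration and are not the actual layout of Spark's test infrastructure.
{code:python}
import unittest

def discover_pyspark_tests(start_dir="python/pyspark", pattern="test_*.py"):
    """Collect every test module matching the pattern under start_dir."""
    return unittest.defaultTestLoader.discover(start_dir, pattern=pattern)

if __name__ == "__main__":
    suite = discover_pyspark_tests()
    unittest.TextTestRunner(verbosity=2).run(suite)
{code}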



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35721) Path level discover for python unittests

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35721:


Assignee: Apache Spark

> Path level discover for python unittests
> 
>
> Key: SPARK-35721
> URL: https://issues.apache.org/jira/browse/SPARK-35721
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>
> Currently we have to list Python test cases manually whenever we add a new 
> test case. Sometimes we forget to add a test case to the module list, and then 
> that test case is never executed.
> For example:
>  * pyspark-core pyspark.tests.test_pin_thread
> Thus we need some auto-discovery mechanism that finds all test cases rather 
> than requiring every case to be specified by hand.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35943) Introduce Axis type alias.

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371738#comment-17371738
 ] 

Apache Spark commented on SPARK-35943:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/33144

> Introduce Axis type alias.
> --
>
> Key: SPARK-35943
> URL: https://issues.apache.org/jira/browse/SPARK-35943
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35943) Introduce Axis type alias.

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35943:


Assignee: Apache Spark

> Introduce Axis type alias.
> --
>
> Key: SPARK-35943
> URL: https://issues.apache.org/jira/browse/SPARK-35943
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35943) Introduce Axis type alias.

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35943:


Assignee: (was: Apache Spark)

> Introduce Axis type alias.
> --
>
> Key: SPARK-35943
> URL: https://issues.apache.org/jira/browse/SPARK-35943
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35943) Introduce Axis type alias.

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371737#comment-17371737
 ] 

Apache Spark commented on SPARK-35943:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/33144

> Introduce Axis type alias.
> --
>
> Key: SPARK-35943
> URL: https://issues.apache.org/jira/browse/SPARK-35943
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35933) PartitionFilters and pushFilters not applied to window functions

2021-06-29 Thread Shreyas Kothavade (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shreyas Kothavade updated SPARK-35933:
--
Affects Version/s: 2.4.8

> PartitionFilters and pushFilters not applied to window functions
> 
>
> Key: SPARK-35933
> URL: https://issues.apache.org/jira/browse/SPARK-35933
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.1.2
>Reporter: Shreyas Kothavade
>Priority: Major
>
> Spark does not apply partition and pushed filters when the partition by 
> column and window function partition columns are not the same. For example, 
> in the code below, the data frame is created with a partition on "id". And I 
> use the partitioned data frame to calculate lag which is partitioned by 
> "name". In this case, the query plan shows the partitionFilters and pushed 
> Filters as empty.
> {code:java}
> spark
>   .createDataFrame(
>     Seq(
>       Person(
>         1,
>         "Andy",
>         new Timestamp(1499955986039L),
>         new Timestamp(1499955982342L)
>       ),
>       Person(
>         2,
>         "Jeff",
>         new Timestamp(1499955986339L),
>         new Timestamp(149995598L)
>       )
>     )
>   )
>  .write
>   .partitionBy("id")
>   .mode(SaveMode.Append)
>   .parquet("spark-warehouse/people")
> val dfPeople =
>   spark.read.parquet("spark-warehouse/people")
> dfPeople
>   .select(
>     $"id",
>     $"name",
>     lag(col("ts2"), 1).over(Window.partitionBy("name").orderBy("ts"))
>   )
>   .filter($"id" === 1)
>   .explain()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35933) PartitionFilters and pushFilters not applied to window functions

2021-06-29 Thread Shreyas Kothavade (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shreyas Kothavade updated SPARK-35933:
--
Description: 
Spark does not apply partition and pushed filters when the partition by column 
and window function partition columns are not the same. For example, in the 
code below, the data frame is created with a partition on "id". And I use the 
partitioned data frame to calculate lag which is partitioned by "name". In this 
case, the query plan shows the partitionFilters and pushed Filters as empty.
{code:java}
spark
  .createDataFrame(
    Seq(
      Person(
        1,
        "Andy",
        new Timestamp(1499955986039L),
        new Timestamp(1499955982342L)
      ),
      Person(
        2,
        "Jeff",
        new Timestamp(1499955986339L),
        new Timestamp(149995598L)
      )
    )
  )
 .write
  .partitionBy("id")
  .mode(SaveMode.Append)
  .parquet("spark-warehouse/people")

val dfPeople =
  spark.read.parquet("spark-warehouse/people")
dfPeople
  .select(
    $"id",
    $"name",
    lag(col("ts2"), 1).over(Window.partitionBy("name").orderBy("ts"))
  )
  .filter($"id" === 1)
  .explain()
{code}

  was:
Spark does not apply partition and pushed filters when the partition by column 
and window function partition columns are not the same. For example, in the 
code below, the data frame is created with a partition on "id". And I use the 
partitioned data frame to calculate lag which is partitioned by "name". In this 
case, the query plan shows the partitionFilters and pushed Filters as empty.
{code:java}
spark
  .createDataFrame(
    Seq(
      Person(
        1,
        "Andy",
        new Timestamp(1499955986039L),
        new Timestamp(1499955982342L)
      ),
      Person(
        2,
        "Jeff",
        new Timestamp(1499955986339L),
        new Timestamp(149995598L)
      )
    )
  )
  .write
  .partitionBy("id")
  .mode(SaveMode.Append)
  .parquet("spark-warehouse/people")

val dfPeople =
  spark.read.parquet("spark-warehouse/people")
dfPeople
  .select(
    $"id",
    $"name",
    lag(col("ts2"), 1).over(Window.partitionBy("name").orderBy("ts"))
  )
  .filter($"name" === "Jeff")
  .explain()
{code}


> PartitionFilters and pushFilters not applied to window functions
> 
>
> Key: SPARK-35933
> URL: https://issues.apache.org/jira/browse/SPARK-35933
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Shreyas Kothavade
>Priority: Major
>
> Spark does not apply partition and pushed filters when the partition by 
> column and window function partition columns are not the same. For example, 
> in the code below, the data frame is created with a partition on "id". And I 
> use the partitioned data frame to calculate lag which is partitioned by 
> "name". In this case, the query plan shows the partitionFilters and pushed 
> Filters as empty.
> {code:java}
> spark
>   .createDataFrame(
>     Seq(
>       Person(
>         1,
>         "Andy",
>         new Timestamp(1499955986039L),
>         new Timestamp(1499955982342L)
>       ),
>       Person(
>         2,
>         "Jeff",
>         new Timestamp(1499955986339L),
>         new Timestamp(149995598L)
>       )
>     )
>   )
>  .write
>   .partitionBy("id")
>   .mode(SaveMode.Append)
>   .parquet("spark-warehouse/people")
> val dfPeople =
>   spark.read.parquet("spark-warehouse/people")
> dfPeople
>   .select(
>     $"id",
>     $"name",
>     lag(col("ts2"), 1).over(Window.partitionBy("name").orderBy("ts"))
>   )
>   .filter($"id" === 1)
>   .explain()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35337) pandas API on Spark: Separate basic operations into data type based structures

2021-06-29 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-35337:
--
Summary: pandas API on Spark: Separate basic operations into data type 
based structures  (was: pandas APIs on Spark: Separate basic operations into 
data type based structures)

> pandas API on Spark: Separate basic operations into data type based structures
> --
>
> Key: SPARK-35337
> URL: https://issues.apache.org/jira/browse/SPARK-35337
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> Currently, each basic operation is defined in a single function for all data 
> types, so it is difficult to vary its behavior by data type. For example, the 
> binary operation Series + Series behaves differently depending on the data type, 
> e.g., adding for numerical operands, concatenating for string operands, etc. The 
> behavior differences are handled by if-else branches inside the function, which 
> is messy and makes the logic difficult to maintain or reuse.
> We should provide an infrastructure to manage the differences in these 
> operations (a rough sketch follows below).
> Please refer to [pandas APIs on Spark: Separate basic operations into data 
> type based 
> structures|https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing]
>  for details.
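
A rough sketch of the kind of infrastructure the description proposes; every
class and function name here is hypothetical and is not the actual
pyspark.pandas design.
{code:python}
from pyspark.sql import Column, functions as F
from pyspark.sql.types import DataType, NumericType, StringType

class DataTypeOps:
    """Fallback: the operation is not defined for this data type."""
    def add(self, left: Column, right: Column) -> Column:
        raise TypeError("addition is not supported for this data type")

class NumericOps(DataTypeOps):
    def add(self, left: Column, right: Column) -> Column:
        return left + right  # plain numeric addition

class StringOps(DataTypeOps):
    def add(self, left: Column, right: Column) -> Column:
        return F.concat(left, right)  # string "+" means concatenation

def ops_for(dtype: DataType) -> DataTypeOps:
    """Pick the per-type operations object instead of branching inside add()."""
    if isinstance(dtype, NumericType):
        return NumericOps()
    if isinstance(dtype, StringType):
        return StringOps()
    return DataTypeOps()
{code}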



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35464) pandas API on Spark: Enable mypy check "disallow_untyped_defs" for main codes.

2021-06-29 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-35464:
--
Summary: pandas API on Spark: Enable mypy check "disallow_untyped_defs" for 
main codes.  (was: pandas APIs on Spark: Enable mypy check 
"disallow_untyped_defs" for main codes.)

> pandas API on Spark: Enable mypy check "disallow_untyped_defs" for main codes.
> --
>
> Key: SPARK-35464
> URL: https://issues.apache.org/jira/browse/SPARK-35464
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently many functions in the main code are still missing type annotations, 
> and the {{mypy}} check "disallow_untyped_defs" is disabled.
> We should add more type annotations and enable the {{mypy}} check (see the 
> example below).
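
As a concrete illustration (the function is hypothetical, not taken from the
codebase): with "disallow_untyped_defs = True" in the mypy configuration, the
commented, unannotated form is rejected, and the annotated form is what the
sub-tasks add.
{code:python}
# Rejected by mypy once disallow_untyped_defs is enabled for the module:
#
#     def rename_column(name, prefix):
#         return prefix + "_" + name

# Accepted: the same function with explicit parameter and return annotations.
def rename_column(name: str, prefix: str) -> str:
    """Return the column name with the given prefix attached."""
    return prefix + "_" + name
{code}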



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35944) Introduce type aliases for names or labels.

2021-06-29 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-35944:
-

 Summary: Introduce type aliases for names or labels.
 Key: SPARK-35944
 URL: https://issues.apache.org/jira/browse/SPARK-35944
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35942) Refine Scalar type alias and reuse it

2021-06-29 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-35942:
-

 Summary: Refine Scalar type alias and reuse it
 Key: SPARK-35942
 URL: https://issues.apache.org/jira/browse/SPARK-35942
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35943) Introduce Axis type alias.

2021-06-29 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-35943:
-

 Summary: Introduce Axis type alias.
 Key: SPARK-35943
 URL: https://issues.apache.org/jira/browse/SPARK-35943
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Takuya Ueshin
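
The ticket body is empty; purely as an illustration, here is one plausible,
hypothetical shape for such an alias and how it might be used. The actual
definition in pyspark.pandas may differ.
{code:python}
from typing import Union

# Hypothetical alias: pandas-style axis arguments accept ints or names.
Axis = Union[int, str]

def normalize_axis(axis: Axis) -> int:
    """Map either spelling of an axis to its integer form (illustrative only)."""
    if axis in (0, "index"):
        return 0
    if axis in (1, "columns"):
        return 1
    raise ValueError("No axis named {}".format(axis))
{code}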






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35941) pandas API on Spark: Improve type hints.

2021-06-29 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-35941:
-

 Summary: pandas API on Spark: Improve type hints.
 Key: SPARK-35941
 URL: https://issues.apache.org/jira/browse/SPARK-35941
 Project: Spark
  Issue Type: Umbrella
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32922) Add support for ShuffleBlockFetcherIterator to read from merged shuffle partitions and to fallback to original shuffle blocks if encountering failures

2021-06-29 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-32922.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32140
[https://github.com/apache/spark/pull/32140]

> Add support for ShuffleBlockFetcherIterator to read from merged shuffle 
> partitions and to fallback to original shuffle blocks if encountering failures
> --
>
> Key: SPARK-32922
> URL: https://issues.apache.org/jira/browse/SPARK-32922
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 3.2.0
>
>
> With the extended MapOutputTracker, the reducers can now get the task input 
> data from the merged shuffle partitions for more efficient shuffle data fetch. 
> The reducers should also be able to fall back to fetching the original shuffle 
> blocks if they encounter failures when fetching the merged shuffle partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32922) Add support for ShuffleBlockFetcherIterator to read from merged shuffle partitions and to fallback to original shuffle blocks if encountering failures

2021-06-29 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-32922:
---

Assignee: Chandni Singh

> Add support for ShuffleBlockFetcherIterator to read from merged shuffle 
> partitions and to fallback to original shuffle blocks if encountering failures
> --
>
> Key: SPARK-32922
> URL: https://issues.apache.org/jira/browse/SPARK-32922
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Chandni Singh
>Priority: Major
>
> With the extended MapOutputTracker, the reducers can now get the task input 
> data from the merged shuffle partitions for more efficient shuffle data fetch. 
> The reducers should also be able to fall back to fetching the original shuffle 
> blocks if they encounter failures when fetching the merged shuffle partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35921) ${spark.yarn.isHadoopProvided} in config.properties is not edited if build with SBT

2021-06-29 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-35921.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33121
[https://github.com/apache/spark/pull/33121]

> ${spark.yarn.isHadoopProvided} in config.properties is not edited if build 
> with SBT
> ---
>
> Key: SPARK-35921
> URL: https://issues.apache.org/jira/browse/SPARK-35921
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0
>
>
> The yarn sub-module contains a config.properties file:
> {code}
> spark.yarn.isHadoopProvided = ${spark.yarn.isHadoopProvided}
> {code}
> The ${spark.yarn.isHadoopProvided} part is replaced with true or false at build 
> time, depending on whether Hadoop is provided (specified by -Phadoop-provided).
> The edited config.properties is loaded at runtime to control how the 
> Hadoop-related classpath is populated.
> This replacement works when building with Maven but not with SBT.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33995) Make datetime addition easier for years, weeks, hours, minutes, and seconds

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371611#comment-17371611
 ] 

Apache Spark commented on SPARK-33995:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/33143

> Make datetime addition easier for years, weeks, hours, minutes, and seconds
> ---
>
> Key: SPARK-33995
> URL: https://issues.apache.org/jira/browse/SPARK-33995
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Matthew Powers
>Assignee: Matthew Powers
>Priority: Minor
> Fix For: 3.2.0
>
>
> There are add_months and date_add functions that make it easy to perform 
> datetime addition with months and days, but there isn't an easy way to 
> perform datetime addition with years, weeks, hours, minutes, or seconds with 
> the Scala/Python/R APIs.
> Users need to write code like expr("first_datetime + INTERVAL 2 hours") to 
> add two hours to a timestamp with the Scala API, which isn't desirable.  We 
> don't want to make Scala users manipulate SQL strings.
> We can expose the [make_interval SQL 
> function|https://github.com/apache/spark/pull/26446/files] to make any 
> combination of datetime addition possible.  That'll make tons of different 
> datetime addition operations possible and will be valuable for a wide array 
> of users.
> make_interval takes 7 arguments: years, months, weeks, days, hours, mins, and 
> secs.
> There are different ways to expose the make_interval functionality to 
> Scala/Python/R users:
>  * Option 1: Single make_interval function that takes 7 arguments
>  * Option 2: expose a few interval functions
>  ** make_date_interval function that takes years, months, days
>  ** make_time_interval function that takes hours, minutes, seconds
>  ** make_datetime_interval function that takes years, months, days, hours, 
> minutes, seconds
>  * Option 3: expose add_years, add_months, add_days, add_weeks, add_hours, 
> add_minutes, and add_seconds as Column methods.  
>  * Option 4: Expose the add_years, add_hours, etc. as column functions.  
> add_weeks and date_add have already been exposed in this manner.  
> Option 1 is nice from a maintenance perspective cause it's a single function, 
> but it's not standard from a user perspective.  Most languages support 
> datetime instantiation with these arguments: years, months, days, hours, 
> minutes, seconds.  Mixing weeks into the equation is not standard.
> As a user, Option 3 would be my preference.  
> col("first_datetime").addHours(2).addSeconds(30) is easy for me to remember 
> and type.  col("first_datetime") + make_time_interval(lit(2), lit(0), 
> lit(30)) isn't as nice.  col("first_datetime") + make_interval(lit(0), 
> lit(0), lit(0), lit(0), lit(2), lit(0), lit(30)) is harder still.
> Any of these options is an improvement to the status quo.  Let me know what 
> option you think is best and then I'll make a PR to implement it, building 
> off of Max's foundational work of course ;)
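
For concreteness, a small PySpark sketch of the status quo described above:
both additions go through SQL expression strings via expr, which is exactly the
ergonomic gap this ticket discusses. The proposed addHours/addSeconds-style
column methods do not exist yet; the make_interval SQL function does.
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.sql("SELECT TIMESTAMP '2021-06-29 10:00:00' AS first_datetime")

df.select(
    # Today's workaround: embed SQL interval syntax in a string.
    F.expr("first_datetime + INTERVAL 2 hours").alias("plus_2_hours"),
    # make_interval(years, months, weeks, days, hours, mins, secs):
    F.expr("first_datetime + make_interval(0, 0, 0, 0, 2, 0, 30)").alias("plus_2h_30s"),
).show(truncate=False)
{code}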



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33995) Make datetime addition easier for years, weeks, hours, minutes, and seconds

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371610#comment-17371610
 ] 

Apache Spark commented on SPARK-33995:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/33143

> Make datetime addition easier for years, weeks, hours, minutes, and seconds
> ---
>
> Key: SPARK-33995
> URL: https://issues.apache.org/jira/browse/SPARK-33995
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Matthew Powers
>Assignee: Matthew Powers
>Priority: Minor
> Fix For: 3.2.0
>
>
> There are add_months and date_add functions that make it easy to perform 
> datetime addition with months and days, but there isn't an easy way to 
> perform datetime addition with years, weeks, hours, minutes, or seconds with 
> the Scala/Python/R APIs.
> Users need to write code like expr("first_datetime + INTERVAL 2 hours") to 
> add two hours to a timestamp with the Scala API, which isn't desirable.  We 
> don't want to make Scala users manipulate SQL strings.
> We can expose the [make_interval SQL 
> function|https://github.com/apache/spark/pull/26446/files] to make any 
> combination of datetime addition possible.  That'll make tons of different 
> datetime addition operations possible and will be valuable for a wide array 
> of users.
> make_interval takes 7 arguments: years, months, weeks, days, hours, mins, and 
> secs.
> There are different ways to expose the make_interval functionality to 
> Scala/Python/R users:
>  * Option 1: Single make_interval function that takes 7 arguments
>  * Option 2: expose a few interval functions
>  ** make_date_interval function that takes years, months, days
>  ** make_time_interval function that takes hours, minutes, seconds
>  ** make_datetime_interval function that takes years, months, days, hours, 
> minutes, seconds
>  * Option 3: expose add_years, add_months, add_days, add_weeks, add_hours, 
> add_minutes, and add_seconds as Column methods.  
>  * Option 4: Expose the add_years, add_hours, etc. as column functions.  
> add_weeks and date_add have already been exposed in this manner.  
> Option 1 is nice from a maintenance perspective cause it's a single function, 
> but it's not standard from a user perspective.  Most languages support 
> datetime instantiation with these arguments: years, months, days, hours, 
> minutes, seconds.  Mixing weeks into the equation is not standard.
> As a user, Option 3 would be my preference.  
> col("first_datetime").addHours(2).addSeconds(30) is easy for me to remember 
> and type.  col("first_datetime") + make_time_interval(lit(2), lit(0), 
> lit(30)) isn't as nice.  col("first_datetime") + make_interval(lit(0), 
> lit(0), lit(0), lit(0), lit(2), lit(0), lit(30)) is harder still.
> Any of these options is an improvement to the status quo.  Let me know what 
> option you think is best and then I'll make a PR to implement it, building 
> off of Max's foundational work of course ;)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35939) Deprecate Python 3.6 in Spark documentation

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371583#comment-17371583
 ] 

Apache Spark commented on SPARK-35939:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/33141

> Deprecate Python 3.6 in Spark documentation
> ---
>
> Key: SPARK-35939
> URL: https://issues.apache.org/jira/browse/SPARK-35939
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Deprecate Python 3.6 in Spark documentation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33031) scheduler with blacklisting doesn't appear to pick up new executor added

2021-06-29 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371592#comment-17371592
 ] 

Thomas Graves commented on SPARK-33031:
---

I do not think it's resolved, but I haven't tried it lately. It sounds like it's 
just a UI issue, so if you want to try it out and still see the problem, feel 
free to work on it.

> scheduler with blacklisting doesn't appear to pick up new executor added
> 
>
> Key: SPARK-33031
> URL: https://issues.apache.org/jira/browse/SPARK-33031
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Thomas Graves
>Priority: Critical
>
> I was running a test with blacklisting in standalone mode and all the executors 
> were initially blacklisted. Then one of the executors died and we got allocated 
> another one. The scheduler did not appear to pick up the new one and try to 
> schedule on it, though.
> You can reproduce this by starting a master and slave on a single node, then 
> launching a shell where you will get multiple executors (in this case I got 
> 3):
> $SPARK_HOME/bin/spark-shell --master spark://yourhost:7077 --executor-cores 4 
> --conf spark.blacklist.enabled=true
> From shell run:
> {code:java}
> import org.apache.spark.TaskContext
> val rdd = sc.makeRDD(1 to 1000, 5).mapPartitions { it =>
>  val context = TaskContext.get()
>  if (context.attemptNumber() < 2) {
>  throw new Exception("test attempt num")
>  }
>  it
> }
> rdd.collect(){code}
>  
> Note that I tried both with and without dynamic allocation enabled.
>  
> You can see screen shot related on 
> https://issues.apache.org/jira/browse/SPARK-33029



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35940) Refactor EquivalentExpressions to make it more efficient

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371590#comment-17371590
 ] 

Apache Spark commented on SPARK-35940:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/33142

> Refactor EquivalentExpressions to make it more efficient
> 
>
> Key: SPARK-35940
> URL: https://issues.apache.org/jira/browse/SPARK-35940
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35940) Refactor EquivalentExpressions to make it more efficient

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371589#comment-17371589
 ] 

Apache Spark commented on SPARK-35940:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/33142

> Refactor EquivalentExpressions to make it more efficient
> 
>
> Key: SPARK-35940
> URL: https://issues.apache.org/jira/browse/SPARK-35940
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35940) Refactor EquivalentExpressions to make it more efficient

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35940:


Assignee: (was: Apache Spark)

> Refactor EquivalentExpressions to make it more efficient
> 
>
> Key: SPARK-35940
> URL: https://issues.apache.org/jira/browse/SPARK-35940
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35940) Refactor EquivalentExpressions to make it more efficient

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35940:


Assignee: Apache Spark

> Refactor EquivalentExpressions to make it more efficient
> 
>
> Key: SPARK-35940
> URL: https://issues.apache.org/jira/browse/SPARK-35940
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35721) Path level discover for python unittests

2021-06-29 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-35721:
--
Fix Version/s: (was: 3.2.0)

> Path level discover for python unittests
> 
>
> Key: SPARK-35721
> URL: https://issues.apache.org/jira/browse/SPARK-35721
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Major
>
> Currently we have to list Python test cases manually whenever we add a new 
> test case. Sometimes we forget to add a test case to the module list, and then 
> that test case is never executed.
> For example:
>  * pyspark-core pyspark.tests.test_pin_thread
> Thus we need some auto-discovery mechanism that finds all test cases rather 
> than requiring every case to be specified by hand.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-35721) Path level discover for python unittests

2021-06-29 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reopened SPARK-35721:
---
  Assignee: (was: Yikun Jiang)

The commit was reverted 
https://github.com/apache/spark/commit/1f6e2f55d7896c9128f80a8f1ed4c317244d013b.
See also https://github.com/apache/spark/pull/32867#issuecomment-870809700

> Path level discover for python unittests
> 
>
> Key: SPARK-35721
> URL: https://issues.apache.org/jira/browse/SPARK-35721
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently we have to list Python test cases manually whenever we add a new 
> test case. Sometimes we forget to add a test case to the module list, and then 
> that test case is never executed.
> For example:
>  * pyspark-core pyspark.tests.test_pin_thread
> Thus we need some auto-discovery mechanism that finds all test cases rather 
> than requiring every case to be specified by hand.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35940) Refactor EquivalentExpressions to make it more efficient

2021-06-29 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-35940:
---

 Summary: Refactor EquivalentExpressions to make it more efficient
 Key: SPARK-35940
 URL: https://issues.apache.org/jira/browse/SPARK-35940
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35939) Deprecate Python 3.6 in Spark documentation

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35939:


Assignee: (was: Apache Spark)

> Deprecate Python 3.6 in Spark documentation
> ---
>
> Key: SPARK-35939
> URL: https://issues.apache.org/jira/browse/SPARK-35939
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Deprecate Python 3.6 in Spark documentation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35939) Deprecate Python 3.6 in Spark documentation

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35939:


Assignee: Apache Spark

> Deprecate Python 3.6 in Spark documentation
> ---
>
> Key: SPARK-35939
> URL: https://issues.apache.org/jira/browse/SPARK-35939
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Deprecate Python 3.6 in Spark documentation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35881) [SQL] AQE does not support columnar execution for the final query stage

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35881:


Assignee: Apache Spark

> [SQL] AQE does not support columnar execution for the final query stage
> ---
>
> Key: SPARK-35881
> URL: https://issues.apache.org/jira/browse/SPARK-35881
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Andy Grove
>Assignee: Apache Spark
>Priority: Major
>
> In AdaptiveSparkPlanExec, a query is broken down into stages and these stages 
> are executed until the entire query has been executed. These stages can be 
> row-based or columnar. However, the final stage, produced by the private 
> getFinalPhysicalPlan method is always assumed to be row-based. The only way 
> to execute the final stage is by calling the various doExecute methods on 
> AdaptiveSparkPlanExec, and doExecuteColumnar is not implemented. The 
> supportsColumnar method also always returns false.
> In the RAPIDS Accelerator for Apache Spark, we currently call the private 
> getFinalPhysicalPlan method using reflection and then determine if that plan 
> is columnar or not, and then call the appropriate doExecute method, bypassing 
> the doExecute methods on AdaptiveSparkPlanExec. We would like a supported 
> mechanism for executing a columnar AQE plan so that we do not need to use 
> reflection.
>  
>  
>  
>  
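
For readers who want to see what the workaround described above looks like in practice, here is a minimal, hedged sketch in Scala. Only getFinalPhysicalPlan, supportsColumnar, execute and executeColumnar come from the description; the method signature, variable names and the Either-based return type are assumptions for illustration, not the actual RAPIDS plugin code.

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec
import org.apache.spark.sql.vectorized.ColumnarBatch

// Sketch of the reflection-based workaround: obtain the private final plan,
// then dispatch to the columnar or row-based execution path ourselves,
// bypassing AdaptiveSparkPlanExec.doExecute/doExecuteColumnar.
def executeFinalPlan(aqe: AdaptiveSparkPlanExec): Either[RDD[InternalRow], RDD[ColumnarBatch]] = {
  // getFinalPhysicalPlan is private, so it has to be invoked reflectively.
  val m = classOf[AdaptiveSparkPlanExec].getDeclaredMethod("getFinalPhysicalPlan")
  m.setAccessible(true)
  val finalPlan = m.invoke(aqe).asInstanceOf[SparkPlan]
  if (finalPlan.supportsColumnar) Right(finalPlan.executeColumnar())
  else Left(finalPlan.execute())
}
{code}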



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35881) [SQL] AQE does not support columnar execution for the final query stage

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35881:


Assignee: (was: Apache Spark)

> [SQL] AQE does not support columnar execution for the final query stage
> ---
>
> Key: SPARK-35881
> URL: https://issues.apache.org/jira/browse/SPARK-35881
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Andy Grove
>Priority: Major
>
> In AdaptiveSparkPlanExec, a query is broken down into stages and these stages 
> are executed until the entire query has been executed. These stages can be 
> row-based or columnar. However, the final stage, produced by the private 
> getFinalPhysicalPlan method is always assumed to be row-based. The only way 
> to execute the final stage is by calling the various doExecute methods on 
> AdaptiveSparkPlanExec, and doExecuteColumnar is not implemented. The 
> supportsColumnar method also always returns false.
> In the RAPIDS Accelerator for Apache Spark, we currently call the private 
> getFinalPhysicalPlan method using reflection and then determine if that plan 
> is columnar or not, and then call the appropriate doExecute method, bypassing 
> the doExecute methods on AdaptiveSparkPlanExec. We would like a supported 
> mechanism for executing a columnar AQE plan so that we do not need to use 
> reflection.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35881) [SQL] AQE does not support columnar execution for the final query stage

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371582#comment-17371582
 ] 

Apache Spark commented on SPARK-35881:
--

User 'andygrove' has created a pull request for this issue:
https://github.com/apache/spark/pull/33140

> [SQL] AQE does not support columnar execution for the final query stage
> ---
>
> Key: SPARK-35881
> URL: https://issues.apache.org/jira/browse/SPARK-35881
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Andy Grove
>Priority: Major
>
> In AdaptiveSparkPlanExec, a query is broken down into stages and these stages 
> are executed until the entire query has been executed. These stages can be 
> row-based or columnar. However, the final stage, produced by the private 
> getFinalPhysicalPlan method is always assumed to be row-based. The only way 
> to execute the final stage is by calling the various doExecute methods on 
> AdaptiveSparkPlanExec, and doExecuteColumnar is not implemented. The 
> supportsColumnar method also always returns false.
> In the RAPIDS Accelerator for Apache Spark, we currently call the private 
> getFinalPhysicalPlan method using reflection and then determine if that plan 
> is columnar or not, and then call the appropriate doExecute method, bypassing 
> the doExecute methods on AdaptiveSparkPlanExec. We would like a supported 
> mechanism for executing a columnar AQE plan so that we do not need to use 
> reflection.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35925) Support DayTimeIntervalType in width-bucket function

2021-06-29 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371576#comment-17371576
 ] 

Max Gekk commented on SPARK-35925:
--

[~xiaopenglei] I don't mind this, but before the 3.2 release I would rather 
focus on feature parity with the old (mixed) calendar type CalendarIntervalType. 
Users should be able to substitute the old intervals with the new ANSI intervals.

> Support DayTimeIntervalType in width-bucket function
> 
>
> Key: SPARK-35925
> URL: https://issues.apache.org/jira/browse/SPARK-35925
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: PengLei
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, width_bucket supports the signature [DoubleType, DoubleType, DoubleType, 
> LongType]; we would like it to also support [DayTimeIntervalType, DayTimeIntervalType, 
> DayTimeIntervalType, LongType].
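
For concreteness, a minimal spark-shell sketch of the two call shapes follows; the double-typed call is the form supported today, while the day-time interval form is the proposed overload from this ticket and is shown only as a hypothetical usage until it is implemented.

{code:scala}
// Supported today: width_bucket(value, min, max, numBuckets) over doubles.
spark.sql("SELECT width_bucket(5.3, 0.2, 10.6, 5)").show()

// Proposed by this ticket (hypothetical until supported): the same function over
// day-time intervals, e.g. bucketing a duration into 10 equal-width buckets.
// spark.sql(
//   "SELECT width_bucket(INTERVAL '3' DAY, INTERVAL '0' DAY, INTERVAL '10' DAY, 10)"
// ).show()
{code}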



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33349) ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed

2021-06-29 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371569#comment-17371569
 ] 

Dongjoon Hyun commented on SPARK-33349:
---

Hi, [~code_kr_dev_s]. There is no PR in the community for this. Feel free to 
make a PR.

> ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed
> --
>
> Key: SPARK-33349
> URL: https://issues.apache.org/jira/browse/SPARK-33349
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1, 3.0.2, 3.1.0
>Reporter: Nicola Bova
>Priority: Critical
>
> I launch my spark application with the 
> [spark-on-kubernetes-operator|https://github.com/GoogleCloudPlatform/spark-on-k8s-operator]
>  with the following yaml file:
> {code:yaml}
> apiVersion: sparkoperator.k8s.io/v1beta2
> kind: SparkApplication
> metadata:
>    name: spark-kafka-streamer-test
>    namespace: kafka2hdfs
> spec: 
>    type: Scala
>    mode: cluster
>    image: /spark:3.0.2-SNAPSHOT-2.12-0.1.0
>    imagePullPolicy: Always
>    timeToLiveSeconds: 259200
>    mainClass: path.to.my.class.KafkaStreamer
>    mainApplicationFile: spark-kafka-streamer_2.12-spark300-assembly.jar
>    sparkVersion: 3.0.1
>    restartPolicy:
>  type: Always
>    sparkConf:
>  "spark.kafka.consumer.cache.capacity": "8192"
>  "spark.kubernetes.memoryOverheadFactor": "0.3"
>    deps:
>    jars:
>  - my
>  - jar
>  - list
>    hadoopConfigMap: hdfs-config
>    driver:
>  cores: 4
>  memory: 12g
>  labels:
>    version: 3.0.1
>  serviceAccount: default
>  javaOptions: 
> "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties"
>   executor:
>  instances: 4
>     cores: 4
>     memory: 16g
>     labels:
>   version: 3.0.1
>     javaOptions: 
> "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties"
> {code}
>  I have tried with both Spark `3.0.1` and `3.0.2-SNAPSHOT` with the ["Restart 
> the watcher when we receive a version changed from 
> k8s"|https://github.com/apache/spark/pull/29533] patch.
> This is the driver log:
> {code}
> 20/11/04 12:16:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> ... // my app log, it's a structured streaming app reading from kafka and 
> writing to hdfs
> 20/11/04 13:12:12 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has 
> been closed (this is expected if the application is shutting down.)
> io.fabric8.kubernetes.client.KubernetesClientException: too old resource 
> version: 1574101276 (1574213896)
>  at 
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259)
>  at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
>  at 
> okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
>  at 
> okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
>  at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
>  at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
>  at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
>  at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
> Source)
>  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
> Source)
>  at java.base/java.lang.Thread.run(Unknown Source)
> {code}
> The error above appears after roughly 50 minutes.
> After the exception above, no more logs are produced and the app hangs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35881) [SQL] AQE does not support columnar execution for the final query stage

2021-06-29 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated SPARK-35881:
---
Affects Version/s: 3.0.3
   3.1.2

> [SQL] AQE does not support columnar execution for the final query stage
> ---
>
> Key: SPARK-35881
> URL: https://issues.apache.org/jira/browse/SPARK-35881
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Andy Grove
>Priority: Major
>
> In AdaptiveSparkPlanExec, a query is broken down into stages and these stages 
> are executed until the entire query has been executed. These stages can be 
> row-based or columnar. However, the final stage, produced by the private 
> getFinalPhysicalPlan method is always assumed to be row-based. The only way 
> to execute the final stage is by calling the various doExecute methods on 
> AdaptiveSparkPlanExec, and doExecuteColumnar is not implemented. The 
> supportsColumnar method also always returns false.
> In the RAPIDS Accelerator for Apache Spark, we currently call the private 
> getFinalPhysicalPlan method using reflection and then determine if that plan 
> is columnar or not, and then call the appropriate doExecute method, bypassing 
> the doExecute methods on AdaptiveSparkPlanExec. We would like a supported 
> mechanism for executing a columnar AQE plan so that we do not need to use 
> reflection.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35924) Add Java 17 ea build test to GitHub action

2021-06-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35924.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33126
[https://github.com/apache/spark/pull/33126]

> Add Java 17 ea build test to GitHub action
> --
>
> Key: SPARK-35924
> URL: https://issues.apache.org/jira/browse/SPARK-35924
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.2.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35924) Add Java 17 ea build test to GitHub action

2021-06-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-35924:
-

Assignee: William Hyun

> Add Java 17 ea build test to GitHub action
> --
>
> Key: SPARK-35924
> URL: https://issues.apache.org/jira/browse/SPARK-35924
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.2.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35938) Add deprecation warning for Python 3.6

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371562#comment-17371562
 ] 

Apache Spark commented on SPARK-35938:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/33139

> Add deprecation warning for Python 3.6
> --
>
> Key: SPARK-35938
> URL: https://issues.apache.org/jira/browse/SPARK-35938
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Add deprecation warning for Python 3.6.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35938) Add deprecation warning for Python 3.6

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35938:


Assignee: (was: Apache Spark)

> Add deprecation warning for Python 3.6
> --
>
> Key: SPARK-35938
> URL: https://issues.apache.org/jira/browse/SPARK-35938
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Add deprecation warning for Python 3.6.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35938) Add deprecation warning for Python 3.6

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35938:


Assignee: Apache Spark

> Add deprecation warning for Python 3.6
> --
>
> Key: SPARK-35938
> URL: https://issues.apache.org/jira/browse/SPARK-35938
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Add deprecation warning for Python 3.6.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35938) Add deprecation warning for Python 3.6

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371561#comment-17371561
 ] 

Apache Spark commented on SPARK-35938:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/33139

> Add deprecation warning for Python 3.6
> --
>
> Key: SPARK-35938
> URL: https://issues.apache.org/jira/browse/SPARK-35938
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Add deprecation warning for Python 3.6.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35939) Deprecate Python 3.6 in Spark documentation

2021-06-29 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-35939:


 Summary: Deprecate Python 3.6 in Spark documentation
 Key: SPARK-35939
 URL: https://issues.apache.org/jira/browse/SPARK-35939
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 3.2.0
Reporter: Xinrong Meng


Deprecate Python 3.6 in Spark documentation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35938) Add deprecation warning for Python 3.6

2021-06-29 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-35938:


 Summary: Add deprecation warning for Python 3.6
 Key: SPARK-35938
 URL: https://issues.apache.org/jira/browse/SPARK-35938
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Xinrong Meng


Add deprecation warning for Python 3.6.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35936) Deprecate Python 3.6 support

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371559#comment-17371559
 ] 

Apache Spark commented on SPARK-35936:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/33139

> Deprecate Python 3.6 support
> 
>
> Key: SPARK-35936
> URL: https://issues.apache.org/jira/browse/SPARK-35936
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> According to [https://endoflife.date/python], Python 3.6 will be EOL on 23 
> Dec, 2021.
> We should prepare for the deprecation of Python 3.6 support in Spark in 
> advance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35936) Deprecate Python 3.6 support

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371558#comment-17371558
 ] 

Apache Spark commented on SPARK-35936:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/33139

> Deprecate Python 3.6 support
> 
>
> Key: SPARK-35936
> URL: https://issues.apache.org/jira/browse/SPARK-35936
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> According to [https://endoflife.date/python], Python 3.6 will be EOL on 23 
> Dec, 2021.
> We should prepare for the deprecation of Python 3.6 support in Spark in 
> advance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35936) Deprecate Python 3.6 support

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35936:


Assignee: Apache Spark

> Deprecate Python 3.6 support
> 
>
> Key: SPARK-35936
> URL: https://issues.apache.org/jira/browse/SPARK-35936
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> According to [https://endoflife.date/python], Python 3.6 will be EOL on 23 
> Dec, 2021.
> We should prepare for the deprecation of Python 3.6 support in Spark in 
> advance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35936) Deprecate Python 3.6 support

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35936:


Assignee: (was: Apache Spark)

> Deprecate Python 3.6 support
> 
>
> Key: SPARK-35936
> URL: https://issues.apache.org/jira/browse/SPARK-35936
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> According to [https://endoflife.date/python], Python 3.6 will be EOL on 23 
> Dec, 2021.
> We should prepare for the deprecation of Python 3.6 support in Spark in 
> advance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35906) Remove order by if the maximum number of rows less than or equal to 1

2021-06-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-35906:
-

Assignee: Yuming Wang

> Remove order by if the maximum number of rows less than or equal to 1
> -
>
> Key: SPARK-35906
> URL: https://issues.apache.org/jira/browse/SPARK-35906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> {code:scala}
> spark.sql("select count(*) from range(1, 10, 2, 2) order by 1 limit 
> 10").explain("cost")
> {code}
> Current optimized logical plan:
> {noformat}
> == Optimized Logical Plan ==
> Sort [count(1)#2L ASC NULLS FIRST], true, Statistics(sizeInBytes=16.0 B)
> +- Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, 
> rowCount=1)
>+- Project, Statistics(sizeInBytes=20.0 B)
>   +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 
> B, rowCount=5)
> {noformat}
> Expected optimized logical plan:
> {noformat}
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, 
> rowCount=1)
> +- Project, Statistics(sizeInBytes=20.0 B)
>+- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, 
> rowCount=5)
> {noformat}
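
A minimal sketch of what such an optimizer rule could look like, assuming the usual Catalyst Rule[LogicalPlan] shape; the rule name and its placement in the optimizer batches are illustrative only and not necessarily what the linked pull request does.

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// Drop a Sort whose child can produce at most one row: ordering a single row
// (or an empty result) is a no-op, so the Sort node adds nothing to the plan.
object RemoveSortOnSingleRowPlans extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Sort(_, _, child) if child.maxRows.exists(_ <= 1) => child
  }
}
{code}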



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35906) Remove order by if the maximum number of rows less than or equal to 1

2021-06-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35906.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33100
[https://github.com/apache/spark/pull/33100]

> Remove order by if the maximum number of rows less than or equal to 1
> -
>
> Key: SPARK-35906
> URL: https://issues.apache.org/jira/browse/SPARK-35906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> {code:scala}
> spark.sql("select count(*) from range(1, 10, 2, 2) order by 1 limit 
> 10").explain("cost")
> {code}
> Current optimized logical plan:
> {noformat}
> == Optimized Logical Plan ==
> Sort [count(1)#2L ASC NULLS FIRST], true, Statistics(sizeInBytes=16.0 B)
> +- Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, 
> rowCount=1)
>+- Project, Statistics(sizeInBytes=20.0 B)
>   +- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 
> B, rowCount=5)
> {noformat}
> Expected optimized logical plan:
> {noformat}
> == Optimized Logical Plan ==
> Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, 
> rowCount=1)
> +- Project, Statistics(sizeInBytes=20.0 B)
>+- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, 
> rowCount=5)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-35803) Spark SQL does not support creating views using DataSource v2 based data sources

2021-06-29 Thread David Rabinowitz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371548#comment-17371548
 ] 

David Rabinowitz edited comment on SPARK-35803 at 6/29/21, 5:58 PM:


Using regular spark-shell:
{code:java}
spark-shell --packages 
com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.21.1
scala> val df1 = 
spark.read.format("bigquery").load("bigquery-public-data.samples.shakespeare")
df1: org.apache.spark.sql.DataFrame = [word: string, word_count: bigint ... 2 
more fields]

scala> df1.count
res0: Long = 164656 

scala> val df2 = 
spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").load("bigquery-public-data.samples.shakespeare")
df2: org.apache.spark.sql.DataFrame = [word: string, word_count: bigint ... 2 
more fields]

scala> df2.count
res1: Long = 164656
{code}
Using spark-sql:


{code:java}
spark-sql --packages 
com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.21.1
spark-sql> CREATE or REPLACE GLOBAL TEMPORARY VIEW s1 USING bigquery options 
(table 'bigquery-public-data.samples.shakespeare');
Time taken: 2.143 seconds
spark-sql> select count(*) from global_temp.s1;
21/06/29 17:51:34 INFO 
com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation: Querying table 
bigquery-public-data.samples.shakespeare, 
parameters sent from Spark: requiredColumns=[], filters=[]
21/06/29 17:51:34 INFO 
com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation: Going to read 
from bigquery-public-data.samples.shakespeare 
columns=[], filter=''
21/06/29 17:51:34 INFO 
com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation: Used optimized 
BQ count(*) path. Count: 164656
164656
Time taken: 3.767 seconds, Fetched 1 row(s)
spark-sql> CREATE or REPLACE GLOBAL TEMPORARY VIEW s2 USING 
com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2 options (table 
'bigquery-public-data.samples.shakespeare');
Error in query: com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2 is not 
a valid Spark SQL Data Source.;
{code}

Both runs used Spark 2.4.8 with Scala 2.12 (Dataproc image 1.5). The same code 
path exists in Spark 3 as well. The code for the connector is at 
https://github.com/GoogleCloudDataproc/spark-bigquery-connector

 


was (Author: davidrabinowitz):
Using regular spark-shell:
{code:java}
spark-shell --packages 
com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.21.1
scala> val df1 = 
spark.read.format("bigquery").load("bigquery-public-data.samples.shakespeare")
df1: org.apache.spark.sql.DataFrame = [word: string, word_count: bigint ... 2 
more fields]

scala> df1.count
res0: Long = 164656 

scala> val df2 = 
spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").load("bigquery-public-data.samples.shakespeare")
df2: org.apache.spark.sql.DataFrame = [word: string, word_count: bigint ... 2 
more fields]

scala> df2.count
res1: Long = 164656
{code}
Using spark-sql:


{code:java}
spark-sql --packages 
com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.21.1
spark-sql> CREATE or REPLACE GLOBAL TEMPORARY VIEW s1 USING bigquery options 
(table 'bigquery-public-data.samples.shakespeare');
Time taken: 2.143 seconds
spark-sql> select count(*) from global_temp.s1;
21/06/29 17:51:34 INFO 
com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation: Querying table 
bigquery-public-data.samples.shakespeare, param
eters sent from Spark: requiredColumns=[], filters=[]
21/06/29 17:51:34 INFO 
com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation: Going to read 
from bigquery-public-data.samples.shakespeare co
lumns=[], filter=''
21/06/29 17:51:34 INFO 
com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation: Used optimized 
BQ count(*) path. Count: 164656
164656
Time taken: 3.767 seconds, Fetched 1 row(s)
spark-sql> CREATE or REPLACE GLOBAL TEMPORARY VIEW s2 USING 
com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2 options (table 
'bigquery-public-
data.samples.shakespeare');
Error in query: com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2 is not 
a valid Spark SQL Data Source.;
{code}

Both runs used Spark 2.4.8 with Scala 2.12 (Dataproc image 1.5). The same code 
path exists in Spark 3 as well. The code for the connector is at 
https://github.com/GoogleCloudDataproc/spark-bigquery-connector

 

> Spark SQL does not support creating views using DataSource v2 based data 
> sources
> 
>
> Key: SPARK-35803
> URL: https://issues.apache.org/jira/browse/SPARK-35803
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.8, 3.1.2
>Reporter: David Rabinowitz
>

[jira] [Commented] (SPARK-35803) Spark SQL does not support creating views using DataSource v2 based data sources

2021-06-29 Thread David Rabinowitz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371548#comment-17371548
 ] 

David Rabinowitz commented on SPARK-35803:
--

Using regular spark-shell:
{code:java}
spark-shell --packages 
com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.21.1
scala> val df1 = 
spark.read.format("bigquery").load("bigquery-public-data.samples.shakespeare")
df1: org.apache.spark.sql.DataFrame = [word: string, word_count: bigint ... 2 
more fields]

scala> df1.count
res0: Long = 164656 

scala> val df2 = 
spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").load("bigquery-public-data.samples.shakespeare")
df2: org.apache.spark.sql.DataFrame = [word: string, word_count: bigint ... 2 
more fields]

scala> df2.count
res1: Long = 164656
{code}
Using spark-sql:


{code:java}
spark-sql --packages 
com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.21.1
spark-sql> CREATE or REPLACE GLOBAL TEMPORARY VIEW s1 USING bigquery options 
(table 'bigquery-public-data.samples.shakespeare');
Time taken: 2.143 seconds
spark-sql> select count(*) from global_temp.s1;
21/06/29 17:51:34 INFO 
com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation: Querying table 
bigquery-public-data.samples.shakespeare, 
parameters sent from Spark: requiredColumns=[], filters=[]
21/06/29 17:51:34 INFO 
com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation: Going to read 
from bigquery-public-data.samples.shakespeare 
columns=[], filter=''
21/06/29 17:51:34 INFO 
com.google.cloud.spark.bigquery.direct.DirectBigQueryRelation: Used optimized 
BQ count(*) path. Count: 164656
164656
Time taken: 3.767 seconds, Fetched 1 row(s)
spark-sql> CREATE or REPLACE GLOBAL TEMPORARY VIEW s2 USING 
com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2 options (table 
'bigquery-public-
data.samples.shakespeare');
Error in query: com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2 is not 
a valid Spark SQL Data Source.;
{code}

Both runs used Spark 2.4.8 with Scala 2.12 (Dataproc image 1.5). The same code 
path exists in Spark 3 as well. The code for the connector is at 
https://github.com/GoogleCloudDataproc/spark-bigquery-connector

 

> Spark SQL does not support creating views using DataSource v2 based data 
> sources
> 
>
> Key: SPARK-35803
> URL: https://issues.apache.org/jira/browse/SPARK-35803
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.8, 3.1.2
>Reporter: David Rabinowitz
>Priority: Major
>
> When a temporary view is created in Spark SQL using an external data source, 
> Spark then tries to create the relevant relation using the 
> DataSource.resolveRelation() method. Unlike DataFrameReader.load(), 
> resolveRelation() does not check if the provided DataSource implements the 
> DataSourceV2 interface and instead tries to use the RelationProvider trait in 
> order to generate the Relation.
> Furthermore, DataSourceV2Relation is not a subclass of BaseRelation, so it 
> cannot be used in resolveRelation().
> Lastly, I tried to implement the RelationProvider trait in my Java 
> implementation of DataSourceV2, but the match inside resolveRelation() did 
> not detect it as RelationProvider.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35464) pandas APIs on Spark: Enable mypy check "disallow_untyped_defs" for main codes.

2021-06-29 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-35464.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

> pandas APIs on Spark: Enable mypy check "disallow_untyped_defs" for main 
> codes.
> ---
>
> Key: SPARK-35464
> URL: https://issues.apache.org/jira/browse/SPARK-35464
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, many functions in the main code are still missing type annotations, 
> and the {{mypy}} check "disallow_untyped_defs" is disabled for them.
> We should add more type annotations and enable the {{mypy}} check.
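
For reference, a minimal sketch of what enabling the check could look like in a mypy configuration file; the file name, section names and module pattern below are assumptions for illustration, not the actual layout of the Spark repository's mypy setup.

{code}
# mypy.ini (hypothetical location): enable the check globally, or per module
# while the remaining untyped functions are still being annotated.
[mypy]
disallow_untyped_defs = True

[mypy-pyspark.pandas.*]
disallow_untyped_defs = True
{code}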



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35859) Cleanup type hints.

2021-06-29 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-35859.
---
Fix Version/s: 3.2.0
 Assignee: Takuya Ueshin
   Resolution: Fixed

Issue resolved by pull request 33117
https://github.com/apache/spark/pull/33117

> Cleanup type hints.
> ---
>
> Key: SPARK-35859
> URL: https://issues.apache.org/jira/browse/SPARK-35859
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.2.0
>
>
> - Consolidate the declaration of type vars, type aliases, etc.
> - Rename type vars, like {{T_Frame}} and {{T_IndexOps}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32530) SPIP: Kotlin support for Apache Spark

2021-06-29 Thread Jolan Rensen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371543#comment-17371543
 ] 

Jolan Rensen commented on SPARK-32530:
--

I've been helping with JetBrains' [Kotlin Spark 
API|https://github.com/JetBrains/kotlin-spark-api] for the last couple of 
months. The API is actually helping me a lot with the data analysis for my 
thesis. Personally, I'm not very familiar with Scala, and using Java with Spark 
introduces way too much boilerplate code, in my honest opinion. Kotlin brings a 
straightforward and readable flavor to Spark, which I really like. What I like 
most is the support for Kotlin's data classes, since I mostly use typed 
Datasets, and in Java, having to create an entire class with two constructors, 
getters, setters, and hashcode/equals functions takes too much time.

While most of the Kotlin wrapper functions in the API can be created using 
extension functions relatively easily, a couple of functions prove to be 
difficult to use from Kotlin due to how the Apache Spark API is built. These 
are, for instance, the functions in the Dataset class for which both Java- and 
Scala-specific variants exist.
 Kotlin has [SAM 
conversions|https://kotlinlang.org/docs/java-interop.html#sam-conversions], 
where interfaces that have only one function, like ReduceFunction but also 
scala.Function2, can be instantiated using a lambda function. What we thus 
expect to be able to do is: "myDataset.reduce \{ a, b -> a + b }". However, 
both the Scala and the Java variant of the reduce function match such a 
statement (and any other statement featuring a similar pair of overloads), so 
the compilation fails.
 This problem also cannot be solved using an extension function in the Kotlin 
Spark API since functions inside a class always have priority over extension 
functions. The only solution currently on our side is to rename the extension 
function to something like "reduceK," but this, of course, is suboptimal.
 A possible solution from your side could be to hide one of the two functions 
from Kotlin using a [Kotlin-specific Deprecation 
annotation|https://kotlinlang.org/api/latest/jvm/stdlib/kotlin/-deprecated/] at 
the HIDDEN level. This would resolve the overload resolution ambiguity. Of 
course, I understand if you don't want to include the Kotlin standard library 
in Spark just for this, but I'm sure there are more solutions like it. It might 
even be possible to solve this neatly on our side, so let us know if there's 
something we haven't tried yet!

If Kotlin support is properly added to Spark, this issue should also be taken 
care of :)

> SPIP: Kotlin support for Apache Spark
> -
>
> Key: SPARK-32530
> URL: https://issues.apache.org/jira/browse/SPARK-32530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Pasha Finkeshteyn
>Priority: Major
>
> h2. Background and motivation
> Kotlin is a cross-platform, statically typed, general-purpose JVM language. 
> In the last year more than 5 million developers have used Kotlin in mobile, 
> backend, frontend and scientific development. The number of Kotlin developers 
> grows rapidly every year. 
>  * [According to 
> redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: 
> "Kotlin, the second fastest growing language we’ve seen outside of Swift, 
> made a big splash a year ago at this time when it vaulted eight full spots up 
> the list."
>  * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], 
> Kotlin is the second most popular language on the JVM
>  * [According to 
> StackOverflow|https://insights.stackoverflow.com/survey/2020] Kotlin’s share 
> increased by 7.8% in 2020.
> We notice the increasing usage of Kotlin in data analysis ([6% of users in 
> 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to 
> 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in 
> 2019), and we expect these numbers to continue to grow. 
> We, authors of this SPIP, strongly believe that making Kotlin API officially 
> available to developers can bring new users to Apache Spark and help some of 
> the existing users.
> h2. Goals
> The goal of this project is to bring first-class support for Kotlin language 
> into the Apache Spark project. We’re going to achieve this by adding one more 
> module to the current Apache Spark distribution.
> h2. Non-goals
> There is no goal to replace any existing language support or to change any 
> existing Apache Spark API.
> At this time, there is no goal to support non-core APIs of Apache Spark like 
> Spark ML and Spark structured streaming. This may change in the future based 
> on community feedback.
> There is no goal to provide CLI for Kotlin for Apache Spark, this will be a 

[jira] [Updated] (SPARK-35936) Deprecate Python 3.6 support

2021-06-29 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-35936:
--
Description: 
According to [https://endoflife.date/python], Python 3.6 will be EOL on 23 Dec, 
2021.

We should prepare for the deprecation of Python 3.6 support in Spark in advance.

  was:
According to [https://endoflife.date/python], Python 3.6 will be deprecated on 
23 Dec, 2021.

We should prepare for the deprecation of Python 3.6 support in Spark in advance.


> Deprecate Python 3.6 support
> 
>
> Key: SPARK-35936
> URL: https://issues.apache.org/jira/browse/SPARK-35936
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> According to [https://endoflife.date/python], Python 3.6 will be EOL on 23 
> Dec, 2021.
> We should prepare for the deprecation of Python 3.6 support in Spark in 
> advance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35936) Deprecate Python 3.6 support

2021-06-29 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-35936:
-
Affects Version/s: (was: 3.1.3)

> Deprecate Python 3.6 support
> 
>
> Key: SPARK-35936
> URL: https://issues.apache.org/jira/browse/SPARK-35936
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>
> According to [https://endoflife.date/python], Python 3.6 will be deprecated 
> on 23 Dec, 2021.
> We should prepare for the deprecation of Python 3.6 support in Spark in 
> advance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35928) Upgrade ASM to 9.1

2021-06-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-35928:
-

Assignee: Dongjoon Hyun

> Upgrade ASM to 9.1
> --
>
> Key: SPARK-35928
> URL: https://issues.apache.org/jira/browse/SPARK-35928
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35928) Upgrade ASM to 9.1

2021-06-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35928.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33130
[https://github.com/apache/spark/pull/33130]

> Upgrade ASM to 9.1
> --
>
> Key: SPARK-35928
> URL: https://issues.apache.org/jira/browse/SPARK-35928
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35937) Extracting date field from timestamp should work in ANSI mode

2021-06-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371534#comment-17371534
 ] 

Apache Spark commented on SPARK-35937:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/33138

> Extracting date field from timestamp should work in ANSI mode
> -
>
> Key: SPARK-35937
> URL: https://issues.apache.org/jira/browse/SPARK-35937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
>  Add a new ANSI type coercion rule: when getting a date field from a 
> Timestamp column, cast the column as Date type.
>  This is Spark's hack to make the implementation simple. In the default type 
> coercion rules, the implicit cast rule does the work. However, The ANSI 
> implicit cast rule doesn't allow converting Timestamp type as Date type, so 
> we need to have this additional rule
>  to make sure the date field extraction from Timestamp columns works.
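
As a quick illustration of the behavior this rule enables, a minimal spark-shell sketch follows; dayofmonth is used here as one example of a date-field function, and the timestamp literal is arbitrary.

{code:scala}
// With ANSI mode enabled, getting a date field such as the day of month from a
// timestamp should work, with the timestamp implicitly cast to a date first.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT dayofmonth(TIMESTAMP '2021-06-29 10:00:00')").show()
{code}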



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35937) Extracting date field from timestamp should work in ANSI mode

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35937:


Assignee: Gengliang Wang  (was: Apache Spark)

> Extracting date field from timestamp should work in ANSI mode
> -
>
> Key: SPARK-35937
> URL: https://issues.apache.org/jira/browse/SPARK-35937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
>  Add a new ANSI type coercion rule: when getting a date field from a 
> Timestamp column, cast the column as Date type.
>  This is Spark's hack to make the implementation simple. In the default type 
> coercion rules, the implicit cast rule does the work. However, The ANSI 
> implicit cast rule doesn't allow converting Timestamp type as Date type, so 
> we need to have this additional rule
>  to make sure the date field extraction from Timestamp columns works.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35937) Extracting date field from timestamp should work in ANSI mode

2021-06-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35937:


Assignee: Apache Spark  (was: Gengliang Wang)

> Extracting date field from timestamp should work in ANSI mode
> -
>
> Key: SPARK-35937
> URL: https://issues.apache.org/jira/browse/SPARK-35937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
>  Add a new ANSI type coercion rule: when getting a date field from a 
> Timestamp column, cast the column as Date type.
>  This is Spark's hack to make the implementation simple. In the default type 
> coercion rules, the implicit cast rule does the work. However, The ANSI 
> implicit cast rule doesn't allow converting Timestamp type as Date type, so 
> we need to have this additional rule
>  to make sure the date field extraction from Timestamp columns works.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


